All Templates
- class greykite.framework.templates.forecaster.Forecaster(model_template_enum: Type[Enum] = <enum 'ModelTemplateEnum'>, default_model_template_name: str = 'AUTO')[source]
The main entry point to create a forecast.
Call the run_forecast_config method to create a forecast. It takes a dataset and forecast configuration parameters.
Notes
This class can create forecasts using any of the model templates in ModelTemplateEnum. Model templates provide suitable default values for the available forecast estimators depending on the data characteristics.
The model template is selected via the config.model_template parameter to run_forecast_config.
To add your own custom algorithms or template classes in our framework, pass model_template_enum and default_model_template_name to the constructor.
- model_template_enum: Type[Enum]
The available template names. An Enum class where names are template names, and values are of type ModelTemplate.
- default_model_template_name: str
The default template name if not provided by config.model_template. Should be a name in model_template_enum or “auto”. Used by __get_template_class.
- template_class: Optional[Type[TemplateInterface]]
Template class used. Must implement TemplateInterface and be one of the classes in self.model_template_enum. Available for debugging purposes. Set by run_forecast_config.
- template: Optional[TemplateInterface]
Instance of template_class used to run the forecast. See the docstring of the specific template class used. Available for debugging purposes. Set by run_forecast_config.
- config: Optional[ForecastConfig]
ForecastConfig passed to the template class. Set by run_forecast_config.
- pipeline_params: Optional[Dict]
Parameters used to call forecast_pipeline. Available for debugging purposes. Set by run_forecast_config.
- forecast_result: Optional[ForecastResult]
The forecast result, returned by forecast_pipeline. Set by run_forecast_config.
- apply_forecast_config(df: DataFrame, config: Optional[ForecastConfig] = None) Dict [source]
Fetches pipeline parameters from the df and config, but does not run the pipeline to generate a forecast. run_forecast_config calls this function and also runs the forecast pipeline.
Available for debugging purposes to check pipeline parameters before running a forecast. Sets these attributes for debugging:
pipeline_params: the parameters passed to forecast_pipeline.
template_class, template: the template class used to generate the pipeline parameters.
config: the ForecastConfig passed as input to template class, to translate into pipeline parameters.
Provides basic validation on the compatibility of config.model_template with config.model_components_param.
- Parameters
df (pandas.DataFrame) – Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
config (ForecastConfig or None) – Config object for template class to use. See ForecastConfig.
- Returns
pipeline_params – Input to forecast_pipeline.
- Return type
dict [str, any]
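As a rough illustration of the debugging workflow described above, the sketch below resolves the pipeline parameters without fitting a model and reports which template class was chosen. The helper name summarize_pipeline_params is hypothetical; it relies only on the apply_forecast_config method and the template_class attribute documented here, and it takes the forecaster as an argument so it can be exercised without greykite installed.

```python
from typing import Any, Dict, List, Tuple


def summarize_pipeline_params(
    forecaster: Any, df: Any, config: Any = None
) -> Tuple[str, List[str]]:
    """Resolve pipeline parameters without running the forecast.

    `forecaster` is expected to behave like a greykite Forecaster; only the
    documented `apply_forecast_config` method and `template_class` attribute
    are used. This helper is illustrative and not part of greykite.
    """
    params: Dict[str, Any] = forecaster.apply_forecast_config(df=df, config=config)
    # apply_forecast_config sets template_class as a side effect.
    template_name = forecaster.template_class.__name__
    # Return the chosen template and the parameter names destined for forecast_pipeline.
    return template_name, sorted(params)
```

With greykite installed, this would be called as summarize_pipeline_params(Forecaster(), df, config) before committing to a full run_forecast_config call.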
- run_forecast_config(df: DataFrame, config: Optional[ForecastConfig] = None) ForecastResult [source]
Creates a forecast from input data and config. The result is also stored as self.forecast_result.
- Parameters
df (pandas.DataFrame) – Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
config (ForecastConfig) – Config object for template class to use. See ForecastConfig.
- Returns
forecast_result – Forecast result, an object of type ForecastResult. The output of forecast_pipeline, according to the df and config configuration parameters.
- Return type
ForecastResult
- run_forecast_json(df: DataFrame, json_str: str = '{}') ForecastResult [source]
Calls forecast_pipeline according to the json_str configuration parameters.
- Parameters
df (pandas.DataFrame) – Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
json_str (str) – JSON string of the config object for Forecast to use. See ForecastConfig.
- Returns
forecast_result – Forecast result. The output of forecast_pipeline, called using the template class with specified configuration. See ForecastResult for details.
- Return type
ForecastResult
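The JSON configuration can be assembled with the standard library. The sketch below assumes that the JSON keys mirror the ForecastConfig and MetadataParam field names listed in this reference; the column names and values are illustrative.

```python
import json

# Keys are assumed to mirror the ForecastConfig / MetadataParam field names.
config_dict = {
    "model_template": "SILVERKITE",
    "forecast_horizon": 7,
    "coverage": 0.9,
    "metadata_param": {"time_col": "ts", "value_col": "y", "freq": "D"},
}

# run_forecast_json expects a string, not a dict.
json_str = json.dumps(config_dict)
print(json_str)
```

With greykite installed, the string would then be passed as Forecaster().run_forecast_json(df=df, json_str=json_str).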
- dump_forecast_result(destination_dir, object_name='object', dump_design_info=True, overwrite_exist_dir=False)[source]
Dumps self.forecast_result to local pickle files.
- Parameters
destination_dir (str) – The pickle destination directory.
object_name (str) – The stored file name.
dump_design_info (bool, default True) – Whether to dump design info. Design info is a patsy class that includes the design matrix information. It takes longer to dump design info.
overwrite_exist_dir (bool, default False) – What to do when destination_dir already exists. If True, the existing directory is removed before writing.
- Return type
This function writes to local files and does not return anything.
- load_forecast_result(source_dir, load_design_info=True)[source]
Loads self.forecast_result from local files created by self.dump_forecast_result.
- Parameters
source_dir (str) – The source file directory.
load_design_info (bool, default True) – Whether to load design info. Design info is a patsy class that includes the design matrix information. It takes longer to load design info.
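The two methods above can be combined into a save-and-restore round trip. This is a sketch under the assumption that the directory passed as destination_dir when dumping is the same one passed as source_dir when loading; the helper name round_trip_forecast_result is hypothetical, and the forecasters are passed in so the function itself does not import greykite.

```python
from typing import Any


def round_trip_forecast_result(
    fitted: Any, fresh: Any, directory: str, name: str = "forecast"
) -> Any:
    """Dump the result held by `fitted`, then load it into `fresh`.

    Both arguments are expected to behave like greykite Forecaster objects.
    Assumes the dump directory doubles as the load directory.
    """
    fitted.dump_forecast_result(
        destination_dir=directory,
        object_name=name,
        dump_design_info=False,  # skip patsy design info for a faster dump
        overwrite_exist_dir=True,  # replace any earlier dump in this directory
    )
    fresh.load_forecast_result(source_dir=directory, load_design_info=False)
    # load_forecast_result populates forecast_result rather than returning it.
    return fresh.forecast_result
```

Setting dump_design_info and load_design_info to False trades the design matrix information for speed, as noted in the parameter descriptions.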
- class greykite.framework.templates.model_templates.ModelTemplate(template_class: Type[TemplateInterface], description: str)[source]
A model template consists of a template class, a description, and a name.
This class holds the template class and description. The model template name is the member name in greykite.framework.templates.model_templates.ModelTemplateEnum.
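The relationship between names and values can be sketched with the standard library alone. ModelTemplateStandIn and MyTemplate below are hypothetical stand-ins for greykite's ModelTemplate and for a class implementing TemplateInterface; they only mirror the two documented fields, template_class and description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Type


class MyTemplate:
    """Hypothetical template class; in greykite it would implement TemplateInterface."""


@dataclass(frozen=True)
class ModelTemplateStandIn:
    """Stand-in mirroring the two documented ModelTemplate fields."""

    template_class: Type
    description: str


class MyModelTemplateEnum(Enum):
    # The member name is the model template name; the value carries class and description.
    MY_BASELINE = ModelTemplateStandIn(
        template_class=MyTemplate,
        description="Hypothetical baseline template.",
    )
    MY_TUNED = ModelTemplateStandIn(
        template_class=MyTemplate,  # a template class may be shared across members
        description="Hypothetical tuned variant of the baseline.",
    )


print([member.name for member in MyModelTemplateEnum])  # ['MY_BASELINE', 'MY_TUNED']
```

With greykite installed, an enum built from the real ModelTemplate would be passed as Forecaster(model_template_enum=MyModelTemplateEnum, default_model_template_name="MY_BASELINE"). Note that Python's Enum treats members with equal values as aliases, so two members sharing both template class and description would collapse into one.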
- class greykite.framework.templates.model_templates.ModelTemplateEnum(value)[source]
Available model templates.
Enumerates the possible values for the model_template attribute of ForecastConfig.
The value has type ModelTemplate, which contains the template class that recognizes the model_template (template classes implement the TemplateInterface interface) and a plain-text description of what the model_template is for.
The description should be unique across enum members. The template class can be shared, because a template class can recognize multiple model templates. For example, the same template class may use different default values for ForecastConfig.model_components_param depending on ForecastConfig.model_template.
Notes
The template classes SilverkiteTemplate and ProphetTemplate recognize only the model templates explicitly enumerated here.
However, the SimpleSilverkiteTemplate template class allows additional model templates to be specified generically. Any object of type SimpleSilverkiteTemplateOptions can be used as the model_template. These generic model templates are valid but not enumerated here.
- SILVERKITE = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model with automatic growth, seasonality, holidays, automatic autoregression, normalization and interactions. Best for hourly and daily frequencies.Uses `SimpleSilverkiteEstimator`.')
Silverkite model with automatic growth, seasonality, holidays, automatic autoregression, normalization and interactions. Best for hourly and daily frequencies. Uses SimpleSilverkiteEstimator.
- SILVERKITE_DAILY_1_CONFIG_1 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Config 1 in template ``SILVERKITE_DAILY_1``. Compared to ``SILVERKITE``, it uses parameters specifically tuned for daily data and 1-day forecast.')
Config 1 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it uses parameters specifically tuned for daily data and 1-day forecast.
- SILVERKITE_DAILY_1_CONFIG_2 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Config 2 in template ``SILVERKITE_DAILY_1``. Compared to ``SILVERKITE``, it uses parameters specifically tuned for daily data and 1-day forecast.')
Config 2 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it uses parameters specifically tuned for daily data and 1-day forecast.
- SILVERKITE_DAILY_1_CONFIG_3 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Config 3 in template ``SILVERKITE_DAILY_1``. Compared to ``SILVERKITE``, it uses parameters specifically tuned for daily data and 1-day forecast.')
Config 3 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it uses parameters specifically tuned for daily data and 1-day forecast.
- SILVERKITE_DAILY_1 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for daily data and 1-day forecast. Contains 3 candidate configs for grid search, optimized the seasonality and changepoint parameters.')
Silverkite model specifically tuned for daily data and 1-day forecast. Contains 3 candidate configs for grid search, optimized the seasonality and changepoint parameters.
- SILVERKITE_DAILY_90 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for daily data with 90 days forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for daily data with 90 days forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_WEEKLY = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for weekly data. Contains 4 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for weekly data. Contains 4 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_MONTHLY = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for monthly data. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for monthly data. Uses SimpleSilverkiteEstimator.
- SILVERKITE_HOURLY_1 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for hourly data with 1 hour forecast horizon. Contains 3 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for hourly data with 1 hour forecast horizon. Contains 3 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_HOURLY_24 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for hourly data with 24 hours (1 day) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for hourly data with 24 hours (1 day) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_HOURLY_168 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for hourly data with 168 hours (1 week) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for hourly data with 168 hours (1 week) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_HOURLY_336 = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model specifically tuned for hourly data with 336 hours (2 weeks) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses `SimpleSilverkiteEstimator`.')
Silverkite model specifically tuned for hourly data with 336 hours (2 weeks) forecast horizon. Contains 4 hyperparameter combinations for grid search. Uses SimpleSilverkiteEstimator.
- SILVERKITE_EMPTY = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Silverkite model with no component included by default. Fits only a constant intercept. Select and customize this template to add only the terms you want. Uses `SimpleSilverkiteEstimator`.')
Silverkite model with no component included by default. Fits only a constant intercept. Select and customize this template to add only the terms you want. Uses SimpleSilverkiteEstimator.
- SK = ModelTemplate(template_class=<class 'greykite.framework.templates.silverkite_template.SilverkiteTemplate'>, description='Silverkite model with low-level interface. For flexible model tuning if SILVERKITE template is not flexible enough. Not for use out-of-the-box: customization is needed for good performance. Uses `SilverkiteEstimator`.')
Silverkite model with low-level interface. For flexible model tuning if SILVERKITE template is not flexible enough. Not for use out-of-the-box: customization is needed for good performance. Uses SilverkiteEstimator.
- PROPHET = ModelTemplate(template_class=<class 'greykite.framework.templates.prophet_template.ProphetTemplate'>, description='Prophet model with growth, seasonality, holidays, additional regressors and prediction intervals. Uses `ProphetEstimator`.')
Prophet model with growth, seasonality, holidays, additional regressors and prediction intervals. Uses ProphetEstimator.
- AUTO_ARIMA = ModelTemplate(template_class=<class 'greykite.framework.templates.auto_arima_template.AutoArimaTemplate'>, description='Auto ARIMA model with fit and prediction intervals. Uses `AutoArimaEstimator`.')
ARIMA model with automatic order selection. Uses AutoArimaEstimator.
- SILVERKITE_TWO_STAGE = ModelTemplate(template_class=<class 'greykite.framework.templates.multistage_forecast_template.MultistageForecastTemplate'>, description="MultistageForecastTemplate's default model template. A two-stage model. The first step takes a longer history and learns the long-term effects, while the second step takes a shorter history and learns the short-term residuals.")
Multistage forecast model’s default model template. A two-stage model. The first step takes a longer history and learns the long-term effects, while the second step takes a shorter history and learns the short-term residuals.
- MULTISTAGE_EMPTY = ModelTemplate(template_class=<class 'greykite.framework.templates.multistage_forecast_template.MultistageForecastTemplate'>, description='Empty configuration for Multistage Forecast. All parameters will be exactly what user inputs. Not to be used without overriding.')
Empty configuration for Multistage Forecast. All parameters will be exactly what user inputs. Not to be used without overriding.
- AUTO = ModelTemplate(template_class=<class 'greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate'>, description='Automatically selects the SimpleSilverkite model template that corresponds to the forecast problem. Selection is based on data frequency, forecast horizon, and CV configuration.')
Automatically selects the SimpleSilverkite model template that corresponds to the forecast problem. Selection is based on data frequency, forecast horizon, and CV configuration.
- LAG_BASED = ModelTemplate(template_class=<class 'greykite.framework.templates.lag_based_template.LagBasedTemplate'>, description='Uses aggregated past observations as predictions. Examples are past day, week-over-week, week-over-3-week median, etc.')
Uses aggregated past observations as predictions. Examples are past day, week-over-week, week-over-3-week median, etc.
- SILVERKITE_WOW = ModelTemplate(template_class=<class 'greykite.framework.templates.multistage_forecast_template.MultistageForecastTemplate'>, description="The Silverkite+WOW model uses Silverkite to model yearly/quarterly/monthly seasonality, growth and holiday effects first, then uses week over week to estimate the residuals. The final prediction is the total of the two models. This avoids the normal week over week (WOW) estimation's weakness in capturing growth and holidays.")
The Silverkite+WOW model uses Silverkite to model yearly/quarterly/monthly seasonality, growth and holiday effects first, then uses week over week to estimate the residuals. The final prediction is the total of the two models. This avoids the normal week over week (WOW) estimation’s weakness in capturing growth and holidays.
- class greykite.framework.templates.autogen.forecast_config.ForecastConfig(computation_param: Optional[ComputationParam] = None, coverage: Optional[float] = None, evaluation_metric_param: Optional[EvaluationMetricParam] = None, evaluation_period_param: Optional[EvaluationPeriodParam] = None, forecast_horizon: Optional[int] = None, forecast_one_by_one: Optional[Union[bool, int, List[int]]] = None, metadata_param: Optional[MetadataParam] = None, model_components_param: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None, model_template: Optional[Union[str, dataclass, List[Union[str, dataclass]]]] = None)[source]
Config for providing parameters to the Forecast library
- computation_param: Optional[ComputationParam] = None
How to compute the result. See
ComputationParam
.
- coverage: Optional[float] = None
Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned.
- evaluation_metric_param: Optional[EvaluationMetricParam] = None
What metrics to evaluate. See
EvaluationMetricParam
.
- evaluation_period_param: Optional[EvaluationPeriodParam] = None
How to split data for evaluation. See
EvaluationPeriodParam
.
- forecast_horizon: Optional[int] = None
Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency.
- forecast_one_by_one: Optional[Union[bool, int, List[int]]] = None
The options to activate the forecast one-by-one algorithm. See
OneByOneEstimator
. Can be boolean, int, or list of int. If int, it has to be less than or equal to the forecast horizon. If list of int, the sum has to be the forecast horizon.
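The constraints on forecast_one_by_one can be expressed as a small check. This is a hypothetical helper written for illustration, not greykite's own validation code; it encodes only the rules stated above.

```python
from typing import List, Union


def is_valid_one_by_one(
    option: Union[bool, int, List[int]], forecast_horizon: int
) -> bool:
    """Check a forecast_one_by_one value against the documented constraints."""
    # bool is a subclass of int in Python, so it must be tested first.
    if isinstance(option, bool):
        return True
    if isinstance(option, int):
        # A single int must not exceed the forecast horizon.
        return option <= forecast_horizon
    # A list of ints must add up to exactly the forecast horizon.
    return sum(option) == forecast_horizon


print(is_valid_one_by_one([1, 2, 4], forecast_horizon=7))  # True
print(is_valid_one_by_one([3, 3], forecast_horizon=7))     # False
```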
- metadata_param: Optional[MetadataParam] = None
Information about the input data. See
MetadataParam
.
- model_components_param: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None
Parameters to tune the model. Typically a single ModelComponentsParam, but the SimpleSilverkiteTemplate template also allows a list of ModelComponentsParam for grid search. A single ModelComponentsParam corresponds to one grid, and a list corresponds to a list of grids. See
ModelComponentsParam
.
- model_template: Optional[Union[str, dataclass, List[Union[str, dataclass]]]] = None
Name of the model template. Typically a single string, but the SimpleSilverkiteTemplate template also allows a list of strings for grid search. See
ModelTemplateEnum
for valid names.
- static from_json(obj: Any) ForecastConfig [source]
Converts a json string to the corresponding instance of the
ForecastConfig
class. Raises ValueError if the input is not a json string.
- class greykite.framework.templates.autogen.forecast_config.MetadataParam(anomaly_info: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None, date_format: Optional[str] = None, freq: Optional[str] = None, time_col: Optional[str] = None, train_end_date: Optional[str] = None, value_col: Optional[str] = None)[source]
Properties of the input data
- anomaly_info: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done. If None, no adjustments are made. See forecast_pipeline.
- train_end_date: Optional[str] = None
Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the last date with a non-null value in the value_col column of df. See
forecast_pipeline
.
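The default rule for train_end_date can be illustrated on plain Python data. The helper below is a hypothetical sketch of the stated rule (last date with a non-null value), not greykite's implementation; it treats both None and NaN as null and assumes ISO-8601 date strings.

```python
import math
from typing import List, Optional, Tuple


def default_train_end_date(rows: List[Tuple[str, Optional[float]]]) -> Optional[str]:
    """Return the latest timestamp whose value is present, or None if there is none.

    `rows` holds (ISO-8601 date string, value) pairs; None and NaN count as null.
    """
    observed = [
        ts for ts, value in rows
        if value is not None and not math.isnan(value)
    ]
    # ISO-8601 date strings sort chronologically, so max() picks the latest date.
    return max(observed) if observed else None


rows = [
    ("2024-01-01", 10.0),
    ("2024-01-02", 12.5),
    ("2024-01-03", None),          # future row, e.g. kept for its regressor values
    ("2024-01-04", float("nan")),  # future row with a NaN placeholder
]
print(default_train_end_date(rows))  # 2024-01-02
```

Rows after the returned date correspond to the future period for which regressor values are supplied but the value column is empty.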
- class greykite.framework.templates.autogen.forecast_config.EvaluationMetricParam(agg_func: Optional[Callable] = None, agg_periods: Optional[int] = None, cv_report_metrics: Optional[Union[str, List[str]]] = None, cv_selection_metric: Optional[str] = None, null_model_params: Optional[Dict[str, Any]] = None, relative_error_tolerance: Optional[float] = None)[source]
What metrics to evaluate
- agg_func: Optional[Callable] = None
See
forecast_pipeline
.
- agg_periods: Optional[int] = None
See
forecast_pipeline
.
- cv_selection_metric: Optional[str] = None
See score_func in
forecast_pipeline
.
- relative_error_tolerance: Optional[float] = None
See
forecast_pipeline
.
- class greykite.framework.templates.autogen.forecast_config.EvaluationPeriodParam(cv_expanding_window: Optional[bool] = None, cv_horizon: Optional[int] = None, cv_max_splits: Optional[int] = None, cv_min_train_periods: Optional[int] = None, cv_periods_between_splits: Optional[int] = None, cv_periods_between_train_test: Optional[int] = None, cv_use_most_recent_splits: Optional[bool] = None, periods_between_train_test: Optional[int] = None, test_horizon: Optional[int] = None)[source]
How to split data for evaluation.
- cv_expanding_window: Optional[bool] = None
See
forecast_pipeline
.
- cv_horizon: Optional[int] = None
See
forecast_pipeline
.
- cv_max_splits: Optional[int] = None
See
forecast_pipeline
.
- cv_min_train_periods: Optional[int] = None
See
forecast_pipeline
.
- cv_periods_between_splits: Optional[int] = None
See
forecast_pipeline
.
- cv_periods_between_train_test: Optional[int] = None
See
forecast_pipeline
.
- cv_use_most_recent_splits: Optional[bool] = None
See
forecast_pipeline
.
- test_horizon: Optional[int] = None
See
forecast_pipeline
.
- class greykite.framework.templates.autogen.forecast_config.ModelComponentsParam(autoregression: Optional[Dict[str, Any]] = None, changepoints: Optional[Dict[str, Any]] = None, custom: Optional[Dict[str, Any]] = None, events: Optional[Dict[str, Any]] = None, growth: Optional[Dict[str, Any]] = None, hyperparameter_override: Optional[Union[Dict, List[Optional[Dict]]]] = None, regressors: Optional[Dict[str, Any]] = None, lagged_regressors: Optional[Dict[str, Any]] = None, seasonality: Optional[Dict[str, Any]] = None, uncertainty: Optional[Dict[str, Any]] = None)[source]
Parameters to tune the model.
- autoregression: Optional[Dict[str, Any]] = None
For modeling autoregression, see template for details
- custom: Optional[Dict[str, Any]] = None
Additional parameters used by template, see template for details
- hyperparameter_override: Optional[Union[Dict, List[Optional[Dict]]]] = None
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.
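The override step can be pictured as a dictionary update applied to the generated grid. The sketch below is an assumption-laden illustration, not greykite's code: the grid keys are hypothetical, and the list case is assumed to yield one updated grid per override entry.

```python
from typing import Any, Dict, List, Optional, Union

Grid = Dict[str, Any]


def apply_override(
    grid: Grid, override: Union[None, Grid, List[Optional[Grid]]]
) -> Union[Grid, List[Grid]]:
    """Update a hyperparameter grid with an override, without mutating the input."""
    if override is None:
        return dict(grid)
    if isinstance(override, dict):
        # Later keys win: existing entries are replaced and new ones are added.
        return {**grid, **override}
    # A list of overrides produces a list of grids; None entries leave the grid as is.
    return [{**grid, **(item or {})} for item in override]


# Hypothetical grid keys, for illustration only.
grid = {"estimator__growth_term": ["linear"], "estimator__fit_algorithm": ["ridge"]}
override = {"estimator__fit_algorithm": ["linear", "ridge"], "estimator__extra": [1]}
print(apply_override(grid, override))
```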
- class greykite.framework.templates.autogen.forecast_config.ComputationParam(hyperparameter_budget: Optional[int] = None, n_jobs: Optional[int] = None, verbose: Optional[int] = None)[source]
How to compute the result.
- hyperparameter_budget: Optional[int] = None
See
forecast_pipeline
.
- n_jobs: Optional[int] = None
See
forecast_pipeline
.
- verbose: Optional[int] = None
See
forecast_pipeline
.
Silverkite Template
- class greykite.framework.templates.simple_silverkite_template.SimpleSilverkiteTemplate(constants: ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateConstants = SimpleSilverkiteTemplateConstants(COMMON_MODELCOMPONENTPARAM_PARAMETERS={'SEAS': {'HOURLY': {'LT': {'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 3, 'daily_seasonality': 5}, 'NM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 4, 'daily_seasonality': 8}, 'HV': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 6, 'daily_seasonality': 12}, 'LTQM': {'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 2, 'monthly_seasonality': 2, 'weekly_seasonality': 3, 'daily_seasonality': 5}, 'NMQM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 3, 'monthly_seasonality': 3, 'weekly_seasonality': 4, 'daily_seasonality': 8}, 'HVQM': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 4, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 12}, 'NONE': {'auto_seasonality': False, 'yearly_seasonality': 0, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}}, 'DAILY': {'LT': {'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 3, 'daily_seasonality': 0}, 'NM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 3, 'daily_seasonality': 0}, 'HV': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 4, 'daily_seasonality': 0}, 'LTQM': {'auto_seasonality': False, 
'yearly_seasonality': 8, 'quarterly_seasonality': 3, 'monthly_seasonality': 2, 'weekly_seasonality': 3, 'daily_seasonality': 0}, 'NMQM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 4, 'monthly_seasonality': 4, 'weekly_seasonality': 3, 'daily_seasonality': 0}, 'HVQM': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 6, 'monthly_seasonality': 4, 'weekly_seasonality': 4, 'daily_seasonality': 0}, 'NONE': {'auto_seasonality': False, 'yearly_seasonality': 0, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}}, 'WEEKLY': {'LT': {'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'NM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'HV': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'LTQM': {'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 2, 'monthly_seasonality': 2, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'NMQM': {'auto_seasonality': False, 'yearly_seasonality': 15, 'quarterly_seasonality': 3, 'monthly_seasonality': 3, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'HVQM': {'auto_seasonality': False, 'yearly_seasonality': 25, 'quarterly_seasonality': 4, 'monthly_seasonality': 4, 'weekly_seasonality': 0, 'daily_seasonality': 0}, 'NONE': {'auto_seasonality': False, 'yearly_seasonality': 0, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 0, 'daily_seasonality': 0}}}, 'GR': {'LINEAR': {'growth_term': 'linear'}, 'NONE': {'growth_term': None}}, 'CP': {'HOURLY': {'LT': {'method': 'auto', 'resample_freq': 'D', 'regularization_strength': 0.6, 'potential_changepoint_distance': 
'7D', 'no_changepoint_distance_from_end': '30D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': None}, 'NM': {'method': 'auto', 'resample_freq': 'D', 'regularization_strength': 0.5, 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '30D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'HV': {'method': 'auto', 'resample_freq': 'D', 'regularization_strength': 0.3, 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '30D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'NONE': None}, 'DAILY': {'LT': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.6, 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': None}, 'NM': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.5, 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'HV': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.3, 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'NONE': None}, 'WEEKLY': {'LT': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.6, 'potential_changepoint_distance': '14D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': None}, 'NM': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.5, 'potential_changepoint_distance': '14D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'HV': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.3, 'potential_changepoint_distance': '14D', 
'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 15, 'yearly_seasonality_change_freq': '365D'}, 'NONE': None}}, 'HOL': {'SP1': {'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 1, 'holiday_post_num_days': 1, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, 'SP2': {'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, 'SP4': {'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 4, 'holiday_post_num_days': 4, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, 'TG': {'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 3, 'holiday_post_num_days': 3, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, 'NONE': {'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}}, 'FEASET': {'AUTO': 'auto', 'ON': True, 'OFF': False}, 'ALGO': {'LINEAR': {'fit_algorithm': 'linear', 'fit_algorithm_params': None}, 'RIDGE': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'SGD': {'fit_algorithm': 'sgd', 'fit_algorithm_params': None}, 'LASSO': {'fit_algorithm': 'lasso', 'fit_algorithm_params': None}}, 'AR': {'AUTO': {'autoreg_dict': 
'auto', 'simulation_num': 10, 'fast_simulation': False}, 'OFF': {'autoreg_dict': None, 'simulation_num': 10, 'fast_simulation': False}}, 'DSI': {'HOURLY': {'AUTO': 5, 'OFF': 0}, 'DAILY': {'AUTO': 0, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}, 'WSI': {'HOURLY': {'AUTO': 2, 'OFF': 0}, 'DAILY': {'AUTO': 2, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}}, MULTI_TEMPLATES={'SILVERKITE_DAILY_1': ['SILVERKITE_DAILY_1_CONFIG_1', 'SILVERKITE_DAILY_1_CONFIG_2', 'SILVERKITE_DAILY_1_CONFIG_3'], 'SILVERKITE_DAILY_90': ['DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_NONE_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO'], 'SILVERKITE_WEEKLY': ['WEEKLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_HV_GR_LINEAR_CP_NM_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_HV_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO'], 'SILVERKITE_HOURLY_1': ['SILVERKITE', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_OFF_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_24': ['HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_168': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 
'HOURLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_OFF'], 'SILVERKITE_HOURLY_336': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_AUTO']}, SILVERKITE=ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, 
uncertainty={'uncertainty_dict': None}), SILVERKITE_MONTHLY=ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_1=ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 
'resample_freq': '7D', 'regularization_strength': 0.809, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '7D', 'yearly_seasonality_order': 8, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 7, 'weekly_seasonality': 1, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_2=ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.624, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '17D', 'yearly_seasonality_order': 1, 
'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 1, 'quarterly_seasonality': 0, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_3=ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.59, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '8D', 'yearly_seasonality_order': 40, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 
'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 40, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 2, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_COMPONENT_KEYWORDS=<enum 'SILVERKITE_COMPONENT_KEYWORDS'>, SILVERKITE_EMPTY='DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF', VALID_FREQ=['HOURLY', 'DAILY', 'WEEKLY'], SimpleSilverkiteTemplateOptions=<class 'greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions'>), estimator: ~greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator = SimpleSilverkiteEstimator())[source]
A template for SimpleSilverkiteEstimator.
Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.
Notes
The attributes of a ForecastConfig for SimpleSilverkiteEstimator are:
- computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.
- coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). Same as coverage in forecast_pipeline. You may tune how the uncertainty is computed via model_components.uncertainty["uncertainty_dict"].
- evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.
- evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.
- forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, the default is determined from the input data frequency. Same as forecast_horizon in forecast_pipeline.
- metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.
- model_components_param: ModelComponentsParam, list [ModelComponentsParam] or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
See inline comments on which values accept lists for grid search.
- seasonality: dict [str, any] or None, optional
Seasonality configuration dictionary, with the following optional keys. (Keys are SilverkiteSeasonalityEnum members in lower case.)
The keys are parameters of forecast_simple_silverkite. Refer to that function for more details.
"auto_seasonality": bool, default False
Whether to automatically infer seasonality orders. If True, the seasonality orders will be automatically inferred from the input timeseries and the following parameters will be ignored:
"yearly_seasonality"
"quarterly_seasonality"
"monthly_seasonality"
"weekly_seasonality"
"daily_seasonality"
For details, see SeasonalityInferrer.
"yearly_seasonality": str or bool or int or a list of such values for grid search, default ‘auto’
Determines the yearly seasonality: ‘auto’, True, False, or a number for the Fourier order.
"quarterly_seasonality": str or bool or int or a list of such values for grid search, default ‘auto’
Determines the quarterly seasonality: ‘auto’, True, False, or a number for the Fourier order.
"monthly_seasonality": str or bool or int or a list of such values for grid search, default ‘auto’
Determines the monthly seasonality: ‘auto’, True, False, or a number for the Fourier order.
"weekly_seasonality": str or bool or int or a list of such values for grid search, default ‘auto’
Determines the weekly seasonality: ‘auto’, True, False, or a number for the Fourier order.
"daily_seasonality": str or bool or int or a list of such values for grid search, default ‘auto’
Determines the daily seasonality: ‘auto’, True, False, or a number for the Fourier order.
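To make the grid-search convention concrete, here is a minimal sketch of how list-valued entries multiply into candidate configurations. The helper expand_grid is hypothetical and is not part of greykite; the library builds its own hyperparameter grid internally, so this only illustrates the counting.

```python
from itertools import product

# Hypothetical seasonality dictionary; the keys follow the list above.
seasonality = {
    "auto_seasonality": False,
    "yearly_seasonality": [8, 15],           # list: two candidates for grid search
    "quarterly_seasonality": False,
    "monthly_seasonality": False,
    "weekly_seasonality": [3, 4, "auto"],    # list: three candidates
    "daily_seasonality": "auto",
}


def expand_grid(config):
    """Enumerate every combination, treating list values as grid-search candidates.

    Illustrative helper only (an assumption, not a greykite function).
    """
    keys = list(config)
    candidates = [v if isinstance(v, list) else [v] for v in config.values()]
    return [dict(zip(keys, combo)) for combo in product(*candidates)]


grid = expand_grid(seasonality)
print(len(grid))  # 2 * 3 = 6 combinations
```

Scalar values contribute a single candidate, so only the list-valued keys increase the size of the search.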
- growth: dict [str, any] or None, optional
Growth configuration dictionary with the following optional key:
"growth_term": str or None or a list of such values for grid search
How to model the growth. Valid options are “linear”, “quadratic”, “sqrt”, “cubic”, “cuberoot”. See GrowthColEnum. All these terms have their origin at the train start date.
- events: dict [str, any] or None, optional
Holiday/events configuration dictionary with the following optional keys:
"auto_holiday": bool, default False
Whether to automatically infer holiday configuration based on the input timeseries. If True, the following keys will be ignored:
"holiday_lookup_countries"
"holidays_to_model_separately"
"holiday_pre_num_days"
"holiday_post_num_days"
"holiday_pre_post_num_dict"
For details, see HolidayInferrer. Extra events specified in daily_event_df_dict will be added to the inferred holidays.
"holiday_lookup_countries": list [str] or “auto” or None or a list of such values for grid search, default “auto”
The countries that contain the holidays you intend to model (holidays_to_model_separately).
If “auto”, uses a default list of countries that contain the default holidays_to_model_separately. See HOLIDAY_LOOKUP_COUNTRIES_AUTO.
If a list, must be a list of country names.
If None or an empty list, no holidays are modeled.
"holidays_to_model_separately": list [str] or “auto” or ALL_HOLIDAYS_IN_COUNTRIES or None or a list of such values for grid search, default “auto”
Which holidays to include in the model. The model creates a separate key, value for each item in holidays_to_model_separately. The other holidays in the countries are grouped together as a single effect.
If “auto”, uses a default list of important holidays. See HOLIDAYS_TO_MODEL_SEPARATELY_AUTO.
If ALL_HOLIDAYS_IN_COUNTRIES, uses all available holidays in holiday_lookup_countries. This can often create a model that has too many parameters, and should typically be avoided.
If a list, must be a list of holiday names.
If None or an empty list, all holidays in holiday_lookup_countries are grouped together as a single effect.
Use holiday_lookup_countries to provide a list of countries where these holidays occur.
"holiday_pre_num_days": int or a list of such values for grid search, default 2
Model holiday effects for pre_num days before the holiday. The unit is days, not periods. It does not depend on input data frequency.
"holiday_post_num_days": int or a list of such values for grid search, default 2
Model holiday effects for post_num days after the holiday. The unit is days, not periods. It does not depend on input data frequency.
"holiday_pre_post_num_dict": dict [str, (int, int)] or None, default None
Overrides pre_num and post_num for each holiday in holidays_to_model_separately. For example, if holidays_to_model_separately contains “Thanksgiving” and “Labor Day”, this parameter can be set to {"Thanksgiving": [1, 3], "Labor Day": [1, 2]}, denoting that the “Thanksgiving” pre_num is 1 and post_num is 3, and the “Labor Day” pre_num is 1 and post_num is 2. Holidays not specified use the default given by pre_num and post_num.
"daily_event_df_dict": dict [str, pandas.DataFrame] or None, default None
A dictionary of data frames, each representing events data for the corresponding key. Specifies additional events to include besides the holidays specified above. The format is the same as in forecast. The DataFrame has two columns:
The first column contains event dates. Must be in a format recognized by pandas.to_datetime. Must be at daily frequency for proper join. It is joined against the time in df, converted to a day: pd.to_datetime(pd.DatetimeIndex(df[time_col]).date).
The second column contains the event label for each date.
The column order is important; column names are ignored. The event dates must span their occurrences in both the training and future prediction period.
During modeling, each key in the dictionary is mapped to a categorical variable named f"{EVENT_PREFIX}_{key}", whose value at each timestamp is specified by the corresponding DataFrame.
For example, to manually specify a yearly event on September 1 during a training/forecast period that spans 2020-2022:
    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "label": ["is_event", "is_event", "is_event"]
        })
    }
It’s possible to specify multiple events in the same df. Two events, "sep" and "oct", are specified below for 2020-2021:
    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2020-10-01", "2021-09-01", "2021-10-01"],
            "event_name": ["sep", "oct", "sep", "oct"]
        })
    }
Use multiple keys if two events may fall on the same date. These events must be in separate DataFrames:
    daily_event_df_dict = {
        "fixed_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "event_name": "fixed_event"
        }),
        "moving_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-08-28", "2022-09-03"],
            "event_name": "moving_event"
        }),
    }
The multiple event specification can be used even if events never overlap. An equivalent specification to the second example:
    daily_event_df_dict = {
        "sep": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01"],
            "event_name": "is_event"
        }),
        "oct": pd.DataFrame({
            "date": ["2020-10-01", "2021-10-01"],
            "event_name": "is_event"
        }),
    }
Note: All these events are automatically added to the model. There is no need to specify them in extra_pred_cols as you would for forecast.
Note: Do not use EVENT_DEFAULT in the second column. This is reserved to indicate dates that do not correspond to an event.
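The interaction between holiday_pre_num_days, holiday_post_num_days and holiday_pre_post_num_dict can be sketched as follows. The helper resolve_holiday_windows is hypothetical and written only to illustrate the override rule described above; it is not how greykite implements it.

```python
def resolve_holiday_windows(holidays, pre_num, post_num, pre_post_num_dict=None):
    """Return {holiday: (pre_days, post_days)}, applying per-holiday overrides.

    Hypothetical helper for illustration (not a greykite function).
    """
    overrides = pre_post_num_dict or {}
    windows = {}
    for name in holidays:
        # Holidays without an override fall back to the global defaults.
        pre, post = overrides.get(name, (pre_num, post_num))
        windows[name] = (pre, post)
    return windows


# Example events dictionary using the keys documented above.
events = {
    "holidays_to_model_separately": ["Thanksgiving", "Labor Day", "Christmas Day"],
    "holiday_pre_num_days": 2,
    "holiday_post_num_days": 2,
    "holiday_pre_post_num_dict": {"Thanksgiving": [1, 3], "Labor Day": [1, 2]},
}

windows = resolve_holiday_windows(
    events["holidays_to_model_separately"],
    events["holiday_pre_num_days"],
    events["holiday_post_num_days"],
    events["holiday_pre_post_num_dict"],
)
print(windows)
# {'Thanksgiving': (1, 3), 'Labor Day': (1, 2), 'Christmas Day': (2, 2)}
```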
- changepoints: dict [str, dict] or None, optional
Specifies the changepoint configuration. Dictionary with the following optional keys:
"auto_growth": bool, default False
Whether to automatically infer growth configuration. If True, the growth term and the automatic changepoint detection configuration will be inferred from the input timeseries, and the following parameters will be ignored:
"growth_term" in the growth dictionary
"changepoints_dict" (all parameters except the custom changepoint parameters, which are combined with the automatically detected changepoints)
For details, see generate_trend_changepoint_detection_params.
"changepoints_dict": dict or None or a list of such values for grid search
Changepoints dictionary passed to forecast_simple_silverkite. A dictionary with the following optional keys:
"method": str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.
“auto”. Automatically detects change points.
Additional keys to provide parameters for each particular method are described below.
"continuous_time_col": str or None
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.
"growth_func": callable or None
Growth function (numeric -> numeric). Changepoint features are created by applying growth_func to “continuous_time_col” with offsets. If None, uses the identity function to use continuous_time_col directly as the growth term.
If changepoints_dict["method"] == "uniform", this other key is required:
"n_changepoints": int
Number of changepoints to evenly space across the training period.
If changepoints_dict["method"] == "custom", this other key is required:
"dates": list [int or float or str or datetime]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
If changepoints_dict["method"] == "auto", optional keys can be passed that match the parameters in find_trend_changepoints (except df, time_col and value_col, which are already known). To add manually specified changepoints to the automatically detected ones, the keys dates, combine_changepoint_min_distance and keep_detected can be specified, which correspond to the three parameters custom_changepoint_dates, min_distance and keep_detected in combine_detected_and_custom_trend_changepoints.
"seasonality_changepoints_dict": dict or None or a list of such values for grid search
Seasonality changepoints dictionary passed to forecast_simple_silverkite. The optional keys are the parameters in find_seasonality_changepoints. You don’t need to provide df, time_col, value_col or trend_changepoints, since they are passed with the class automatically.
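As a rough illustration of the three changepoint methods, the sketch below builds one changepoints_dict per method and checks the keys that the text above marks as required. The REQUIRED_KEYS table and the missing_changepoint_keys helper are assumptions made for this example, not greykite code; the library performs its own validation.

```python
# Keys that the description above marks as required for each method.
REQUIRED_KEYS = {"uniform": ["n_changepoints"], "custom": ["dates"], "auto": []}


def missing_changepoint_keys(changepoints_dict):
    """Return the required keys absent from a changepoints_dict.

    Hypothetical validation helper for illustration only.
    """
    method = changepoints_dict["method"]
    if method not in REQUIRED_KEYS:
        raise ValueError(f"Unknown changepoint method: {method!r}")
    return [key for key in REQUIRED_KEYS[method] if key not in changepoints_dict]


uniform = {"method": "uniform", "n_changepoints": 20}
custom = {"method": "custom", "dates": ["2020-03-01", "2021-03-01"]}
auto = {
    "method": "auto",
    "regularization_strength": 0.6,
    "resample_freq": "7D",
    # Manually specified dates to combine with the detected changepoints.
    "dates": ["2020-03-01"],
    "combine_changepoint_min_distance": "30D",
    "keep_detected": True,
}

for cp in (uniform, custom, auto):
    print(cp["method"], missing_changepoint_keys(cp))

# A "uniform" dictionary without n_changepoints is flagged.
print(missing_changepoint_keys({"method": "uniform"}))  # ['n_changepoints']
```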
- autoregression: dict [str, dict] or None, optional
Specifies the autoregression configuration. Dictionary with the following optional keys:
"autoreg_dict": dict or str or None or a list of such values for grid search
If a dict: A dictionary with arguments for build_autoreg_df. That function’s parameter value_col is inferred from the input of the current function self.forecast. Other keys are:
"lag_dict": dict or None
"agg_lag_dict": dict or None
"series_na_fill_func": callable
If a str: The string represents a method, and a dictionary is constructed using that str. Currently the only implemented method is “auto”, which uses __get_default_autoreg_dict to create a dictionary. See more details for the above parameters in build_autoreg_df.
"simulation_num": int, default 10
The number of simulations to use. Applies only if any of the lags in autoreg_dict are smaller than forecast_horizon. In that case, simulations are needed to generate forecasts and prediction intervals.
"fast_simulation": bool, default False
Determines if fast simulations are to be used. This only impacts models which include autoregression. This method generates only one simulation without any error being added and then adds the error using the volatility model. The advantage is a major boost in speed during inference; the disadvantage is potentially less accurate prediction intervals.
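The condition under which simulation_num matters (some lag is smaller than forecast_horizon) can be sketched as below. The helpers all_lag_orders and needs_simulation are hypothetical, and the sketch only inspects the "orders" and "orders_list" entries, which is a simplifying assumption about the lag specification.

```python
def all_lag_orders(autoreg_dict):
    """Collect lag orders from lag_dict["orders"] and agg_lag_dict["orders_list"].

    Hypothetical helper; other lag specifications (e.g. intervals) are ignored here.
    """
    orders = list((autoreg_dict.get("lag_dict") or {}).get("orders", []))
    for group in (autoreg_dict.get("agg_lag_dict") or {}).get("orders_list", []):
        orders.extend(group)
    return orders


def needs_simulation(autoreg_dict, forecast_horizon):
    """True if any lag is smaller than the horizon, so future lags are unobserved."""
    return any(lag < forecast_horizon for lag in all_lag_orders(autoreg_dict))


# The autoreg_dict listed for SILVERKITE_MONTHLY in the class signature.
monthly_autoreg = {"lag_dict": None, "agg_lag_dict": {"orders_list": [[1, 2, 3]]}}

print(needs_simulation(monthly_autoreg, forecast_horizon=1))  # False
print(needs_simulation(monthly_autoreg, forecast_horizon=3))  # True
```

With a horizon of 1, every lag refers to an observed value; with a horizon of 3, lags 1 and 2 reach into the forecast period, which is the case where simulations are needed.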
- regressors: dict [str, any] or None, optional
Specifies the regressors to include in the model (e.g. macro-economic factors). Dictionary with the following optional keys:
"regressor_cols": list [str] or None or a list of such values for grid search
The columns in df to use as regressors. Note that regressor values must be available in df for all prediction dates. Thus, df will contain timestamps for both training and future prediction.
Regressors must be available on all dates.
The response must be available for training dates (metadata["value_col"]).
Use extra_pred_cols to specify interactions of any model terms with the regressors.
- lagged_regressors: dict [str, dict] or None, optional
Specifies the lagged regressors configuration. Dictionary with the following optional key:
"lagged_regressor_dict": dict or None or a list of such values for grid search
A dictionary with arguments for build_autoreg_df_multi. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included in df. The value of each key is either a dict or str. If dict, it has the following keys:
"lag_dict": dict or None
"agg_lag_dict": dict or None
"series_na_fill_func": callable
If str, it represents a method and a dictionary will be constructed using that str. Currently the only implemented method is “auto”, which uses SilverkiteForecast’s __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
    lagged_regressor_dict = {
        "regressor1": {
            "lag_dict": {"orders": [1, 2, 3]},
            "agg_lag_dict": {
                "orders_list": [[7, 7 * 2, 7 * 3]],
                "interval_list": [(8, 7 * 2)]},
            "series_na_fill_func": lambda s: s.bfill().ffill()},
        "regressor2": "auto"}
Check the docstring of build_autoreg_df_multi for more details for each argument.
- uncertainty: dict [str, dict] or None, optional
Along with coverage, specifies the uncertainty interval configuration. Use coverage to set interval size. Use uncertainty to tune the calculation.
"uncertainty_dict": str or dict or None or a list of such values for grid search
“auto” or a dictionary on how to fit the uncertainty model. If a dictionary, valid keys are:
"uncertainty_method": str
The title of the method. Only "simple_conditional_residuals" is implemented in fit_ml_model, which calculates intervals using residuals.
"params": dict
A dictionary of parameters needed for the requested uncertainty_method. For example, for uncertainty_method="simple_conditional_residuals", see the parameters of conf_interval:
"conditional_cols"
"quantiles"
"quantile_estimation_method"
"sample_size_thresh"
"small_sample_size_method"
"small_sample_size_quantile"
The default value for quantiles is inferred from coverage.
If “auto”, see get_silverkite_uncertainty_dict for the default value. If coverage is not None and uncertainty_dict is not provided, then the “auto” setting is used.
If coverage is None and uncertainty_dict is None, then no intervals are returned.
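The two fallback rules above can be summarized in a small decision function. This is a sketch under stated assumptions: resolve_uncertainty is a hypothetical name, and the case where an explicit uncertainty_dict is supplied is simply passed through because the text above does not describe it further.

```python
def resolve_uncertainty(coverage, uncertainty_dict=None):
    """Return the effective uncertainty setting, or None if no intervals are produced.

    Hypothetical helper that mirrors only the two rules stated in the text.
    """
    if uncertainty_dict is None:
        if coverage is None:
            return None      # no coverage, no dictionary: no intervals
        return "auto"        # coverage given, dictionary omitted: "auto" setting
    # Assumption: an explicit setting is used as provided.
    return uncertainty_dict


explicit = {
    "uncertainty_method": "simple_conditional_residuals",
    "params": {"conditional_cols": ["dow"]},  # illustrative column name
}

print(resolve_uncertainty(coverage=None))                  # None
print(resolve_uncertainty(coverage=0.95))                  # auto
print(resolve_uncertainty(0.95, explicit) is explicit)     # True
```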
- custom: dict [str, any] or None, optional
Custom parameters that don’t fit the categories above. Dictionary with the following optional keys:
"fit_algorithm_dict": dict or a list of such values for grid search
How to fit the model. A dictionary with the following optional keys.
"fit_algorithm": str, optional, default “ridge”
The type of predictive model used in fitting.
See fit_model_via_design_matrix for available options and their parameters.
"fit_algorithm_params": dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
"feature_sets_enabled": dict [str, bool or “auto” or None] or bool or “auto” or None; or a list of such values for grid search
Whether to include interaction terms and categorical variables to increase model flexibility.
If a dict, boolean values indicate whether to include various sets of features in the model. The following keys are recognized (from SilverkiteColumn):
"COLS_HOUR_OF_WEEK": str
Constant hour of week effect
"COLS_WEEKEND_SEAS": str
Daily seasonality interaction with is_weekend
"COLS_DAY_OF_WEEK_SEAS": str
Daily seasonality interaction with day of week
"COLS_TREND_DAILY_SEAS": str
Allow daily seasonality to change over time by is_weekend
"COLS_EVENT_SEAS": str
Allow sub-daily event effects
"COLS_EVENT_WEEKEND_SEAS": str
Allow sub-daily event effect to interact with is_weekend
"COLS_DAY_OF_WEEK": str
Constant day of week effect
"COLS_TREND_WEEKEND": str
Allow trend (growth, changepoints) to interact with is_weekend
"COLS_TREND_DAY_OF_WEEK": str
Allow trend to interact with day of week
"COLS_TREND_WEEKLY_SEAS": str
Allow weekly seasonality to change over time
The following dictionary values are recognized:
True: include the feature set in the model
False: do not include the feature set in the model
None: do not include the feature set in the model
“auto” or not provided: use the default setting based on data frequency and size
If not a dict:
If a boolean, equivalent to a dictionary with all values set to the boolean.
If None, equivalent to a dictionary with all values set to False.
If “auto”, equivalent to a dictionary with all values set to “auto”.
"max_daily_seas_interaction_order": int or None or a list of such values for grid search, default 5
Max Fourier order to use for interactions with daily seasonality (COLS_EVENT_SEAS, COLS_EVENT_WEEKEND_SEAS, COLS_WEEKEND_SEAS, COLS_DAY_OF_WEEK_SEAS, COLS_TREND_DAILY_SEAS).
The model includes interaction terms specified by feature_sets_enabled up to the order limited by this value and the available order from seasonality.
"max_weekly_seas_interaction_order": int or None or a list of such values for grid search, default 2
Max Fourier order to use for interactions with weekly seasonality (COLS_TREND_WEEKLY_SEAS).
The model includes interaction terms specified by feature_sets_enabled up to the order limited by this value and the available order from seasonality.
"extra_pred_cols": list [str] or None or a list of such values for grid search, default None
Names of extra predictor columns to pass to forecast_silverkite. The standard interactions can be controlled via the feature_sets_enabled parameter. Accepts any valid patsy model formula term. Can be used to model complex interactions of time features, events, seasonality, changepoints, regressors. Columns should be generated by build_silverkite_features or included with input data. These are added to any features already included by feature_sets_enabled and terms specified by model.
"drop_pred_cols": list [str] or None, default None
Names of predictor columns to be dropped from the final model. Ignored if None.
"explicit_pred_cols": list [str] or None, default None
Names of the explicit predictor columns which will be the only variables in the final model. Note that this overwrites the generated predictors in the model and may include new terms not appearing in the predictors (e.g. interaction terms). Ignored if None.
"min_admissible_value": float or double or int or None, default None
The lowest admissible value for the forecasts and prediction intervals. Any value below this will be mapped back to this value. If None, there is no lower bound.
"max_admissible_value": float or double or int or None, default None
The highest admissible value for the forecasts and prediction intervals. Any value above this will be mapped back to this value. If None, there is no upper bound.
"normalize_method": str or None, default None
The normalization method for the feature matrix. If a string is provided, it will be used as the normalization method in normalize_df, passed via the argument method. Available values are “statistical”, “zero_to_one”, “minus_half_to_half” and “zero_at_origin”. See that function for more details.
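The shorthand rules for feature_sets_enabled described above (a bool, None or "auto" standing in for a full dictionary) can be sketched as follows. The function expand_feature_sets_enabled is a hypothetical illustration, not a greykite function; the key names are the ones listed above.

```python
FEATURE_SETS = [
    "COLS_HOUR_OF_WEEK", "COLS_WEEKEND_SEAS", "COLS_DAY_OF_WEEK_SEAS",
    "COLS_TREND_DAILY_SEAS", "COLS_EVENT_SEAS", "COLS_EVENT_WEEKEND_SEAS",
    "COLS_DAY_OF_WEEK", "COLS_TREND_WEEKEND", "COLS_TREND_DAY_OF_WEEK",
    "COLS_TREND_WEEKLY_SEAS",
]


def expand_feature_sets_enabled(value):
    """Expand the shorthand forms into one entry per feature set.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, dict):
        # Keys that are not provided behave like "auto".
        return {name: value.get(name, "auto") for name in FEATURE_SETS}
    if value is None:
        return {name: False for name in FEATURE_SETS}
    if isinstance(value, bool) or value == "auto":
        return {name: value for name in FEATURE_SETS}
    raise ValueError(f"Unsupported feature_sets_enabled value: {value!r}")


partial = expand_feature_sets_enabled({"COLS_DAY_OF_WEEK": True, "COLS_EVENT_SEAS": False})
print(partial["COLS_DAY_OF_WEEK"])    # True
print(partial["COLS_HOUR_OF_WEEK"])   # auto
print(set(expand_feature_sets_enabled(None).values()))  # {False}
```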
- hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None], optional
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.
Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.
For example:
    hyperparameter_override={
        "estimator__silverkite": SimpleSilverkiteForecast(),
        "estimator__growth_term": "linear",
        "input__response__null__impute_algorithm": "ts_interpolate",
        "input__response__null__impute_params": {"orders": [7, 14]},
        "input__regressors_numeric__normalize__normalize_algorithm": "RobustScaler",
    }
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.
For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.
Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.
The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
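The update rule for hyperparameter_override can be sketched with plain dictionaries. The helper apply_overrides and the grid contents are assumptions made for illustration; only the {named_step}__{parameter_name} key format and the "update, then one grid per dictionary" behavior come from the description above.

```python
def apply_overrides(base_grid, hyperparameter_override):
    """Return one grid per override dictionary, each updating the base grid.

    Hypothetical helper for illustration (not a greykite function).
    """
    if not isinstance(hyperparameter_override, list):
        hyperparameter_override = [hyperparameter_override]
    # A None entry leaves the base grid unchanged.
    return [{**base_grid, **(override or {})} for override in hyperparameter_override]


# Illustrative grid derived from the model components.
base_grid = {
    "estimator__growth_term": ["linear"],
    "estimator__yearly_seasonality": ["auto"],
}

# First dictionary searches over growth; second fixes growth and searches over seasonality.
overrides = [
    {"estimator__growth_term": ["linear", "quadratic"]},
    {"estimator__yearly_seasonality": [8, 15]},
]

grids = apply_overrides(base_grid, overrides)
print(len(grids))                                  # 2
print(grids[0]["estimator__growth_term"])          # ['linear', 'quadratic']
print(grids[1]["estimator__yearly_seasonality"])   # [8, 15]
```

Because each dictionary produces its own grid, combinations are never mixed across the two overrides, which is what keeps the search space small.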
- model_template: str, list [str] or None, default None
The simple silverkite template supports single templates, multi templates, or a list of single/multi templates. A valid single template must be one of SILVERKITE, SILVERKITE_MONTHLY, SILVERKITE_DAILY_1_CONFIG_1, SILVERKITE_DAILY_1_CONFIG_2, SILVERKITE_DAILY_1_CONFIG_3, SILVERKITE_EMPTY, or a name of the form
{FREQ}_SEAS_{VAL}_GR_{VAL}_CP_{VAL}_HOL_{VAL}_FEASET_{VAL}_ALGO_{VAL}_AR_{VAL}
For example, DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_ON_ALGO_RIDGE_AR_ON. The valid FREQ and VAL can be found at
simple_silverkite_template_config
. The components stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm and autoregression in ModelComponentsParam, which is used in SimpleSilverkiteTemplate. Users are allowed to:
- Omit any number of component-value pairs; omitted components are filled with default values.
- Switch the order of different component-value pairs.
A valid multi template must belong to
MULTI_TEMPLATES
or must be a list of single or multi template names.
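The naming scheme can be illustrated with a short parser. This is a sketch only: `parse_template_name` and `COMPONENT_KEYS` are hypothetical names, not greykite's own implementation. It assumes that FREQ and every VAL contain no underscores, and it does not check values against `simple_silverkite_template_config`.

```python
# Component keys as they appear in the template name format above.
COMPONENT_KEYS = ("SEAS", "GR", "CP", "HOL", "FEASET", "ALGO", "AR")


def parse_template_name(name, defaults=None):
    """Split "{FREQ}_KEY_VAL_KEY_VAL..." into (freq, {key: val}).

    Pairs may come in any order and any of them may be omitted;
    omitted keys fall back to the caller-supplied ``defaults``.
    """
    tokens = name.split("_")
    freq, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"Expected component-value pairs in {name!r}")
    components = dict(defaults or {})
    seen = set()
    for key, value in zip(rest[0::2], rest[1::2]):
        if key not in COMPONENT_KEYS:
            raise ValueError(f"Unknown component {key!r} in {name!r}")
        if key in seen:
            raise ValueError(f"Component {key!r} given twice in {name!r}")
        seen.add(key)
        components[key] = value
    return freq, components


full_freq, full_components = parse_template_name(
    "DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_ON_ALGO_RIDGE_AR_ON")

# Pairs omitted and reordered; the defaults here are placeholders only.
short_freq, short_components = parse_template_name(
    "DAILY_GR_LINEAR_SEAS_NM", defaults={"ALGO": "RIDGE"})
```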
- DEFAULT_MODEL_TEMPLATE = 'SILVERKITE'
The default model template. See
ModelTemplateEnum
. Uses a string to avoid circular imports. Overrides the value from ForecastConfigDefaults
.
- property allow_model_template_list: bool
SimpleSilverkiteTemplate allows config.model_template to be a list.
- property allow_model_components_param_list: bool
SilverkiteTemplate allows config.model_components_param to be a list.
- property constants: SimpleSilverkiteTemplateConstants
Constants used by the template class. Includes the model templates and their default values.
- get_regressor_cols()[source]
Returns regressor column names from the model components.
Implements the method in
BaseTemplate
. Uses these attributes:
model_components: ModelComponentsParam, list [ModelComponentsParam] or None, default None
Configuration of model growth, seasonality, holidays, etc. See
SimpleSilverkiteTemplate
for details.
- Returns
regressor_cols – The names of regressor columns used in any hyperparameter set requested by
model_components
. None if there are no regressors.
- Return type
list [str] or None
- get_lagged_regressor_info()[source]
Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.
Implements the method in
BaseTemplate
.
- Returns
lagged_regressor_info – A dictionary that includes the lagged regressor column names and the maximal/minimal lag order. The keys are:
- lagged_regressor_cols : list [str] or None
See
forecast_pipeline
.
- overall_min_lag_order : int or None
- overall_max_lag_order : int or None
For example:
self.config.model_components_param.lagged_regressors["lagged_regressor_dict"] = [
    {"regressor1": {
        "lag_dict": {"orders": [7]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()}
    },
    {"regressor2": {
        "lag_dict": {"orders": [2]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()}
    },
    {"regressor3": "auto"}
]
Then the function returns:
lagged_regressor_info = {
    "lagged_regressor_cols": ["regressor1", "regressor2", "regressor3"],
    "overall_min_lag_order": 2,
    "overall_max_lag_order": 21
}
Note that “regressor3” is skipped when computing the lag orders, because the “auto” option ensures that its lag order is proper.
- Return type
dict
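The minimum and maximum in the example above can be reproduced by collecting every lag order mentioned in `lag_dict` and `agg_lag_dict`. The function below is a hypothetical, simplified sketch rather than the library's implementation. It assumes interval endpoints count as lag orders and that "auto" entries contribute a column name but no orders.

```python
def lag_order_bounds(lagged_regressor_dicts):
    """Collect column names and the overall min/max lag order (illustrative only)."""
    cols, orders = [], []
    for entry in lagged_regressor_dicts:
        for col, spec in entry.items():
            cols.append(col)
            if spec == "auto":
                continue  # "auto" entries are skipped when computing lag orders
            lag_dict = spec.get("lag_dict") or {}
            orders.extend(lag_dict.get("orders") or [])
            agg_lag_dict = spec.get("agg_lag_dict") or {}
            for order_list in agg_lag_dict.get("orders_list") or []:
                orders.extend(order_list)
            for start, end in agg_lag_dict.get("interval_list") or []:
                orders.extend([start, end])
    return {
        "lagged_regressor_cols": cols or None,
        "overall_min_lag_order": min(orders) if orders else None,
        "overall_max_lag_order": max(orders) if orders else None,
    }


# Same configuration as the example above, without the fill functions.
lagged_regressor_dicts = [
    {"regressor1": {
        "lag_dict": {"orders": [7]},
        "agg_lag_dict": {"orders_list": [[7, 7 * 2, 7 * 3]], "interval_list": [(8, 7 * 2)]}}},
    {"regressor2": {
        "lag_dict": {"orders": [2]},
        "agg_lag_dict": {"orders_list": [[7, 7 * 2]], "interval_list": [(8, 7 * 2)]}}},
    {"regressor3": "auto"},
]

info = lag_order_bounds(lagged_regressor_dicts)
```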
- get_hyperparameter_grid()[source]
Returns hyperparameter grid.
Implements the method in
BaseTemplate
. Converts model components, time properties, and model template into
SimpleSilverkiteEstimator
hyperparameters. Uses these attributes:
model_components: ModelComponentsParam, list [ModelComponentsParam] or None, default None
Configuration of model growth, seasonality, events, etc. See
SimpleSilverkiteTemplate
for details.
- time_properties: dict [str, any] or None, default None
Time properties dictionary (likely produced by
get_forecast_time_properties
) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
- model_template: str, default “SILVERKITE”
The name of model template, must be one of the valid templates defined in
SimpleSilverkiteTemplate
.
Notes
forecast_pipeline
handles the train/test splits according to EvaluationPeriodParam, so estimator__train_test_thresh and estimator__training_fraction are always None. Similarly, estimator__origin_for_time_vars is set to None.
- Returns
hyperparameter_grid – Hyperparameter grid for grid search in
forecast_pipeline
. The output dictionary values are lists, combined in grid search.
- Return type
dict [str, list [any]] or list [ dict [str, list [any]] ]
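How list-valued entries are "combined in grid search" can be seen by expanding a grid into its individual parameter settings. The `expand_grid` helper below is a hypothetical sketch using only the standard library. It mirrors the usual exhaustive grid semantics (the Cartesian product within one dictionary, with dictionaries in a list expanded independently), and the grid contents are example values.

```python
from itertools import product


def expand_grid(hyperparameter_grid):
    """Expand a dict (or list of dicts) of lists into individual settings."""
    grids = hyperparameter_grid if isinstance(hyperparameter_grid, list) else [hyperparameter_grid]
    settings = []
    for grid in grids:
        keys = sorted(grid)
        # One setting per element of the Cartesian product of the value lists.
        for values in product(*(grid[key] for key in keys)):
            settings.append(dict(zip(keys, values)))
    return settings


example_grid = {
    "estimator__yearly_seasonality": [5, 10],
    "estimator__fit_algorithm_dict": [{"fit_algorithm": "ridge"}],
    "estimator__train_test_thresh": [None],
}
settings = expand_grid(example_grid)
```

With a randomized search, only a sample of these settings would be evaluated rather than all of them.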
- check_template_type(template)[source]
Checks the template name is valid and whether it is single or multi template. Raises an error if the template is not recognized.
A valid single template must be one of SILVERKITE, SILVERKITE_MONTHLY, SILVERKITE_DAILY_1_CONFIG_1, SILVERKITE_DAILY_1_CONFIG_2, SILVERKITE_DAILY_1_CONFIG_3, SILVERKITE_EMPTY, or a name of the form
{FREQ}_SEAS_{VAL}_GR_{VAL}_CP_{VAL}_HOL_{VAL}_FEASET_{VAL}_ALGO_{VAL}_AR_{VAL}
For example, DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_ON_ALGO_RIDGE_AR_ON. The valid FREQ and VAL can be found at
simple_silverkite_template_config
. The components stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm and autoregression in ModelComponentsParam, which is used in SimpleSilverkiteTemplate. Users are allowed to:
- Omit any number of component-value pairs; omitted components are filled with default values.
- Switch the order of different component-value pairs.
A valid multi template must belong to
MULTI_TEMPLATES
or must be a list of single or multi template names.
- Parameters
template (str, SimpleSilverkiteTemplateName or list [str, SimpleSilverkiteTemplateName]) – The
model_template
parameter fed into ForecastConfig for simple silverkite templates.
- Returns
template_type – “single” or “multi”.
- Return type
str
- get_model_components_from_model_template(template)[source]
Gets the
ModelComponentsParam
class from model template. The template could be a name string, a SimpleSilverkiteTemplateOptions dataclass, or a list of such strings and/or dataclasses. If a list is given, a list of ModelComponentsParam is returned. If a single element is given, a list of length 1 is returned.
- Parameters
template (str, SimpleSilverkiteTemplateOptions or list [str, SimpleSilverkiteTemplateOptions]) – The
model_template
in ForecastConfig, could be a name string, a SimpleSilverkiteTemplateOptions dataclass, or a list of such strings and/or dataclasses.
- Returns
model_components_param – The list of ModelComponentsParam class(es) that correspond to template.
- Return type
list [ModelComponentsParam]
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ComputationParam
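The "keep what was provided, fill in the rest" behaviour shared by the `apply_*_defaults` methods can be sketched with a plain dataclass. `ExampleParam`, `EXAMPLE_DEFAULTS` and `apply_example_defaults` are hypothetical stand-ins, and the default values shown are invented for illustration. The sketch returns a copy; whether the library mutates its input is not stated here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ExampleParam:
    """Hypothetical stand-in for a *Param dataclass; not a greykite class."""
    n_jobs: Optional[int] = None
    verbose: Optional[int] = None


# Invented default values, for illustration only.
EXAMPLE_DEFAULTS = {"n_jobs": 1, "verbose": 0}


def apply_example_defaults(param=None):
    """Create the object if None, then fill attributes that are still None."""
    if param is None:
        param = ExampleParam()
    updates = {
        name: default
        for name, default in EXAMPLE_DEFAULTS.items()
        if getattr(param, name) is None
    }
    return replace(param, **updates)


from_none = apply_example_defaults()
partially_set = apply_example_defaults(ExampleParam(n_jobs=4))
```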
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid Forecast Config which contains the provided attribute values, and default values for attributes that were not provided.
- Return type
ForecastConfig
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ModelComponentsParam
or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
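The two rules stated above (use the default when None, unpack a single-element list) can be written as a few lines of plain Python. `apply_template_default` is a hypothetical illustration, not the library's method. It assumes None entries inside a list are also replaced by the default, and it uses the `DEFAULT_MODEL_TEMPLATE` value `"SILVERKITE"` documented for the simple silverkite template.

```python
def apply_template_default(model_template=None, default="SILVERKITE"):
    """Fill in the default template name and unpack single-element lists."""
    if model_template is None:
        return default
    if isinstance(model_template, list):
        # Assumption: None entries in a list are replaced by the default.
        filled = [default if name is None else name for name in model_template]
        return filled[0] if len(filled) == 1 else filled
    return model_template


from_none = apply_template_default(None)
from_single_list = apply_template_default(["SILVERKITE_EMPTY"])
from_mixed_list = apply_template_default([None, "SILVERKITE_MONTHLY"])
```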
- static apply_template_decorator(func)
Decorator for the
apply_template_for_pipeline_params
function. By default, this applies apply_forecast_config_defaults to config.
A subclass may override this for pre/post processing of apply_template_for_pipeline_params, such as input validation. In this case, apply_template_for_pipeline_params
must also be implemented in the subclass.
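The decorator pattern described here can be sketched without greykite. The classes and functions below are hypothetical: plain dictionaries and lists stand in for `ForecastConfig` and the input DataFrame, and module-level functions are used as decorators to keep the sketch short. It shows why a subclass that swaps the decorator must also re-declare the decorated method: a decorator is bound at definition time.

```python
from functools import wraps


def default_decorator(func):
    """Apply config defaults before calling the wrapped method."""
    @wraps(func)
    def wrapper(self, df, config=None):
        return func(self, df, self.apply_forecast_config_defaults(config))
    return wrapper


def validating_decorator(func):
    """Like default_decorator, with input validation as pre-processing."""
    @wraps(func)
    def wrapper(self, df, config=None):
        if not df:
            raise ValueError("df must not be empty")
        return func(self, df, self.apply_forecast_config_defaults(config))
    return wrapper


class ExampleTemplate:
    """Hypothetical template; dicts stand in for ForecastConfig."""

    def apply_forecast_config_defaults(self, config=None):
        merged = {"model_template": "SILVERKITE"}
        merged.update(config or {})
        return merged

    @default_decorator
    def apply_template_for_pipeline_params(self, df, config=None):
        return {"df": df, "config": config}


class ValidatingTemplate(ExampleTemplate):
    # Re-declared so that the new decorator is the one applied.
    @validating_decorator
    def apply_template_for_pipeline_params(self, df, config=None):
        return {"df": df, "config": config}


params = ExampleTemplate().apply_template_for_pipeline_params([1, 2, 3])
```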
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict
Implements template interface method. Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call
forecast_pipeline
. See template interface for parameters and return value.
Uses the methods in this class to set:
"regressor_cols" : get_regressor_cols()
"lagged_regressor_cols" : get_lagged_regressor_info()
"pipeline" : get_pipeline()
"time_properties" : get_forecast_time_properties()
"hyperparameter_grid" : get_hyperparameter_grid()
All other parameters are taken directly from
config
.
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of
greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator
.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by
get_forecast_time_properties
) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if
score_func
is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
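The role of the flag during model selection can be illustrated in a few lines. `select_best` and the candidate names are hypothetical; the sketch only shows that the same cross-validation results lead to a different choice depending on whether higher or lower values are better.

```python
def select_best(cv_results, greater_is_better):
    """Return the candidate with the best CV value for the given direction."""
    chooser = max if greater_is_better else min
    return chooser(cv_results, key=cv_results.get)


# Hypothetical mean CV values per candidate model.
cv_results = {"candidate_a": 0.2, "candidate_b": 0.5}

best_as_score = select_best(cv_results, greater_is_better=True)
best_as_loss = select_best(cv_results, greater_is_better=False)
```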
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by
get_forecast_time_properties
)
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to
sklearn.model_selection.GridSearchCV
(param_grid) or sklearn.model_selection.RandomizedSearchCV
(param_distributions).
- df: Optional[pd.DataFrame]
Timeseries data to forecast.
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call
forecast_pipeline
.
- class greykite.sklearn.estimator.simple_silverkite_estimator.SimpleSilverkiteEstimator(silverkite: ~greykite.algo.forecast.silverkite.forecast_simple_silverkite.SimpleSilverkiteForecast = <greykite.algo.forecast.silverkite.forecast_simple_silverkite.SimpleSilverkiteForecast object>, score_func: callable = <function mean_squared_error>, coverage: ~typing.Optional[float] = None, null_model_params: ~typing.Optional[~typing.Dict] = None, time_properties: ~typing.Optional[~typing.Dict] = None, freq: ~typing.Optional[str] = None, forecast_horizon: ~typing.Optional[int] = None, origin_for_time_vars: ~typing.Optional[float] = None, train_test_thresh: ~typing.Optional[~datetime.datetime] = None, training_fraction: ~typing.Optional[float] = None, fit_algorithm_dict: ~typing.Optional[~typing.Dict] = None, auto_holiday: bool = False, holidays_to_model_separately: ~typing.Optional[~typing.Union[str, ~typing.List[str]]] = 'auto', holiday_lookup_countries: ~typing.Optional[~typing.Union[str, ~typing.List[str]]] = 'auto', holiday_pre_num_days: int = 2, holiday_post_num_days: int = 2, holiday_pre_post_num_dict: ~typing.Optional[~typing.Dict] = None, daily_event_df_dict: ~typing.Optional[~typing.Dict] = None, daily_event_neighbor_impact: ~typing.Optional[~typing.Union[int, ~typing.List[int], callable]] = None, daily_event_shifted_effect: ~typing.Optional[~typing.List[str]] = None, auto_growth: bool = False, changepoints_dict: ~typing.Optional[~typing.Dict] = None, auto_seasonality: bool = False, yearly_seasonality: ~typing.Union[bool, str, int] = 'auto', quarterly_seasonality: ~typing.Union[bool, str, int] = 'auto', monthly_seasonality: ~typing.Union[bool, str, int] = 'auto', weekly_seasonality: ~typing.Union[bool, str, int] = 'auto', daily_seasonality: ~typing.Union[bool, str, int] = 'auto', max_daily_seas_interaction_order: ~typing.Optional[int] = None, max_weekly_seas_interaction_order: ~typing.Optional[int] = None, autoreg_dict: ~typing.Optional[~typing.Dict] = None, past_df: 
~typing.Optional[~pandas.core.frame.DataFrame] = None, lagged_regressor_dict: ~typing.Optional[~typing.Dict] = None, seasonality_changepoints_dict: ~typing.Optional[~typing.Dict] = None, min_admissible_value: ~typing.Optional[float] = None, max_admissible_value: ~typing.Optional[float] = None, uncertainty_dict: ~typing.Optional[~typing.Dict] = None, normalize_method: ~typing.Optional[str] = None, growth_term: ~typing.Optional[str] = 'linear', regressor_cols: ~typing.Optional[~typing.List[str]] = None, feature_sets_enabled: ~typing.Optional[~typing.Union[bool, ~typing.Dict[str, bool]]] = None, extra_pred_cols: ~typing.Optional[~typing.List[str]] = None, drop_pred_cols: ~typing.Optional[~typing.List[str]] = None, explicit_pred_cols: ~typing.Optional[~typing.List[str]] = None, regression_weight_col: ~typing.Optional[str] = None, simulation_based: ~typing.Optional[bool] = False, simulation_num: int = 10, fast_simulation: bool = False, remove_intercept: bool = False)[source]
Wrapper for forecast_simple_silverkite.
- Parameters
score_func (callable, optional, default mean_squared_error) – See
BaseForecastEstimator
.
coverage (float between [0.0, 1.0] or None, optional) – See BaseForecastEstimator.
null_model_params (dict or None, optional) – Dictionary with arguments to define DummyRegressor null model, default is None. See BaseForecastEstimator.
fit_algorithm_dict (dict or None, optional) –
How to fit the model. A dictionary with the following optional keys.
"fit_algorithm" : str, optional, default “ridge”
The type of predictive model used in fitting. See fit_model_via_design_matrix for available options and their parameters.
"fit_algorithm_params" : dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
uncertainty_dict (dict or str or None, optional) – How to fit the uncertainty model. See forecast. Note that this is allowed to be “auto”. If None or “auto”, will be set to a default value by coverage before calling forecast_silverkite. See BaseForecastEstimator for details.
kwargs (additional parameters) –
Other parameters are the same as in forecast_simple_silverkite.
See source code __init__ for the parameter names, and refer to forecast_simple_silverkite for their description.
If this Estimator is called from forecast_pipeline, train_test_thresh and training_fraction should almost always be None, because train/test is handled outside this Estimator.
Notes
Attributes match those of
BaseSilverkiteEstimator
.
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Fits
Silverkite
forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – Additional parameters for null model.
- Returns
self – Fitted model is stored in self.model_dict.
- Return type
self
- finish_fit()
Makes important values of self.model_dict conveniently accessible.
To be called by subclasses at the end of their fit method. Sets {pred_cols, feature_cols, and coef_}.
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:
- “uncertainty_method”: a string that is in UncertaintyMethodEnum.
- “params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method) – These parameters are from the estimator attributes, not given by the user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- forecast_breakdown(grouping_regex_patterns_dict, forecast_x_mat=None, time_values=None, center_components=False, denominator=None, plt_title='breakdown of forecasts')
Generates silverkite forecast breakdown for groupings given in
grouping_regex_patterns_dict
. Note that this only works for additive regression models and not for models such as random forest.
- Parameters
grouping_regex_patterns_dict (dict {str: str}) – A dictionary with group names as keys and regexes as values. This dictionary is used to partition the columns into various groups.
forecast_x_mat (pd.DataFrame, default None) – The dataframe of design matrix of regression model. If None, this will be extracted from the estimator.
time_values (list or np.array, default None) – A collection of values (usually timestamps) to be used in the figure. It can also be used to join breakdown data with other data when needed. If None, and forecast_x_mat is not passed, timestamps will be extracted from the estimator to match the forecast_x_mat which is also extracted from the estimator. If None, and forecast_x_mat is passed, the timestamps cannot be inferred. Therefore we simply create an integer index with size of forecast_x_mat.
center_components (bool, default False) – Determines whether components are centered at their mean, with the mean added to the intercept. More concretely, if a component is “x” then it is mapped to “x - mean(x)”, and “mean(x)” is added to the intercept so that the sum of the components remains the same.
denominator (str, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
The absolute value of the observed mean of the response.
- “y_std” : float
The standard deviation of the observed response.
plt_title (str, default “breakdown of forecasts”) – The title of the generated plot.
- Returns
result – Dictionary returned by
breakdown_regression_based_prediction
- Return type
dict
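The effect of `center_components` can be checked numerically. The helper below is a hypothetical sketch, not the library's `breakdown_regression_based_prediction`: plain lists stand in for component columns. It maps each component x to x - mean(x), moves mean(x) into the intercept, and leaves the per-row total unchanged.

```python
def center_breakdown(intercept, components):
    """Center each component at its mean and move the means into the intercept."""
    centered = {}
    new_intercept = intercept
    for name, values in components.items():
        mean = sum(values) / len(values)
        centered[name] = [value - mean for value in values]
        new_intercept += mean
    return new_intercept, centered


def row_totals(intercept, components):
    """Sum intercept and all components for every row."""
    n_rows = len(next(iter(components.values())))
    return [intercept + sum(values[i] for values in components.values()) for i in range(n_rows)]


# Hypothetical component values for three timestamps.
components = {"trend": [1.0, 2.0, 3.0], "seasonality": [0.5, -0.5, 1.5]}
intercept = 10.0

centered_intercept, centered_components = center_breakdown(intercept, components)
```

Centering makes components comparable in a plot, since each then fluctuates around zero while the forecast itself is unchanged.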
- get_max_ar_order()
Gets the maximum autoregression order.
- Returns
max_ar_order – The maximum autoregression order.
- Return type
int
- get_params(deep=True)
Get parameters for this estimator.
- plot_components(grouping_regex_patterns_dict=None, center_components=True, denominator=None, predict_phase=False, title=None)
Class method to plot the components of a
Silverkite
model on datasets passed to either fit or predict.
- Parameters
grouping_regex_patterns_dict (dict, optional, default None) – If None, it is set to DEFAULT_COMPONENTS_REGEX_DICT. An alternative dictionary is available that provides a more detailed breakdown of seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.). See DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT.
center_components (bool, optional, default True) – Determines whether components are centered at their mean, with the mean added to the intercept. More concretely, if a component is “x” then it is mapped to “x - mean(x)”, and “mean(x)” is added to the intercept so that the sum of the components remains the same. See forecast_breakdown.
denominator (str, optional, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
- “abs_y_mean” : float
The absolute value of the observed mean of the response.
- “y_std” : float
The standard deviation of the observed response.
See forecast_breakdown.
predict_phase (bool, optional, default False) – If False, plots the components of the training data and shows three plots: 1) Component Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. If set to True, plots the component breakdown of the predicted values. When set to True, it only plots one plot, the component plot, as there are no change points or residuals in this time frame.
title (str, optional, default None) – Title of the plot.
- Returns
fig – Figure plotting components against appropriate time scale. Plot layout includes:
- Plot 1, “Component Plot”: breakdown from forecast_breakdown
- Plot 2, “Trend + Change Points”
- Plot 3, “Residuals + Smoothed Residuals”; smoothing done using exponentially weighted moving average
- Return type
- plot_trend_changepoint_detection(params=None)
Convenience function to plot the original trend changepoint detection results.
- Parameters
params (dict or None, default None) –
The parameters in plot. If set to None, all components will be plotted.
Note: plotting seasonality components is not currently supported. The plot parameter must be False.
- Returns
fig – Figure.
- Return type
- property pred_category
A dictionary that stores the predictor names in each category.
This property is not initialized until used. This speeds up the fitting process. The categories include:
- “intercept” : the intercept.
- “time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.
- “event_features” : the predictors that include EVENT_PREFIX.
- “trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.
- “seasonality_features” : the predictors that include SEASONALITY_REGEX.
- “lag_features” : the predictors that include LAG_REGEX.
- “regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.
- “interaction_features” : the predictors that include interaction terms, i.e., including a colon.
Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.
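The treatment of interaction terms can be sketched with a simplified categorizer. Everything below is hypothetical: the patterns are invented stand-ins for TREND_REGEX, SEASONALITY_REGEX and the other constants, the predictor names are examples, and the "but not SEASONALITY_REGEX" exclusions and the intercept, time and regressor categories are left out for brevity.

```python
import re

# Invented patterns for illustration; the library's regex constants may differ.
CATEGORY_PATTERNS = {
    "trend_features": r"^ct\d|changepoint",
    "seasonality_features": r"^(sin|cos)\d+_",
    "event_features": r"^events_",
    "lag_features": r"_lag\d+",
}


def categorize(predictors):
    """Assign each predictor to every category matched by any of its parts."""
    categories = {name: [] for name in CATEGORY_PATTERNS}
    categories["interaction_features"] = []
    for predictor in predictors:
        parts = predictor.split(":")  # a colon marks an interaction term
        if len(parts) > 1:
            categories["interaction_features"].append(predictor)
        for name, pattern in CATEGORY_PATTERNS.items():
            if any(re.search(pattern, part) for part in parts):
                categories[name].append(predictor)
    return categories


pred_category = categorize(["ct1", "sin1_tow_weekly", "ct1:sin1_tow_weekly"])
```

The interaction term ends up in the trend, seasonality and interaction lists at once, which is why the categories are not mutually exclusive.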
- predict(X, y=None)
Creates forecast for the dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –
- Returns
predictions –
Forecasted values for the dates in X. Columns:
- TIME_COL : dates
- PREDICTED_COL : predictions
- PREDICTED_LOWER_COL : lower bound of predictions, optional
- PREDICTED_UPPER_COL : upper bound of predictions, optional
- [other columns], optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.
- Return type
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
- score(X, y, sample_weight=None)
Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – Ignored.
- Returns
score – Comparison of predictions against null predictions, according to specified score function
- Return type
float or None
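A null-model-relative score can be sketched as follows. This assumes the relative score takes the form 1 - loss(model) / loss(null model) for a loss-type `score_func`; the exact definition used by the library is not given in this section. Both function names are local to the sketch and the data are made up.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python MSE, standing in for the estimator's score_func."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_score(y_true, y_pred, y_pred_null, loss_func=mean_squared_error):
    """Assumed form: improvement of the model's loss over the null model's loss."""
    return 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]       # model predictions
y_pred_null = [2.0, 2.0, 2.0]  # e.g. a constant (mean) null model

score = relative_score(y_true, y_pred, y_pred_null)
```

Under this assumed form, a value of 1 means a perfect fit, 0 means no better than the null model, and negative values mean worse than the null model.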
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- summary(max_colwidth=20)
Creates the model summary for the given model
- Parameters
max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the maximum length of predictors exceeds this parameter, all predictor names will be suppressed and only indices are shown.
- Returns
model_summary – The model summary for this model. See
ModelSummary
- Return type
ModelSummary
- class greykite.sklearn.estimator.silverkite_estimator.SilverkiteEstimator(silverkite: ~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast = <greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast object>, score_func=<function mean_squared_error>, coverage=None, null_model_params=None, freq=None, origin_for_time_vars=None, extra_pred_cols=None, drop_pred_cols=None, explicit_pred_cols=None, train_test_thresh=None, training_fraction=None, fit_algorithm_dict=None, daily_event_df_dict=None, daily_event_neighbor_impact=None, daily_event_shifted_effect=None, fs_components_df= name period order seas_names 0 tod 24.0 3 daily 1 tow 7.0 3 weekly 2 conti_year 1.0 5 yearly, autoreg_dict=None, past_df=None, lagged_regressor_dict=None, changepoints_dict=None, seasonality_changepoints_dict=None, changepoint_detector=None, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method=None, adjust_anomalous_dict=None, impute_dict=None, regression_weight_col=None, forecast_horizon=None, simulation_based=False, simulation_num=10, fast_simulation=False, remove_intercept=False)[source]
Wrapper for
forecast
.
- Parameters
score_func (callable, optional, default mean_squared_error) – See BaseForecastEstimator.
coverage (float between [0.0, 1.0] or None, optional) – See BaseForecastEstimator.
null_model_params (dict or None, optional) – Dictionary with arguments to define DummyRegressor null model, default is None. See BaseForecastEstimator.
fit_algorithm_dict (dict or None, optional) –
How to fit the model. A dictionary with the following optional keys.
"fit_algorithm" : str, optional, default “linear”
The type of predictive model used in fitting. See fit_model_via_design_matrix for available options and their parameters.
"fit_algorithm_params" : dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
uncertainty_dict (dict or str or None, optional) – How to fit the uncertainty model. See forecast. Note that this is allowed to be “auto”. If None or “auto”, will be set to a default value by coverage before calling forecast_silverkite. See BaseForecastEstimator for details.
fs_components_df (pandas.DataFrame or None, optional) –
A dataframe with information about fourier series generation. If provided, it must contain columns with following names:
- “name”: name of the timeseries feature (e.g. tod, tow etc.).
- “period”: Period of the fourier series.
- “order”: Order of the fourier series.
- “seas_names”: Label for the type of seasonality (e.g. daily, weekly etc.) and should be unique. validate_fs_components_df checks for it, so that component plots don’t have duplicate y-axis labels.
This differs from the expected input of forecast_silverkite where “period”, “order” and “seas_names” are optional. This restriction is to facilitate appropriate computation of component (e.g. trend, seasonalities and holidays) effects. See Notes section in this docstring for a more detailed explanation with examples.
Other parameters are the same as in forecast.
If this Estimator is called from forecast_pipeline, train_test_thresh and training_fraction should almost always be None, because train/test is handled outside this Estimator.
The attributes are the same as BaseSilverkiteEstimator.
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Fits Silverkite forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.
- static validate_fs_components_df(fs_components_df)[source]
Validates a fourier series components dataframe. Called by SilverkiteEstimator to validate the input fs_components_df.
- Parameters
fs_components_df (pandas.DataFrame) –
A DataFrame with information about fourier series generation. Must contain columns with following names:
“name”: name of the timeseries feature (e.g. “tod”, “tow” etc.)
“period”: Period of the fourier series
“order”: Order of the fourier series
“seas_names”: seas_name corresponding to the name (e.g. “daily”, “weekly” etc.).
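The column requirements above can be illustrated with a small plain-Python sketch. It is not the library's implementation: `check_fs_components` and `REQUIRED_COLUMNS` are hypothetical names, and the rows are plain dicts standing in for the rows of a `pandas.DataFrame`. It checks only what this page states, namely that the four columns are present and that the "seas_names" labels are unique.

```python
# Hypothetical stand-in for validate_fs_components_df, using plain dicts
# instead of a pandas.DataFrame so the sketch has no dependencies.
REQUIRED_COLUMNS = ("name", "period", "order", "seas_names")


def check_fs_components(rows):
    """Raises ValueError on a missing column or a repeated "seas_names" label."""
    for i, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
    labels = [row["seas_names"] for row in rows]
    duplicates = sorted({label for label in labels if labels.count(label) > 1})
    if duplicates:
        raise ValueError(f"duplicate seas_names: {duplicates}")
    return rows


# Rows matching the default fs_components_df shown in the class signature.
default_rows = [
    {"name": "tod", "period": 24.0, "order": 3, "seas_names": "daily"},
    {"name": "tow", "period": 7.0, "order": 3, "seas_names": "weekly"},
    {"name": "conti_year", "period": 1.0, "order": 5, "seas_names": "yearly"},
]
check_fs_components(default_rows)
```

Adding a second row labelled "weekly" would raise, which matches the stated purpose of keeping y-axis labels distinct in component plots.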
- finish_fit()
Makes important values of self.model_dict conveniently accessible.
To be called by subclasses at the end of their fit method. Sets {pred_cols, feature_cols, and coef_}.
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:
“uncertainty_method”: a string that is in UncertaintyMethodEnum.
“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- forecast_breakdown(grouping_regex_patterns_dict, forecast_x_mat=None, time_values=None, center_components=False, denominator=None, plt_title='breakdown of forecasts')
Generates silverkite forecast breakdown for groupings given in grouping_regex_patterns_dict. Note that this only works for additive regression models and not for models such as random forest.
- Parameters
grouping_regex_patterns_dict (dict {str: str}) – A dictionary with group names as keys and regexes as values. This dictionary is used to partition the columns into various groups.
forecast_x_mat (pd.DataFrame, default None) – The dataframe of design matrix of regression model. If None, this will be extracted from the estimator.
time_values (list or np.array, default None) – A collection of values (usually timestamps) to be used in the figure. It can also be used to join breakdown data with other data when needed. If None, and forecast_x_mat is not passed, timestamps will be extracted from the estimator to match the forecast_x_mat which is also extracted from the estimator. If None, and forecast_x_mat is passed, the timestamps cannot be inferred. Therefore we simply create an integer index with size of forecast_x_mat.
center_components (bool, default False) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same.
denominator (str, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
“abs_y_mean” (float): The absolute value of the observed mean of the response
“y_std” (float): The standard deviation of the observed response
plt_title (str, default “breakdown of forecasts”) – The title of generated plot
- Returns
result – Dictionary returned by breakdown_regression_based_prediction
- Return type
dict
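Two ideas from the parameters above, partitioning columns by regex and centering components, can be sketched in plain Python. This is an illustration only, not the code behind `breakdown_regression_based_prediction`: `group_columns` and `center` are hypothetical helpers, the column names and patterns are made up, and assigning each column to the first matching group is an assumption about how the partition is resolved.

```python
import re
from statistics import mean


def group_columns(columns, grouping_regex_patterns_dict):
    """Assigns each column to the first group whose regex matches it (assumed rule)."""
    groups = {name: [] for name in grouping_regex_patterns_dict}
    for col in columns:
        for name, pattern in grouping_regex_patterns_dict.items():
            if re.search(pattern, col):
                groups[name].append(col)
                break
    return groups


def center(intercept, components):
    """Maps each component x to x - mean(x) and adds mean(x) to the intercept."""
    centered = {}
    for name, values in components.items():
        m = mean(values)
        centered[name] = [v - m for v in values]
        intercept += m
    return intercept, centered


# Hypothetical design-matrix column names and grouping patterns.
columns = ["ct1", "sin1_tow_weekly", "cos1_tow_weekly", "sin1_tod_daily"]
patterns = {"trend": r"^ct", "seasonality": r"sin|cos"}
groups = group_columns(columns, patterns)

# Per-timestamp contribution of each group, plus a scalar intercept.
components = {"trend": [1.0, 2.0, 3.0], "seasonality": [0.5, -0.5, 0.0]}
new_intercept, centered = center(10.0, components)
```

Centering moves each group's mean into the intercept, so the total at every timestamp is unchanged while each component fluctuates around zero.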
- get_max_ar_order()
Gets the maximum autoregression order.
- Returns
max_ar_order – The maximum autoregression order.
- Return type
int
- get_params(deep=True)
Get parameters for this estimator.
- plot_components(grouping_regex_patterns_dict=None, center_components=True, denominator=None, predict_phase=False, title=None)
Class method to plot the components of a Silverkite model on datasets passed to either fit or predict.
- Parameters
grouping_regex_patterns_dict (dict, optional, default None) – If None, it is set to DEFAULT_COMPONENTS_REGEX_DICT. An alternative dictionary is available that provides a more detailed breakdown of seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.). See: DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT.
center_components (bool, optional, default True) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same. See forecast_breakdown.
denominator (str, optional, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
“abs_y_mean” (float): The absolute value of the observed mean of the response
“y_std” (float): The standard deviation of the observed response
See forecast_breakdown.
predict_phase (bool, optional, default False) – If False, plots the components of the training data and shows three plots: 1) Component Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. If set to True, plots the component breakdown of the predicted values. When set to True, it only plots one plot, the component plot, as there are no change points or residuals in this time frame.
title (str, optional, default None) – Title of the plot.
- Returns
fig – Figure plotting components against appropriate time scale. Plot layout includes:
Plot 1, “Component Plot”: breakdown from forecast_breakdown
Plot 2, “Trend + Change Points”
Plot 3, “Residuals + Smoothed Residuals”; smoothing done using exponentially weighted moving average
- Return type
- plot_trend_changepoint_detection(params=None)
Convenience function to plot the original trend changepoint detection results.
- Parameters
params (dict or None, default None) –
The parameters in plot. If set to None, all components will be plotted.
Note: seasonality components plotting is not supported currently. plot parameter must be False.
- Returns
fig – Figure.
- Return type
- property pred_category
A dictionary that stores the predictor names in each category.
This property is not initialized until used. This speeds up the fitting process. The categories include:
“intercept” : the intercept.
“time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.
“event_features” : the predictors that include EVENT_PREFIX.
“trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.
“seasonality_features” : the predictors that include SEASONALITY_REGEX.
“lag_features” : the predictors that include LAG_REGEX.
“regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.
“interaction_features” : the predictors that include interaction terms, i.e., including a colon.
Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.
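The rule for interaction terms can be shown with a simplified sketch. Everything here is an assumption made for illustration: the patterns in `CATEGORY_PATTERNS` are hypothetical and are not the library's `TREND_REGEX`, `SEASONALITY_REGEX` or `LAG_REGEX`, only three of the categories are modelled, and `categorize` is not a greykite function.

```python
import re

# Hypothetical patterns for illustration; the library's constants may differ.
CATEGORY_PATTERNS = {
    "trend_features": r"^ct\d",
    "seasonality_features": r"^(sin|cos)\d",
    "lag_features": r"_lag\d",
}


def categorize(pred_cols):
    """Places each predictor in every category matched by any of its terms."""
    categories = {name: [] for name in CATEGORY_PATTERNS}
    categories["regressor_features"] = []
    categories["interaction_features"] = []
    for col in pred_cols:
        terms = col.split(":")  # an interaction joins its terms with a colon
        matched = [
            name
            for name, pattern in CATEGORY_PATTERNS.items()
            if any(re.search(pattern, term) for term in terms)
        ]
        # Predictors matching nothing fall back to "regressor_features".
        for name in matched or ["regressor_features"]:
            categories[name].append(col)
        if len(terms) > 1:
            categories["interaction_features"].append(col)
    return categories


result = categorize(
    ["ct1", "sin1_tow_weekly", "ct1:sin1_tow_weekly", "y_lag1", "temperature"]
)
```

In this sketch the interaction `ct1:sin1_tow_weekly` lands in the trend, seasonality and interaction lists at once, which is the overlap the note above describes.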
- predict(X, y=None)
Creates forecast for the dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –
- Returns
predictions –
Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions, optional
PREDICTED_UPPER_COL: upper bound of predictions, optional
[other columns], optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.
- Return type
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
- score(X, y, sample_weight=None)
Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – ignored
- Returns
score – Comparison of predictions against null predictions, according to specified score function
- Return type
float or None
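The two scoring branches in the Notes can be sketched as follows. The formula used for the null-model branch, `1 - score_func(model) / score_func(null)`, is an assumption about how R2_null_model_score is defined and is not confirmed by this page. The `score` function, its `y_pred_null` argument, and the local `mean_squared_error` helper are hypothetical, written in plain Python to stay dependency-free.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python MSE so the sketch needs no third-party imports."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score(y_true, y_pred, y_pred_null=None, score_func=mean_squared_error):
    """Returns score_func of the model, or its improvement over a null model."""
    model_score = score_func(y_true, y_pred)
    if y_pred_null is None:
        # Corresponds to null_model_params is None: score of the model itself.
        return model_score
    null_score = score_func(y_true, y_pred_null)
    if null_score == 0:
        # Relative improvement is undefined; mirrors the "float or None" return type.
        return None
    # Assumed definition: 1.0 means a perfect model, 0.0 means no better than null.
    return 1.0 - model_score / null_score


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
y_null = [2.5, 2.5, 2.5, 2.5]  # e.g. a null model that predicts the mean

plain = score(y_true, y_pred)
relative = score(y_true, y_pred, y_pred_null=y_null)
```

Under this assumed definition, a higher value is better when a null model is configured, whereas the raw `score_func` value (an error for MSE) is returned otherwise.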
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
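The `<component>__<parameter>` naming can be illustrated with a short sketch that separates top-level keys from nested ones. `split_nested_params` is a hypothetical helper, not part of scikit-learn or greykite, and the step name `estimator` in the example is an assumption.

```python
def split_nested_params(params):
    """Groups flat ``<component>__<parameter>`` keys by component."""
    top_level, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only; deeper levels stay in sub_key.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            top_level[key] = value
    return top_level, nested


top_level, nested = split_nested_params(
    {
        "coverage": 0.95,
        "estimator__forecast_horizon": 7,  # "estimator" is a hypothetical step name
        "estimator__simulation_based": False,
    }
)
```

Only the first `__` is split here, so a key such as `a__b__c` is handed to component `a` as `b__c`, leaving any deeper level for that component to resolve.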
- summary(max_colwidth=20)
Creates the model summary for the given model
- Parameters
max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the maximum length of predictors exceeds this parameter, all predictor names will be suppressed and only indices are shown.
- Returns
model_summary – The model summary for this model. See
ModelSummary
- Return type
ModelSummary
- class greykite.sklearn.estimator.base_silverkite_estimator.BaseSilverkiteEstimator(silverkite: ~greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast = <greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast object>, score_func: callable = <function mean_squared_error>, coverage: ~typing.Optional[float] = None, null_model_params: ~typing.Optional[~typing.Dict] = None, uncertainty_dict: ~typing.Optional[~typing.Dict] = None)[source]
A base class for forecast estimators that fit using forecast.
Notes
Allows estimators that fit using forecast to share the same functions for input data validation, fit postprocessing, predict, summary, plot_components, etc.
Subclasses should:
Implement their own __init__ that uses a superset of the parameters here.
Implement their own fit, with this sequence of steps:
calls super().fit
calls SilverkiteForecast.forecast or SimpleSilverkiteForecast.forecast_simple and stores the result in self.model_dict
calls super().finish_fit
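The three-step fit sequence can be sketched with stand-in classes. Nothing below imports greykite: `BaseSketch`, `MySketchEstimator` and `fake_forecast` are hypothetical names that only mimic the call order described above, and the contents of `model_dict` are invented for the example.

```python
def fake_forecast(X, extra_pred_cols):
    """Hypothetical stand-in for SilverkiteForecast.forecast; returns a model dict."""
    return {"pred_cols": list(extra_pred_cols or []), "n_rows": len(X)}


class BaseSketch:
    """Hypothetical stand-in for BaseSilverkiteEstimator."""

    def __init__(self, coverage=None):
        self.coverage = coverage
        self.model_dict = None
        self.pred_cols = None

    def fit(self, X, time_col="ts", value_col="y"):
        # Shared pre-processing (the real class also validates the input here).
        self.time_col_ = time_col
        self.value_col_ = value_col

    def finish_fit(self):
        # Makes important values of model_dict conveniently accessible.
        self.pred_cols = self.model_dict["pred_cols"]


class MySketchEstimator(BaseSketch):
    def __init__(self, coverage=None, extra_pred_cols=None):
        # __init__ takes a superset of the base class parameters.
        super().__init__(coverage=coverage)
        self.extra_pred_cols = extra_pred_cols

    def fit(self, X, time_col="ts", value_col="y"):
        super().fit(X, time_col=time_col, value_col=value_col)  # step 1
        self.model_dict = fake_forecast(X, self.extra_pred_cols)  # step 2
        super().finish_fit()  # step 3
        return self


est = MySketchEstimator(coverage=0.95, extra_pred_cols=["ct1"])
est.fit([{"ts": "2024-01-01", "y": 1.0}])
```

Keeping the shared work in `super().fit` and `super().finish_fit` means a subclass only has to decide which forecast function to call and with which arguments.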
Uses coverage to set prediction band width. Even though coverage is not needed by forecast_silverkite, it is included in every BaseForecastEstimator to be used universally for forecast evaluation.
Therefore, uncertainty_dict must be consistent with coverage if provided as a dictionary. If uncertainty_dict is None or “auto”, an appropriate default value is set, according to coverage.
- Parameters
score_func (callable, optional, default mean_squared_error) – See BaseForecastEstimator.
coverage (float between [0.0, 1.0] or None, optional) – See BaseForecastEstimator.
null_model_params (dict, optional) – Dictionary with arguments to define DummyRegressor null model, default is None. See BaseForecastEstimator.
uncertainty_dict (dict or str or None, optional) – How to fit the uncertainty model. See forecast. Note that this is allowed to be “auto”. If None or “auto”, will be set to a default value by coverage before calling forecast_silverkite.
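One plausible reading of "consistent with coverage" is that the quantiles in an uncertainty specification span exactly the requested coverage. The sketch below illustrates that reading under stated assumptions: a symmetric two-sided interval, a coverage strictly between 0 and 1, and hypothetical helper names. It is not the library's default-setting logic.

```python
import math


def quantiles_from_coverage(coverage):
    """Returns [lower, upper] quantiles of a symmetric two-sided interval."""
    # Assumption: the degenerate coverages 0.0 and 1.0 are rejected in this sketch.
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0.0 and 1.0")
    lower = (1.0 - coverage) / 2.0
    return [lower, 1.0 - lower]


def is_consistent(coverage, quantiles):
    """Checks that the two quantiles span exactly ``coverage``."""
    return math.isclose(quantiles[1] - quantiles[0], coverage)


quantiles = quantiles_from_coverage(0.95)
```

With this reading, quantiles derived for a 95% band would be flagged as inconsistent if the estimator were then configured with `coverage=0.8`.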
- silverkite
The silverkite algorithm instance used for forecasting
- Type
Class or a derived class of
SilverkiteForecast
- pred_cols
Names of the features used in the model.
- Type
list [str] or None
- feature_cols
Column names of the patsy design matrix built by design_mat_from_formula.
- Type
list [str] or None
- df
The training data used to fit the model.
- Type
pandas.DataFrame or None
- coef_
Estimated coefficient matrix for the model. Not available for random forest and gradient boosting methods and set to the default value None.
- Type
pandas.DataFrame or None
- _pred_category
A dictionary with keys being the predictor category and values being the predictors belonging to the category. For details, see pred_category.
- Type
dict or None
- extra_pred_cols
User provided extra predictor names, for details, see SimpleSilverkiteEstimator or SilverkiteEstimator.
- Type
list or None
- past_df
The extra past data before training data used to generate autoregression terms.
- Type
pandas.DataFrame or None
- forecast
Output of predict_silverkite, set by self.predict.
- Type
pandas.DataFrame or None
- forecast_x_mat
The design matrix of the model at the predict time.
- Type
pandas.DataFrame or None
- model_summary
The ModelSummary class.
- Type
class or None
See also
None
Function performing the fit and predict.
Notes
The subclasses will pass fs_components_df to forecast_silverkite. The model terms it creates internally are used to generate the component plots.
fourier_series_multi_fcn uses fs_components_df["name"] (e.g. tod, tow) to build the fourier series and to create column names.
fs_components_df["seas_names"] (e.g. daily, weekly) is appended to the column names, if provided.
plot_components relies on a regular expression dictionary to group components together. There are two available in the library; see constants for the two definitions:
“DEFAULT_COMPONENTS_REGEX_DICT”: Grouped seasonality that is the default
“DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT”: A detailed seasonality breakdown where the user can view daily/weekly/monthly/quarterly/yearly seasonality
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Pre-processing before fitting Silverkite forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X).
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – additional parameters for null model.
Notes
Subclasses are expected to call this at the beginning of their fit method, before calling forecast.
- finish_fit()[source]
Makes important values of self.model_dict conveniently accessible.
To be called by subclasses at the end of their fit method. Sets {pred_cols, feature_cols, and coef_}.
- predict(X, y=None)[source]
Creates forecast for the dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored.) –
- Returns
predictions –
Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions, optional
PREDICTED_UPPER_COL: upper bound of predictions, optional
[other columns], optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.
- Return type
- forecast_breakdown(grouping_regex_patterns_dict, forecast_x_mat=None, time_values=None, center_components=False, denominator=None, plt_title='breakdown of forecasts')[source]
Generates silverkite forecast breakdown for groupings given in grouping_regex_patterns_dict. Note that this only works for additive regression models and not for models such as random forest.
- Parameters
grouping_regex_patterns_dict (dict {str: str}) – A dictionary with group names as keys and regexes as values. This dictionary is used to partition the columns into various groups.
forecast_x_mat (pd.DataFrame, default None) – The dataframe of design matrix of regression model. If None, this will be extracted from the estimator.
time_values (list or np.array, default None) – A collection of values (usually timestamps) to be used in the figure. It can also be used to join breakdown data with other data when needed. If None, and forecast_x_mat is not passed, timestamps will be extracted from the estimator to match the forecast_x_mat which is also extracted from the estimator. If None, and forecast_x_mat is passed, the timestamps cannot be inferred. Therefore we simply create an integer index with size of forecast_x_mat.
center_components (bool, default False) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same.
denominator (str, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
“abs_y_mean” (float): The absolute value of the observed mean of the response
“y_std” (float): The standard deviation of the observed response
plt_title (str, default “breakdown of forecasts”) – The title of generated plot
- Returns
result – Dictionary returned by breakdown_regression_based_prediction
- Return type
dict
- property pred_category
A dictionary that stores the predictor names in each category.
This property is not initialized until used. This speeds up the fitting process. The categories include:
“intercept” : the intercept.
“time_features” : the predictors that include TimeFeaturesEnum but not SEASONALITY_REGEX.
“event_features” : the predictors that include EVENT_PREFIX.
“trend_features” : the predictors that include TREND_REGEX but not SEASONALITY_REGEX.
“seasonality_features” : the predictors that include SEASONALITY_REGEX.
“lag_features” : the predictors that include LAG_REGEX.
“regressor_features” : external regressors and other predictors manually passed to extra_pred_cols, but not in the categories above.
“interaction_features” : the predictors that include interaction terms, i.e., including a colon.
Note that each predictor falls into at least one category. Some “time_features” may also be “trend_features”. Predictors with an interaction are classified into all categories matched by the interaction components. Thus, “interaction_features” are already included in the other categories.
- get_max_ar_order()[source]
Gets the maximum autoregression order.
- Returns
max_ar_order – The maximum autoregression order.
- Return type
int
- summary(max_colwidth=20)[source]
Creates the model summary for the given model
- Parameters
max_colwidth (int) – The maximum length for predictors to be shown in their original name. If the maximum length of predictors exceeds this parameter, all predictor names will be suppressed and only indices are shown.
- Returns
model_summary – The model summary for this model. See
ModelSummary
- Return type
ModelSummary
- plot_components(grouping_regex_patterns_dict=None, center_components=True, denominator=None, predict_phase=False, title=None)[source]
Class method to plot the components of a Silverkite model on datasets passed to either fit or predict.
- Parameters
grouping_regex_patterns_dict (dict, optional, default None) – If None, it is set to DEFAULT_COMPONENTS_REGEX_DICT. An alternative dictionary is available that provides a more detailed breakdown of seasonality components (e.g., weekly, monthly, quarterly, yearly, etc.). See: DETAILED_SEASONALITY_COMPONENTS_REGEX_DICT.
center_components (bool, optional, default True) – It determines if components should be centered at their mean and the mean be added to the intercept. More concretely if a component is “x” then it will be mapped to “x - mean(x)”; and “mean(x)” will be added to the intercept so that the sum of the components remains the same. See forecast_breakdown.
denominator (str, optional, default None) –
If not None, it will specify a way to divide the components. There are two options implemented:
“abs_y_mean” (float): The absolute value of the observed mean of the response
“y_std” (float): The standard deviation of the observed response
See forecast_breakdown.
predict_phase (bool, optional, default False) – If False, plots the components of the training data and shows three plots: 1) Component Plot, 2) Trend Plot + Change points, and 3) Residuals + Smoothed Residuals. If set to True, plots the component breakdown of the predicted values. When set to True, it only plots one plot, the component plot, as there are no change points or residuals in this time frame.
title (str, optional, default None) – Title of the plot.
- Returns
fig – Figure plotting components against appropriate time scale. Plot layout includes:
Plot 1, “Component Plot”: breakdown from forecast_breakdown
Plot 2, “Trend + Change Points”
Plot 3, “Residuals + Smoothed Residuals”; smoothing done using exponentially weighted moving average
- Return type
- plot_trend_changepoint_detection(params=None)[source]
Convenience function to plot the original trend changepoint detection results.
- Parameters
params (dict or None, default None) –
The parameters in plot. If set to None, all components will be plotted.
Note: seasonality components plotting is not supported currently. plot parameter must be False.
- Returns
fig – Figure.
- Return type
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) –
The uncertainty model specification. It should have the following keys:
“uncertainty_method”: a string that is in UncertaintyMethodEnum.
“params”: a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- get_params(deep=True)
Get parameters for this estimator.
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
- score(X, y, sample_weight=None)
Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – ignored
- Returns
score – Comparison of predictions against null predictions, according to specified score function
- Return type
float or None
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)[source]
Defines generic simple silverkite template options.
Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.
freq represents data frequency.
The other attributes stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm and autoregression in ModelComponentsParam, which are used in SimpleSilverkiteTemplate.
- freq: SILVERKITE_FREQ = 'DAILY'
Valid values for simple silverkite template string name frequency. See SILVERKITE_FREQ.
- seas: SILVERKITE_SEAS = 'LT'
Valid values for simple silverkite template string name seasonality. See SILVERKITE_SEAS.
- gr: SILVERKITE_GR = 'LINEAR'
Valid values for simple silverkite template string name growth. See SILVERKITE_GR.
- cp: SILVERKITE_CP = 'NONE'
Valid values for simple silverkite template string name changepoints. See SILVERKITE_CP.
- hol: SILVERKITE_HOL = 'NONE'
Valid values for simple silverkite template string name holiday. See SILVERKITE_HOL.
- feaset: SILVERKITE_FEASET = 'OFF'
Valid values for simple silverkite template string name feature sets enabled. See SILVERKITE_FEASET.
- algo: SILVERKITE_ALGO = 'LINEAR'
Valid values for simple silverkite template string name fit algorithm. See SILVERKITE_ALGO.
- ar: SILVERKITE_AR = 'OFF'
Valid values for simple silverkite template string name autoregression. See SILVERKITE_AR.
- dsi: SILVERKITE_DSI = 'AUTO'
Valid values for simple silverkite template string name max daily seasonality interaction order. See SILVERKITE_DSI.
- wsi: SILVERKITE_WSI = 'AUTO'
Valid values for simple silverkite template string name max weekly seasonality interaction order. See SILVERKITE_WSI.
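The phrase "template string name" suggests that each option contributes one piece of a composite name. The sketch below illustrates that idea under a loudly stated assumption: that the name is the frequency followed by `KEYWORD_VALUE` pairs joined with underscores, in field order. `OptionsSketch` and `template_name` are hypothetical and use plain strings instead of the `SILVERKITE_*` enums; check the library for the actual naming rule.

```python
from dataclasses import dataclass, fields


@dataclass
class OptionsSketch:
    """Hypothetical mirror of the documented fields, with plain string values."""

    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"


def template_name(options):
    """Joins freq and KEYWORD_VALUE pairs with underscores (assumed format)."""
    parts = [options.freq]
    for field in fields(options):
        if field.name == "freq":
            continue
        parts.append(f"{field.name.upper()}_{getattr(options, field.name)}")
    return "_".join(parts)


default_name = template_name(OptionsSketch())
```

Overriding a single field, for example `OptionsSketch(freq="WEEKLY")`, changes only the corresponding piece of the composed name.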
- class greykite.framework.templates.silverkite_template.SilverkiteTemplate[source]
A template for SilverkiteEstimator.
Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.
Notes
The attributes of a ForecastConfig for SilverkiteEstimator are:
- computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.
- coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). Same as coverage in forecast_pipeline. You may tune how the uncertainty is computed via model_components.uncertainty[“uncertainty_dict”].
- evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.
- evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.
- forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency. Same as forecast_horizon in forecast_pipeline.
- metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.
- model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
See inline comments on which values accept lists for grid search.
- seasonality: dict [str, any] or None, optional
How to model the seasonality. A dictionary with keys corresponding to parameters in forecast.
Allowed keys: "fs_components_df".
- growth: dict [str, any] or None, optional
How to model the growth.
Allowed keys: None. (Use model_components.custom["extra_pred_cols"] to specify growth terms.)
- events: dict [str, any] or None, optional
How to model the holidays/events. A dictionary with keys corresponding to parameters in forecast.
Allowed keys: "daily_event_df_dict".
Note
Event names derived from daily_event_df_dict must be specified via model_components.custom["extra_pred_cols"] to be included in the model. This parameter has no effect on the model unless event names are passed to extra_pred_cols.
The function get_event_pred_cols can be used to extract all event names from daily_event_df_dict.
- changepoints: dict [str, any] or None, optional
How to model changes in trend and seasonality. A dictionary with keys corresponding to parameters in forecast.
Allowed keys: “changepoints_dict”, “seasonality_changepoints_dict”, “changepoint_detector”.
- autoregression: dict [str, any] or None, optional
Specifies the autoregression configuration. Dictionary with the following optional key:
"autoreg_dict": dict or str or None or a list of such values for grid search
If a dict: A dictionary with arguments for build_autoreg_df. That function’s parameter value_col is inferred from the input of current function self.forecast. Other keys are:
"lag_dict": dict or None
"agg_lag_dict": dict or None
"series_na_fill_func": callable
If a str: The string will represent a method and a dictionary will be constructed using that str. Currently only implemented method is “auto” which uses __get_default_autoreg_dict to create a dictionary. See more details for above parameters in build_autoreg_df.
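As a rough illustration of the autoregression entry, the sketch below builds an `autoreg_dict` with the three documented keys and wraps it for single use and for grid search. The sub-keys `"orders"`, `"orders_list"` and `"interval_list"` are borrowed from the `lagged_regressor_dict` example given for lagged regressors in this section; that `build_autoreg_df` accepts the same sub-keys is an assumption, so consult its docstring before relying on them.

```python
# Sketch of an autoregression configuration; values are illustrative only.
autoreg_dict = {
    # Individual lags of the response (assumed sub-key, see lead-in).
    "lag_dict": {"orders": [1, 2, 3]},
    # Aggregated lags: averages over listed orders and over lag intervals.
    "agg_lag_dict": {
        "orders_list": [[7, 7 * 2, 7 * 3]],
        "interval_list": [(8, 7 * 2)],
    },
    # Fills gaps in the lagged series; applied to a pandas Series at fit time.
    "series_na_fill_func": lambda s: s.bfill().ffill(),
}

# A single configuration.
autoregression = {"autoreg_dict": autoreg_dict}

# A list of candidate values for grid search: no autoregression, the
# "auto" method, and the explicit dictionary above.
autoregression_grid = {"autoreg_dict": [None, "auto", autoreg_dict]}
```

The lambda is only defined here, not called, so the snippet itself needs no pandas import.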
- regressors: dict [str, any] or None, optional
How to model the regressors.
Allowed keys: None. (Use model_components.custom["extra_pred_cols"] to specify regressors.)
- lagged_regressors: dict [str, dict] or None, optional
Specifies the lagged regressors configuration. Dictionary with the following optional key:
"lagged_regressor_dict": dict or None or a list of such values for grid search
A dictionary with arguments for build_autoreg_df_multi. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included in df. The value of each key is either a dict or str. If dict, it has the following keys:
"lag_dict": dict or None
"agg_lag_dict": dict or None
"series_na_fill_func": callable
If str, it represents a method and a dictionary will be constructed using that str. Currently the only implemented method is “auto” which uses SilverkiteForecast’s __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
    lagged_regressor_dict = {
        "regressor1": {
            "lag_dict": {"orders": [1, 2, 3]},
            "agg_lag_dict": {
                "orders_list": [[7, 7 * 2, 7 * 3]],
                "interval_list": [(8, 7 * 2)]},
            "series_na_fill_func": lambda s: s.bfill().ffill()},
        "regressor2": "auto"}
Check the docstring of build_autoreg_df_multi for more details for each argument.
- uncertainty: dict [str, any] or None, optional
How to model the uncertainty. A dictionary with keys corresponding to parameters in forecast.
Allowed keys: "uncertainty_dict".
- custom: dict [str, any] or None, optional
Custom parameters that don’t fit the categories above. A dictionary with keys corresponding to parameters in forecast.
Allowed keys: "silverkite", "origin_for_time_vars", "extra_pred_cols", "drop_pred_cols", "explicit_pred_cols", "fit_algorithm_dict", "min_admissible_value", "max_admissible_value".
Note
"extra_pred_cols" should contain the desired growth terms, regressor names, and event names.
fit_algorithm_dict is a dictionary with fit_algorithm and fit_algorithm_params parameters to forecast:
- fit_algorithm_dict: dict or None, optional
How to fit the model. A dictionary with the following optional keys.
"fit_algorithm": str, optional, default “linear”
The type of predictive model used in fitting. See fit_model_via_design_matrix for available options and their parameters.
"fit_algorithm_params": dict or None, optional, default None
Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
- hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None], optional
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.
Keys should have the format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.
For example:
    hyperparameter_override={
        "estimator__origin_for_time_vars": 2018.0,
        "input__response__null__impute_algorithm": "ts_interpolate",
        "input__response__null__impute_params": {"orders": [7, 14]},
        "input__regressors_numeric__normalize__normalize_algorithm": "RobustScaler",
    }
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.
For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.
Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.
The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
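The override behavior described above can be pictured with a small standalone sketch. It is not the library’s code: apply_override_sketch and to_list are hypothetical helpers, the base grid contents are made up, and wrapping scalar override values in single-element lists is an assumption of this sketch.

```python
from copy import deepcopy


def to_list(value):
    """Wrap a non-list value so every grid entry is a list of candidates."""
    return value if isinstance(value, list) else [value]


def apply_override_sketch(grid, override):
    """Return the grid updated by `override`.

    A dict override creates new keys or replaces existing ones.
    A list of overrides yields one updated grid per list element.
    None leaves the grid unchanged.
    """
    if isinstance(override, list):
        return [apply_override_sketch(grid, item) for item in override]
    updated = deepcopy(grid)
    for key, value in (override or {}).items():
        updated[key] = to_list(value)
    return updated


# Hypothetical grid produced from the model components.
base_grid = {
    "estimator__fit_algorithm_dict": [{"fit_algorithm": "linear"}],
    "estimator__origin_for_time_vars": [None],
}

single = apply_override_sketch(
    base_grid, {"estimator__origin_for_time_vars": 2018.0})
multiple = apply_override_sketch(base_grid, [
    {"estimator__fit_algorithm_dict": {"fit_algorithm": "ridge"}},
    None,
])
```

Here single is one grid with the time origin replaced, while multiple holds two grids: one that swaps the fit algorithm and one that keeps the defaults.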
- model_template: str
This class only accepts “SK”.
- DEFAULT_MODEL_TEMPLATE = 'SK'
The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports. Overrides the value from ForecastConfigDefaults.
- property allow_model_template_list
SilverkiteTemplate does not allow config.model_template to be a list.
- property allow_model_components_param_list
SilverkiteTemplate does not allow config.model_components_param to be a list.
- get_regressor_cols()[source]
Returns regressor column names.
Implements the method in BaseTemplate.
The intersection of extra_pred_cols from model components and self.df columns, excluding time_col and value_col.
- Returns
regressor_cols – See forecast_pipeline.
- Return type
list [str] or None
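The intersection rule can be sketched with plain lists. This is an illustration under stated assumptions rather than the template’s implementation: regressor_cols_sketch is a hypothetical helper, it matches only exact column names (it does not parse interaction terms), and returning None when no extra_pred_cols are given is an assumption.

```python
def regressor_cols_sketch(extra_pred_cols, df_columns,
                          time_col="ts", value_col="y"):
    """Columns that appear in both extra_pred_cols and the dataframe,
    excluding the time and value columns."""
    if not extra_pred_cols:
        return None
    wanted = set(extra_pred_cols)
    excluded = {time_col, value_col}
    # Preserve the dataframe's column order.
    return [col for col in df_columns
            if col in wanted and col not in excluded]


# Hypothetical inputs: a growth term, a regressor and an event name.
cols = regressor_cols_sketch(
    extra_pred_cols=["ct1", "regressor1", "some_event"],
    df_columns=["ts", "y", "regressor1", "regressor2"],
)
```

Only regressor1 is returned, because the growth term and the event name are not columns of the input data.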
- get_lagged_regressor_info()[source]
Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.
Implements the method in BaseTemplate.
- Returns
lagged_regressor_info – A dictionary that includes the lagged regressor column names and maximal/minimal lag order. The keys are:
- lagged_regressor_cols: list [str] or None
See forecast_pipeline.
- overall_min_lag_order: int or None
- overall_max_lag_order: int or None
For example:
    self.config.model_components_param.lagged_regressors["lagged_regressor_dict"] = [
        {"regressor1": {
            "lag_dict": {"orders": [7]},
            "agg_lag_dict": {
                "orders_list": [[7, 7 * 2, 7 * 3]],
                "interval_list": [(8, 7 * 2)]},
            "series_na_fill_func": lambda s: s.bfill().ffill()}
        },
        {"regressor2": {
            "lag_dict": {"orders": [2]},
            "agg_lag_dict": {
                "orders_list": [[7, 7 * 2]],
                "interval_list": [(8, 7 * 2)]},
            "series_na_fill_func": lambda s: s.bfill().ffill()}
        },
        {"regressor3": "auto"}
    ]
Then the function returns:
    lagged_regressor_info = {
        "lagged_regressor_cols": ["regressor1", "regressor2", "regressor3"],
        "overall_min_lag_order": 2,
        "overall_max_lag_order": 21
    }
Note that “regressor3” is skipped as the “auto” option makes sure the lag order is proper.
- Return type
dict
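One way to arrive at the values in the example above is to collect every lag order mentioned in the dict-valued entries and take the minimum and maximum. The sketch below does this in plain Python; lag_orders and lagged_regressor_info_sketch are hypothetical names, and treating both interval endpoints as lag orders is an assumption.

```python
def lag_orders(spec):
    """Collect every lag order referenced by one regressor's dict spec."""
    orders = list((spec.get("lag_dict") or {}).get("orders") or [])
    agg = spec.get("agg_lag_dict") or {}
    for order_list in agg.get("orders_list") or []:
        orders.extend(order_list)
    for start, end in agg.get("interval_list") or []:
        orders.extend([start, end])
    return orders


def lagged_regressor_info_sketch(lagged_regressor_dicts):
    """Summarize column names and overall min/max lag order."""
    cols, orders = [], []
    for entry in lagged_regressor_dicts:
        for name, spec in entry.items():
            cols.append(name)
            # String specs such as "auto" contribute no explicit orders.
            if isinstance(spec, dict):
                orders.extend(lag_orders(spec))
    return {
        "lagged_regressor_cols": cols,
        "overall_min_lag_order": min(orders) if orders else None,
        "overall_max_lag_order": max(orders) if orders else None,
    }


info = lagged_regressor_info_sketch([
    {"regressor1": {
        "lag_dict": {"orders": [7]},
        "agg_lag_dict": {"orders_list": [[7, 14, 21]],
                         "interval_list": [(8, 14)]}}},
    {"regressor2": {
        "lag_dict": {"orders": [2]},
        "agg_lag_dict": {"orders_list": [[7, 14]],
                         "interval_list": [(8, 14)]}}},
    {"regressor3": "auto"},
])
```

With these inputs the sketch reproduces the documented result: a minimum lag order of 2 and a maximum of 21.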
- get_hyperparameter_grid()[source]
Returns hyperparameter grid.
Implements the method in BaseTemplate.
Uses self.time_properties and self.config to generate the hyperparameter grid.
Converts model components and time properties into SilverkiteEstimator hyperparameters.
Notes
forecast_pipeline handles the train/test splits according to EvaluationPeriodParam, so estimator__train_test_thresh and estimator__training_fraction are always None.
estimator__changepoint_detector is always None, to prevent leaking future information into the past. Pass changepoints_dict with method="auto" for automatic detection.
- Returns
hyperparameter_grid – See forecast_pipeline. The output dictionary values are lists, combined in grid search.
- Return type
dict, list [dict] or None
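Because the dictionary values are lists that are combined in grid search, the candidate parameter sets can be enumerated as a Cartesian product. The sketch below shows this with the standard library; expand_grid_sketch is a hypothetical helper, and the grid keys and values are illustrative only.

```python
from itertools import product


def expand_grid_sketch(grid):
    """Enumerate every parameter combination in a grid or list of grids."""
    grids = grid if isinstance(grid, list) else [grid]
    combos = []
    for one_grid in grids:
        keys = sorted(one_grid)
        # One candidate per element of the Cartesian product of the lists.
        for values in product(*(one_grid[key] for key in keys)):
            combos.append(dict(zip(keys, values)))
    return combos


# Illustrative grid: two fit algorithms times two sets of growth terms.
grid = {
    "estimator__fit_algorithm_dict": [
        {"fit_algorithm": "linear"},
        {"fit_algorithm": "ridge"},
    ],
    "estimator__extra_pred_cols": [["ct1"], ["ct1", "ct2"]],
}
combos = expand_grid_sketch(grid)
```

A randomized search samples from this set of candidates instead of visiting all of them.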
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ComputationParam
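The “fill in what is missing, leave the rest alone” behavior shared by the apply_*_defaults methods can be sketched with a dataclass. Everything here is a stand-in: ComputationSketch, DEFAULTS and apply_defaults_sketch are hypothetical, the default values are made up, and returning a new object instead of modifying the input is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ComputationSketch:
    """Stand-in parameter object; None means 'not provided'."""
    hyperparameter_budget: Optional[int] = None
    n_jobs: Optional[int] = None
    verbose: Optional[int] = None


# Hypothetical defaults for the attributes that have one.
DEFAULTS = {"n_jobs": 1, "verbose": 1}


def apply_defaults_sketch(param=None):
    """Create the object if None, then fill attributes left as None."""
    if param is None:
        param = ComputationSketch()
    updates = {name: default for name, default in DEFAULTS.items()
               if getattr(param, name) is None}
    # Attributes without a default (hyperparameter_budget) are untouched.
    return replace(param, **updates)


from_none = apply_defaults_sketch()
partial = apply_defaults_sketch(ComputationSketch(n_jobs=4))
```

In partial, the provided n_jobs=4 is kept, verbose receives its default, and hyperparameter_budget stays None.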
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid Forecast Config that contains the provided attribute values, and default values for attributes that were not provided.
- Return type
ForecastConfig
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ModelComponentsParam or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
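The two rules stated above, unpacking a single-element list and substituting the default for None, can be sketched as follows. apply_template_default_sketch is a hypothetical helper, and replacing None entries inside a longer list is an assumption inferred from the documented input and return types.

```python
def apply_template_default_sketch(model_template=None, default="SK"):
    """Resolve None to the default and unpack single-element lists."""
    if isinstance(model_template, list):
        resolved = [default if name is None else name
                    for name in model_template]
        # A list with exactly one element collapses to that element.
        return resolved[0] if len(resolved) == 1 else resolved
    return default if model_template is None else model_template


from_none = apply_template_default_sketch(None)
from_single = apply_template_default_sketch(["SILVERKITE"])
from_mixed = apply_template_default_sketch([None, "SILVERKITE"])
```

The same unpacking idea applies to apply_model_components_defaults, with a ModelComponentsParam object in place of the template name.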
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict [source]
Explicitly calls the method in BaseTemplate to make use of the decorator in this class.
- Parameters
df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.
- Returns
pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.
- Return type
dict
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period": int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq": SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points": int
Number of observations for training.
"num_training_days": int
Number of days for training.
"start_year": int
Start year of the training period.
"end_year": int
End year of the forecast period.
"origin_for_time_vars": float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by get_forecast_time_properties).
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call forecast_pipeline.
Lag Based Template
- class greykite.framework.templates.lag_based_template.LagBasedTemplate(estimator: BaseForecastEstimator = LagBasedEstimator())[source]
A template for LagBasedEstimator.
- DEFAULT_MODEL_TEMPLATE = 'LAG_BASED'
The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports.
- property allow_model_template_list
LagBasedTemplate does not allow config.model_template to be a list.
- property allow_model_components_param_list
LagBasedTemplate does not allow config.model_components_param to be a list.
- get_regressor_cols()[source]
Returns regressor column names from the model components. LagBasedTemplate does not support regressors.
- apply_lag_based_model_components_defaults(model_components: Optional[ModelComponentsParam] = None)[source]
Fills in the default values of model_components if not provided.
- Parameters
model_components (ModelComponentsParam or None, default None) – Configuration for LagBasedTemplate. Should only have values in the “custom” key.
- Returns
model_components – The provided model_components with default values set.
- Return type
- get_hyperparameter_grid()[source]
Returns hyperparameter grid.
Implements the method in BaseTemplate.
Uses self.config to generate the hyperparameter grid.
Converts model components into LagBasedEstimator hyperparameters.
- Returns
hyperparameter_grid – See forecast_pipeline. The output dictionary values are lists, combined in grid search.
- Return type
dict, list [dict] or None
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict [source]
Explicitly calls the method in BaseTemplate to make use of the decorator in this class.
- Parameters
df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.
- Returns
pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.
- Return type
dict
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ComputationParam
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid Forecast Config that contains the provided attribute values, and default values for attributes that were not provided.
- Return type
ForecastConfig
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values, and default values for attributes that were not provided.
- Return type
ModelComponentsParam or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
- static apply_template_decorator(func)[source]
Decorator for apply_template_for_pipeline_params function.
Overrides the method in BaseTemplate.
- Raises
ValueError – If config.model_template != "LAG_BASED".
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period": int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq": SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points": int
Number of observations for training.
"num_training_days": int
Number of days for training.
"start_year": int
Start year of the training period.
"end_year": int
End year of the forecast period.
"origin_for_time_vars": float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_lagged_regressor_info()
Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.
Can be overridden by subclass.
- Returns
lagged_regressor_info – A dictionary that includes the lagged regressor column names and maximal/minimal lag order. The keys are:
- lagged_regressor_cols: list [str] or None
See forecast_pipeline.
- overall_min_lag_order: int or None
- overall_max_lag_order: int or None
- Return type
dict
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by get_forecast_time_properties).
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call forecast_pipeline.
- class greykite.sklearn.estimator.lag_based_estimator.LagBasedEstimator(score_func=<function mean_squared_error>, coverage=None, null_model_params=None, freq: ~typing.Optional[str] = None, lag_unit: str = 'week', lags: ~typing.Optional[~typing.Union[int, ~typing.List[int]]] = None, agg_func: ~typing.Union[str, callable] = 'mean', agg_func_params: ~typing.Optional[dict] = None, uncertainty_dict: ~typing.Optional[dict] = None, past_df: ~typing.Optional[~pandas.core.frame.DataFrame] = None, series_na_fill_func: ~typing.Optional[callable] = None)[source]
The lag-based estimator, which uses lagged observations with aggregation functions to forecast the future. This estimator includes the common week-over-week estimation method.
The algorithm supports specifying the following:
- lag_unit: the unit to calculate lagged values. One of the values in LagUnitEnum.
- lags: a list of lags indicating which lagged lag_unit data are used in prediction. For example, [1, 2] means using the same-time values from the past two lag_units.
- agg_func: the aggregation function used over the lagged observations.
- agg_func_params: extra parameters used for agg_func.
When certain lags are not available, extra data will be extrapolated. When predicting into the future and future data is not available, predicted values will be used.
- Parameters
freq (str or None, default None) – The data frequency, used to validate lags.
lag_unit (str, default “week”) – The unit to calculate lagged observations. Available options are in LagUnitEnum.
lags (list [int] or None, default None) – The lags, in units of lag_unit. [1, 2] indicates using the same-time values from the past two lag_units. If not provided, the default is to use the lag 1 observation only.
agg_func (str or callable, default “mean”) – The aggregation function used over lagged observations.
agg_func_params (dict or None, default None) – Extra parameters used for agg_func.
uncertainty_dict (dict or None, default None) – How to fit the uncertainty model. See UncertaintyMethodEnum. If not provided but coverage is given, this falls back to SimpleConditionalResidualsModel.
past_df (pandas.DataFrame or None, default None) – The past data used to append to the training data. If not provided, the past data needed will be interpolated.
series_na_fill_func (callable or None, default lambda s: s.bfill().ffill()) – The function to fill NAs when they exist.
- df
The fitted and interpolated training data.
- Type
pandas.DataFrame or None
- uncertainty_model
The trained uncertainty model.
- Type
any or None
- max_lag_order
The maximum lag order.
- Type
int or None
- min_lag_order
The minimum lag order.
- Type
int or None
- train_start
The training start timestamp.
- Type
pandas.Timestamp or None
- train_end
The training end timestamp.
- Type
pandas.Timestamp or None
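The week-over-week idea described for this estimator can be sketched in plain Python. This is not the library’s implementation: lag_based_forecast_sketch and its period argument are hypothetical, the series is assumed to be evenly spaced with period observations per lag unit, and the sketch raises an error rather than extrapolating when the history is too short.

```python
from statistics import mean


def lag_based_forecast_sketch(history, horizon, lags=(1,), period=7,
                              agg_func=mean):
    """Forecast `horizon` steps by aggregating same-time lagged values.

    history : past observations, oldest first
    lags    : lags in units of `period` observations (e.g. weeks)
    """
    if len(history) < max(lags) * period:
        raise ValueError("Not enough history for the requested lags.")
    values = list(history)
    n_train = len(values)
    for step in range(horizon):
        t = n_train + step
        # Once t - lag * period reaches the forecast period, the value
        # read here is an earlier prediction appended to `values`.
        lagged = [values[t - lag * period] for lag in lags]
        values.append(agg_func(lagged))
    return values[n_train:]


# Two weeks of daily data with values 1..14; average the last two weeks.
forecast = lag_based_forecast_sketch(
    history=list(range(1, 15)), horizon=8, lags=(1, 2))
```

The first forecast averages the values from one and two weeks earlier (8 and 1). The eighth forecast has no actual value one week back, so it reuses the first forecast together with the actual value 8.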
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Fits the lag based forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column and value column. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – Additional parameters for null model.
- Returns
self – Fitted class instance.
- Return type
self
- predict(X, y=None)[source]
Creates forecast for the dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –
- Returns
predictions – Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions, optional
PREDICTED_UPPER_COL: upper bound of predictions, optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.
- Return type
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) – The uncertainty model specification. It should have the following keys:
"uncertainty_method": a string that is in UncertaintyMethodEnum.
"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- get_params(deep=True)
Get parameters for this estimator.
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
- score(X, y, sample_weight=None)
Default scorer for the estimator (used in GridSearchCV/RandomizedSearchCV if scoring=None).
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – Ignored.
- Returns
score – Comparison of predictions against null predictions, according to specified score function.
- Return type
float or None
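A score of “model error relative to null model” can be pictured as one minus the ratio of the model’s loss to the null model’s loss. The sketch below is written under that assumption and is not taken from the library: r2_null_model_score_sketch is a hypothetical name, the mean squared error is reimplemented locally, and returning None when the null loss is zero is a choice made for this sketch.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actuals and predictions."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def r2_null_model_score_sketch(y_true, y_pred, y_pred_null,
                               loss_func=mean_squared_error):
    """Improvement of the model over a null model.

    1.0 is a perfect model, 0.0 matches the null model, and negative
    values are worse than the null model.
    """
    null_loss = loss_func(y_true, y_pred_null)
    if null_loss == 0:
        return None  # Ratio undefined; assumption of this sketch.
    return 1.0 - loss_func(y_true, y_pred) / null_loss


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]          # model predictions
y_pred_null = [2.5, 2.5, 2.5, 2.5]     # null model: predict the mean
score = r2_null_model_score_sketch(y_true, y_pred, y_pred_null)
```

Here the model’s mean squared error is 0.25 and the null model’s is 1.25, so the model removes most of the null model’s error.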
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- class greykite.sklearn.estimator.lag_based_estimator.LagUnitEnum(value)[source]
Defines the lag units available in LagBasedEstimator. The keys are available string names and the values are the corresponding dateutil.relativedelta.relativedelta objects.
Multistage Forecast Template
- class greykite.framework.templates.multistage_forecast_template.MultistageForecastTemplate(constants: ~greykite.framework.templates.multistage_forecast_template_config.MultistageForecastTemplateConstants = <class 'greykite.framework.templates.multistage_forecast_template_config.MultistageForecastTemplateConstants'>, estimator: ~greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator = MultistageForecastEstimator(forecast_horizon=1, model_configs=[]))[source]
The model template for Multistage Forecast Estimator.
- DEFAULT_MODEL_TEMPLATE = 'SILVERKITE_TWO_STAGE'
The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports.
- property constants: MultistageForecastTemplateConstants
Constants used by the template class. Includes the model templates and their default values.
- get_regressor_cols()[source]
Gets the regressor columns in the model.
Iterates over each submodel to extract the regressor columns.
- Returns
regressor_cols – A list of the regressor column names used in any of the submodels.
- Return type
list [str]
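Collecting regressor columns across submodels amounts to taking a union of per-submodel lists. The sketch below illustrates that idea only: combined_regressor_cols_sketch is a hypothetical helper, and removing duplicates while keeping first-seen order is an assumption.

```python
def combined_regressor_cols_sketch(submodel_regressor_cols):
    """Union of regressor columns over submodels, in first-seen order."""
    combined = []
    for cols in submodel_regressor_cols:
        # A submodel without regressors may report None.
        for col in cols or []:
            if col not in combined:
                combined.append(col)
    return combined


# Hypothetical regressor columns for three submodels.
all_cols = combined_regressor_cols_sketch([
    ["regressor1", "regressor2"],
    None,
    ["regressor2", "regressor3"],
])
```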
- get_lagged_regressor_info()[source]
Gets the lagged regressor info for the model.
Iterates over each submodel to extract the lagged regressor info.
- Returns
lagged_regressor_info – The combined lagged regressor info from all submodels.
- Return type
dict
- get_hyperparameter_grid()[source]
Gets the hyperparameter grid for the Multistage Forecast Model.
- Returns
hyperparameter_grid – hyperparameter_grid for grid search in forecast_pipeline. The output dictionary values are lists, combined in grid search.
- Return type
dict [str, list [any]] or list [ dict [str, list [any]] ]
- property allow_model_template_list: bool
Whether the template accepts a list for config.model_template (bool)
- property allow_model_components_param_list: bool
Whether the template accepts a list for config.model_components_param (bool)
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
- Return type
ComputationParam
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
- Return type
ForecastConfig
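The "keep what is provided, fill in what is missing" behavior described above can be illustrated with a minimal sketch. It uses hypothetical stand-in dataclasses (`SketchConfig`, `SketchComputation`) rather than the real `ForecastConfig` and `ComputationParam`; the field names and default values are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class SketchComputation:
    # Hypothetical stand-in for ComputationParam; None means "not provided".
    n_jobs: Optional[int] = None
    verbose: Optional[int] = None


@dataclass
class SketchConfig:
    # Hypothetical stand-in for ForecastConfig.
    computation_param: Optional[SketchComputation] = None
    model_template: Optional[str] = None


# Assumed default values, for illustration only.
COMPUTATION_DEFAULTS = SketchComputation(n_jobs=1, verbose=0)


def apply_computation_defaults(computation=None):
    """Returns a copy in which unset (None) attributes take the default value."""
    if computation is None:
        return replace(COMPUTATION_DEFAULTS)
    updates = {
        f.name: getattr(COMPUTATION_DEFAULTS, f.name)
        for f in fields(computation)
        if getattr(computation, f.name) is None
    }
    return replace(computation, **updates)


def apply_config_defaults(config=None, default_template="AUTO"):
    """Creates a config if None, then fills each missing attribute."""
    config = replace(config) if config is not None else SketchConfig()
    config.computation_param = apply_computation_defaults(config.computation_param)
    if config.model_template is None:
        config.model_template = default_template
    return config


filled = apply_config_defaults(
    SketchConfig(computation_param=SketchComputation(n_jobs=4)))
print(filled)
```

Provided values (`n_jobs=4`) survive, while missing ones are replaced by the assumed defaults; passing no config at all yields a fully defaulted object.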
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
- Return type
ModelComponentsParam
or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
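A minimal sketch of the two rules stated above (set the default if None, unpack a single-element list) is shown below. The function name is hypothetical and this is not the library's implementation; the default name is taken from `DEFAULT_MODEL_TEMPLATE` as documented for this class.

```python
from typing import List, Optional, Union


def apply_model_template_defaults_sketch(
        model_template: Optional[Union[str, List[Optional[str]]]] = None,
        default: str = "SILVERKITE_TWO_STAGE") -> Union[str, List[str]]:
    """Illustrative only: fills None with a default, unpacks 1-element lists."""
    if model_template is None:
        return default
    if isinstance(model_template, list):
        # Replace missing entries, then unpack if only one element remains.
        filled = [default if name is None else name for name in model_template]
        return filled[0] if len(filled) == 1 else filled
    return model_template


print(apply_model_template_defaults_sketch(None))
print(apply_model_template_defaults_sketch(["PROPHET"]))
print(apply_model_template_defaults_sketch([None, "PROPHET"]))
```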
- static apply_template_decorator(func)
Decorator for the apply_template_for_pipeline_params function.
By default, this applies apply_forecast_config_defaults to config.
A subclass may override this for pre/post processing of apply_template_for_pipeline_params, such as input validation. In this case, apply_template_for_pipeline_params must also be implemented in the subclass.
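The decorator pattern described here can be sketched as follows. This is a simplified illustration under stated assumptions: `apply_defaults_decorator` and `SketchTemplate` are hypothetical names, and a plain dict stands in for `ForecastConfig`.

```python
import functools


def apply_defaults_decorator(func):
    """Hypothetical decorator: fills config defaults before calling ``func``."""
    @functools.wraps(func)
    def wrapper(self, df, config=None):
        config = self.apply_forecast_config_defaults(config)
        return func(self, df, config)
    return wrapper


class SketchTemplate:
    def apply_forecast_config_defaults(self, config=None):
        # Stand-in for the real method: a dict plays the role of the config,
        # and forecast_horizon=1 is an assumed default.
        merged = {"forecast_horizon": 1}
        merged.update(config or {})
        return merged

    @apply_defaults_decorator
    def apply_template_for_pipeline_params(self, df, config=None):
        # By the time this body runs, ``config`` is never None.
        return {"df": df, "forecast_horizon": config["forecast_horizon"]}


template = SketchTemplate()
params_default = template.apply_template_for_pipeline_params(df=[1, 2, 3])
params_custom = template.apply_template_for_pipeline_params(
    df=[1, 2, 3], config={"forecast_horizon": 7})
```

A subclass that needs extra validation could define its own decorator with additional pre/post steps and re-declare the decorated method, which matches the note that both must then be implemented in the subclass.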
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict
Implements template interface method. Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.
See template interface for parameters and return value.
Uses the methods in this class to set:
"regressor_cols" : get_regressor_cols()
"lagged_regressor_cols" : get_lagged_regressor_info()
"pipeline" : get_pipeline()
"time_properties" : get_forecast_time_properties()
"hyperparameter_grid" : get_hyperparameter_grid()
All other parameters are taken directly from config.
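The way the pipeline parameters are assembled can be pictured with a toy class. Everything below is a sketch under assumptions: `SketchTemplate` is hypothetical, each getter returns a placeholder instead of a real object, and a plain dict stands in for `ForecastConfig`.

```python
class SketchTemplate:
    """Toy template: each getter returns a placeholder, not a real object."""

    def get_regressor_cols(self):
        return ["regressor1"]

    def get_lagged_regressor_info(self):
        return None  # placeholder: no lagged regressors in this toy

    def get_pipeline(self):
        return "pipeline-placeholder"

    def get_forecast_time_properties(self):
        return {"period": 86400}

    def get_hyperparameter_grid(self):
        return {"estimator__forecast_horizon": [7]}

    def apply_template_for_pipeline_params(self, df, config):
        # Start from the values taken directly from config...
        params = dict(config)
        # ...then add the entries computed by the template's own methods.
        params.update({
            "df": df,
            "regressor_cols": self.get_regressor_cols(),
            "lagged_regressor_cols": self.get_lagged_regressor_info(),
            "pipeline": self.get_pipeline(),
            "time_properties": self.get_forecast_time_properties(),
            "hyperparameter_grid": self.get_hyperparameter_grid(),
        })
        return params


pipeline_params = SketchTemplate().apply_template_for_pipeline_params(
    df=[1.0, 2.0, 3.0],
    config={"forecast_horizon": 7, "coverage": 0.95},
)
```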
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of
greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator
.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if
score_func
is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by
get_forecast_time_properties
)
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
- df: Optional[pd.DataFrame]
Timeseries data to forecast.
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call
forecast_pipeline
.
- class greykite.sklearn.estimator.multistage_forecast_estimator.MultistageForecastEstimator(model_configs: ~typing.List[~greykite.sklearn.estimator.multistage_forecast_estimator.MultistageForecastModelConfig], forecast_horizon: int, freq: ~typing.Optional[str] = None, uncertainty_dict: ~typing.Optional[dict] = None, score_func: ~typing.Callable = <function mean_squared_error>, coverage: ~typing.Optional[float] = None, null_model_params: ~typing.Optional[dict] = None)[source]
The Multistage Forecast Estimator class. Implements the Multistage forecast method.
The Multistage forecast method allows users to fit multiple stages of models, where each stage performs the following steps:
subsetting: take a subset of data from the end of the training data;
aggregation: aggregate the subset of data into the desired frequency;
training: train a model with the desired estimator and parameters.
Users can just use one stage model to train on a subset/aggregation of the original data, or can specify multiple stages, where the later stages will be trained on the fitted residuals of the previous stages.
This can significantly speed up the training process if the original data is long and in fine granularity.
Notes
The following assumptions or special implementations are made in this class:
The actual fit_length, the length of data where the fitted values are calculated, is the longer of train_length and fit_length. The reason is that there is no benefit in calculating a shorter period of fitted values. The fitted values are already available during training (in Silverkite), so there is no loss in calculating fitted values on a superset of the training data.
The estimator sorts the model_configs according to the train_length in descending order. The corresponding aggregation frequency, aggregation function, fit length, estimator and parameters will be sorted accordingly. This is to ensure that we have enough data to use from the previous model when we fit the next model.
When calculating the length of training data, the length of past df, etc., the actual length used may include 1 more period to avoid missing timestamps. For example, for an AR order of 5, you may see the length of past_df to be 6; or for a train length of “365D”, you may see the actual length to be 366. This is expected, and avoids potential missing timestamps after dropping incomplete aggregation periods.
Since the models in each stage may not fit on the entire training data, there could be periods where fitted values are not calculated. Leading fitted values in the training period may be NA. These values are ignored when computing evaluation metrics.
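The first two notes (effective fit length and ordering of stages) can be sketched in plain Python. This is an illustration under assumptions, not the estimator's code: `StageConfig`, `length_in_seconds` and `normalize_stages` are hypothetical names, and the length parser only understands the units H, D and W.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed unit table for this sketch only.
_UNIT_SECONDS = {"H": 3600, "D": 86400, "W": 7 * 86400}


def length_in_seconds(length: str) -> int:
    """Parses strings such as "28D"; supports only H, D, W in this sketch."""
    return int(length[:-1]) * _UNIT_SECONDS[length[-1]]


@dataclass
class StageConfig:
    # Hypothetical stand-in for one stage's configuration.
    train_length: str
    fit_length: Optional[str] = None
    agg_freq: Optional[str] = None


def normalize_stages(stages: List[StageConfig]) -> List[Tuple[StageConfig, int, int]]:
    """Sorts stages by train length (descending) and resolves fit lengths.

    Returns tuples of (stage, train_seconds, effective_fit_seconds), where the
    effective fit length is the longer of train_length and fit_length.
    """
    ordered = sorted(
        stages, key=lambda s: length_in_seconds(s.train_length), reverse=True)
    result = []
    for stage in ordered:
        train = length_in_seconds(stage.train_length)
        fit = length_in_seconds(stage.fit_length) if stage.fit_length else train
        result.append((stage, train, max(train, fit)))
    return result


stages = [
    StageConfig(train_length="28D", fit_length="7D", agg_freq="H"),
    StageConfig(train_length="104W", fit_length=None, agg_freq="D"),
]
normalized = normalize_stages(stages)
```

In this example the daily-aggregated stage trained on 104 weeks is placed first, and the hourly stage's requested fit length of 7 days is extended to its 28-day train length.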
- model_configs
A list of model configs for Multistage Forecast estimator, representing the stages in the model.
- Type
list [
MultistageForecastModelConfig
]
- forecast_horizon
The forecast horizon on the original data frequency.
- Type
int
- freq
The frequency of the original data.
- Type
str or None
- train_lengths
A list of training data lengths for the models.
- Type
list [str] or None
- fit_lengths
A list of fitting data lengths for the models.
- Type
list [str] or None
- agg_funcs
A list of aggregation functions for the models.
- Type
list [str or Callable] or None
- agg_freqs
A list of aggregation frequencies for the models.
- Type
list [str] or None
- estimators
A list of estimators used in the models.
- Type
list [
BaseForecastEstimator
] or None
- estimator_params
A list of estimator parameters for the estimators.
- Type
list [dict or None] or None
- train_lengths_in_seconds
The list of training lengths in seconds.
- Type
list [int] or None
- fit_lengths_in_seconds
The list of fitting lengths in seconds. If the original fit_length is None or is shorter than the corresponding train_length, it will be replaced by the corresponding train_length.
- Type
list [int] or None
- max_ar_orders
A list of maximum AR orders in the models.
- Type
list [int] or None
- data_freq_in_seconds
The data frequency in seconds.
- Type
int or None
- num_points_per_agg_freqs
Number of data points in each aggregation frequency.
- Type
list [int] or None
- models
The list of model instances.
- Type
list [
BaseForecastEstimator
]
- fit_df
The prediction df.
- Type
pandas.DataFrame
or None
- train_end
The train end timestamp.
- Type
pandas.Timestamp
or None
- forecast_horizons
The list of forecast horizons for all models in terms of the aggregated frequencies.
- Type
list [int]
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Fits MultistageForecast forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – Additional parameters for null model.
- Returns
self – Fitted model is stored in self.model_dict.
- Return type
self
- predict(X, y=None)[source]
Creates forecast for the dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –
- Returns
predictions – Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions, optional
PREDICTED_UPPER_COL: upper bound of predictions, optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if self.coverage is not None.
- Return type
- plot_components()[source]
Makes component plots.
- Returns
figs – A list of figures from each model.
- Return type
list [
plotly.graph_objects.Figure
or None]
- summary()[source]
Gets model summaries.
- Returns
summaries – A list of model summaries from each model.
- Return type
list [
ModelSummary
or None]
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) – The uncertainty model specification. It should have the following keys:
"uncertainty_method": a string that is in UncertaintyMethodEnum.
"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method.) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- get_params(deep=True)
Get parameters for this estimator.
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
- score(X, y, sample_weight=None)
Default scorer for the estimator (Used in GridSearchCV/RandomizedSearchCV if scoring=None)
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – Ignored.
- Returns
score – Comparison of predictions against null predictions, according to specified score function
- Return type
float or None
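To make "improvement of score_func against null model" concrete, the sketch below assumes the score takes the familiar R2-like form 1 - loss(model) / loss(null model). That form, the helper names, and the choice to return None when the null loss is zero are assumptions for illustration; this page does not spell out the formula.

```python
def mean_squared_error(y_true, y_pred):
    """Plain-Python MSE, standing in for the estimator's score_func."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_null_model_score_sketch(y_true, y_pred, y_pred_null,
                               loss_func=mean_squared_error):
    """Assumed form: 1 - loss(model) / loss(null). None if the null loss is 0."""
    null_loss = loss_func(y_true, y_pred_null)
    if null_loss == 0:
        return None
    return 1.0 - loss_func(y_true, y_pred) / null_loss


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]        # model forecast
y_pred_null = [2.5, 2.5, 2.5, 2.5]   # null model: predicts the mean
score = r2_null_model_score_sketch(y_true, y_pred, y_pred_null)
```

Under this assumed form, a value near 1 means the model's error is much smaller than the null model's, 0 means no improvement, and negative values mean the model is worse than the null model.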
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
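The `<component>__<parameter>` convention can be illustrated with a toy class. `SketchEstimator` is hypothetical and much simpler than scikit-learn's own `set_params`; it only shows how a double-underscore key is split and forwarded to a nested component.

```python
class SketchEstimator:
    """Toy object that stores parameters in a dict (illustration only)."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Nested key: forward the remainder to the named component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self


inner = SketchEstimator(coverage=None, forecast_horizon=1)
outer = SketchEstimator(estimator=inner, verbose=0)
returned = outer.set_params(estimator__coverage=0.95, verbose=1)
```

Because the remainder of the key is passed on unchanged, deeper nesting such as `a__b__c` is resolved one level at a time.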
Prophet Template
- class greykite.framework.templates.prophet_template.ProphetTemplate(estimator: Optional[BaseForecastEstimator] = None)[source]
A template for ProphetEstimator.
Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.
Notes
The attributes of a ForecastConfig for ProphetEstimator are:
- computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.
- coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Same as coverage in forecast_pipeline.
- evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.
- evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.
- forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency. Same as forecast_horizon in forecast_pipeline.
- metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.
- model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
- seasonality: dict [str, any] or None
Seasonality config dictionary, with the following optional keys.
"seasonality_mode" : str or None or list of such values for grid search
Can be ‘additive’ (default) or ‘multiplicative’.
"seasonality_prior_scale" : float or None or list of such values for grid search
Parameter modulating the strength of the seasonality model. Larger values allow the model to fit larger seasonal fluctuations, smaller values dampen the seasonality. Specify for individual seasonalities using add_seasonality_dict.
"yearly_seasonality" : str or bool or int or list of such values for grid search, default ‘auto’
Determines the yearly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.
"weekly_seasonality" : str or bool or int or list of such values for grid search, default ‘auto’
Determines the weekly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.
"daily_seasonality" : str or bool or int or list of such values for grid search, default ‘auto’
Determines the daily seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate.
"add_seasonality_dict" : dict or None or list of such values for grid search
dict of custom seasonality parameters to be added to the model, default=None. Key is the seasonality component name e.g. ‘monthly’; parameters are specified via dict. See prophet_estimator for details.
- growth: dict [str, any] or None
Specifies the growth parameter configuration. Dictionary with the following optional key:
"growth_term" : str or None or list of such values for grid search
How to model the growth. Valid options are “linear” and “logistic”. Specify a linear or logistic trend; these terms have their origin at the train start date.
- events: dict [str, any] or None
Holiday/events configuration dictionary with the following optional keys:
"holiday_lookup_countries" : list [str] or “auto” or None
Which countries’ holidays to include. Must contain all the holidays you intend to model. If “auto”, uses a default list of countries with a good coverage of global holidays. If None or an empty list, no holidays are modeled.
"holidays_prior_scale" : float or None or list of such values for grid search, default 10.0
Modulates the strength of the holiday effect.
"holiday_pre_num_days" : list [int] or None, default 2
Model holiday effects for holiday_pre_num_days days before the holiday. Grid search is not supported. Must be a list with one element or None.
"holiday_post_num_days" : list [int] or None, default 2
Model holiday effects for holiday_post_num_days days after the holiday. Grid search is not supported. Must be a list with one element or None.
- changepoints: dict [str, any] or None
Specifies the changepoint configuration. Dictionary with the following optional keys:
"changepoint_prior_scale" : float or None or list of such values for grid search, default 0.05
Parameter modulating the flexibility of the automatic changepoint selection. Large values will allow many changepoints, small values will allow few changepoints.
"changepoints" : list [datetime.datetime] or None or list of such values for grid search, default None
List of dates at which to include potential changepoints. If not specified, potential changepoints are selected automatically.
"n_changepoints" : int or None or list of such values for grid search, default 25
Number of potential changepoints to include. Not used if input changepoints is supplied. If changepoints is not supplied, then n_changepoints potential changepoints are selected uniformly from the first changepoint_range proportion of the history.
"changepoint_range" : float or None or list of such values for grid search, default 0.8
Proportion of history in which trend changepoints will be estimated. Permitted values: (0,1]. Not used if input changepoints is supplied.
- regressors: dict [str, any] or None
Specifies the regressors to include in the model (e.g. macro-economic factors). Dictionary with the following optional keys:
"add_regressor_dict" : dict or None or list of such values for grid search, default None
Dictionary of extra regressors to be modeled. See ProphetEstimator for details.
- uncertainty: dict [str, any] or None
Specifies the uncertainty configuration. A dictionary with the following optional keys:
"mcmc_samples" : int or None or list of such values for grid search, default 0
If greater than 0, will do full Bayesian inference with the specified number of MCMC samples. If 0, will do MAP estimation.
"uncertainty_samples" : int or None or list of such values for grid search, default 1000
Number of simulated draws used to estimate uncertainty intervals. Setting this value to 0 or False will disable uncertainty estimation and speed up the calculation.
- hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None]
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.
Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.
For example:
hyperparameter_override={
    "estimator__yearly_seasonality": [True, False],
    "estimator__seasonality_prior_scale": [5.0, 15.0],
    "input__response__null__impute_algorithm": "ts_interpolate",
    "input__response__null__impute_params": {"orders": [7, 14]},
    "input__regressors_numeric__normalize__normalize_algorithm": "RobustScaler",
}
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.
For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.
Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.
The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
- autoregression: dict [str, any] or None
Ignored. Prophet template does not support autoregression.
- lagged_regressors: dict [str, any] or None
Ignored. Prophet template does not support lagged regressors.
- custom: dict [str, any] or None
Ignored. There are no custom options.
- model_template: str
This class only accepts “PROPHET”.
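The benefit of passing hyperparameter_override as a list of dictionaries can be seen by counting the candidate settings. The sketch below enumerates them exhaustively with a hypothetical helper, `expand_grids`; the real search is delegated to scikit-learn (which may sample rather than enumerate), so this only illustrates how separate grids avoid mixed combinations.

```python
from itertools import product


def expand_grids(grids):
    """Expands a grid dict, or a list of grid dicts, into parameter settings."""
    if isinstance(grids, dict):
        grids = [grids]
    settings = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists within one dictionary only.
        for values in product(*(grid[k] for k in keys)):
            settings.append(dict(zip(keys, values)))
    return settings


# Two separate grids: a "complex" family and a "simple" family.
override_list = [
    {
        "estimator__yearly_seasonality": [True],
        "estimator__seasonality_prior_scale": [5.0, 15.0],
    },
    {
        "estimator__yearly_seasonality": [False],
        "estimator__seasonality_prior_scale": [5.0],
    },
]

# One joint grid over the same values, for comparison.
override_joint = {
    "estimator__yearly_seasonality": [True, False],
    "estimator__seasonality_prior_scale": [5.0, 15.0],
}

from_list = expand_grids(override_list)
from_joint = expand_grids(override_joint)
```

The list form yields three settings and never pairs the "simple" model with the larger prior scale, whereas the joint grid yields all four combinations.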
- DEFAULT_MODEL_TEMPLATE = 'PROPHET'
The default model template. See ModelTemplateEnum. Uses a string to avoid circular imports. Overrides the value from ForecastConfigDefaults.
- HOLIDAY_LOOKUP_COUNTRIES_AUTO = ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China')
Default holiday countries to use if countries=’auto’
- property allow_model_template_list
ProphetTemplate does not allow config.model_template to be a list.
- property allow_model_components_param_list
ProphetTemplate does not allow config.model_components_param to be a list.
- get_prophet_holidays(year_list, countries='auto', lower_window=-2, upper_window=2)[source]
Generates holidays for Prophet model.
- Parameters
year_list (list [int]) – List of years for selecting the holidays across given countries.
countries (list [str] or “auto” or None, default “auto”) –
Countries for selecting holidays.
If “auto”, uses a default list of countries with a good coverage of global holidays.
If a list, a list of country names.
If None, the function returns None.
lower_window (int or None, default -2) – Negative integer. Model holiday effects for given number of days before the holiday.
upper_window (int or None, default 2) – Positive integer. Model holiday effects for given number of days after the holiday.
- Returns
holidays – holidays dataframe to pass to Prophet’s holidays argument.
- Return type
- get_regressor_cols()[source]
Returns regressor column names.
Implements the method in BaseTemplate.
- Returns
regressor_cols – The names of regressor columns used in any hyperparameter set requested by model_components. None if there are no regressors.
- Return type
list [str] or None
- apply_prophet_model_components_defaults(model_components=None, time_properties=None)[source]
Sets default values for model_components.
Called by get_hyperparameter_grid after time_properties is defined. Requires time_properties as well as model_components so we do not simply override apply_model_components_defaults.
- Parameters
model_components (ModelComponentsParam or None, default None) – Configuration of model growth, seasonality, events, etc. See the docstring of this class for details.
time_properties (dict [str, any] or None, default None) – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
If None, start_year is set to 2015 and end_year to 2030.
- Returns
model_components – The provided model_components with default values set.
- Return type
- get_hyperparameter_grid()[source]
Returns hyperparameter grid.
Implements the method in BaseTemplate.
Uses self.time_properties and self.config to generate the hyperparameter grid.
Converts model components and time properties into ProphetEstimator hyperparameters.
- Returns
hyperparameter_grid – ProphetEstimator hyperparameters. See forecast_pipeline. The output dictionary values are lists, combined in grid search.
- Return type
dict [str, list [any]] or None
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values and the default attribute values if not.
- Return type
ComputationParam
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values and the default attribute values if not.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values and the default attribute values if not.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default Forecast Config values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a Forecast Config.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid Forecast Config which contains the provided attribute values and the default attribute values if not.
- Return type
ForecastConfig
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values and the default attribute values if not.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values and the default attribute values if not.
- Return type
ModelComponentsParam
or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
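The two behaviours named above (set the default if None, and unpack a single-element list to the element itself) could look roughly like the following. This is a sketch under stated assumptions: DEFAULT_TEMPLATE and apply_template_defaults are hypothetical names, and the choice to also fill None entries inside a list is an assumption based on the "list [None, str]" input type.

```python
from typing import List, Optional, Union

DEFAULT_TEMPLATE = "AUTO"  # hypothetical default name, for illustration only


def apply_template_defaults(
    model_template: Optional[Union[str, List[Optional[str]]]] = None,
) -> Union[str, List[str]]:
    """Fill None with the default and unpack single-element lists."""
    if model_template is None:
        return DEFAULT_TEMPLATE
    if isinstance(model_template, list):
        # Assumption: None entries inside a list are also replaced by the default.
        filled = [DEFAULT_TEMPLATE if name is None else name for name in model_template]
        # A list with exactly one element is unpacked to the element itself.
        return filled[0] if len(filled) == 1 else filled
    return model_template
```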
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of
greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator
.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_lagged_regressor_info()
Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.
Can be overridden by subclass.
- Returns
lagged_regressor_info – A dictionary that includes the lagged regressor column names and maximal/minimal lag order. The keys are:
lagged_regressor_cols : list [str] or None
See forecast_pipeline.
overall_min_lag_order : int or None
overall_max_lag_order : int or None
- Return type
dict
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
sklearn.pipeline.Pipeline
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by get_forecast_time_properties).
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call forecast_pipeline.
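To make the hyperparameter_grid and score_func_greater_is_better attributes more concrete, here is a dependency-free sketch of how a grid of lists expands into candidate settings and how the direction flag decides which candidate wins. The helper names (expand_grid, pick_best, toy_loss) are hypothetical, the loss is made up, and the real search is performed by scikit-learn inside forecast_pipeline rather than by code like this.

```python
from itertools import product
from typing import Any, Callable, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Enumerate every combination in a {name: [values]} grid, as an exhaustive search would."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def pick_best(
    candidates: List[Dict[str, Any]],
    score_func: Callable[[Dict[str, Any]], float],
    greater_is_better: bool,
) -> Dict[str, Any]:
    """Select the candidate with the highest score, or with the lowest loss."""
    chooser = max if greater_is_better else min
    return chooser(candidates, key=score_func)


def toy_loss(params: Dict[str, Any]) -> float:
    """Made-up loss, used only to exercise the selection logic."""
    penalty = 0.5 if params["estimator__seasonality_mode"] == "multiplicative" else 0.0
    return params["estimator__n_changepoints"] / 100 + penalty


# Keys address parameters of the pipeline step named "estimator".
grid = {
    "estimator__seasonality_mode": ["additive", "multiplicative"],
    "estimator__n_changepoints": [10, 25],
}
candidates = expand_grid(grid)  # 2 x 2 = 4 combinations
best = pick_best(candidates, toy_loss, greater_is_better=False)
```

Because toy_loss is a loss, greater_is_better is False and the smallest value is selected; passing True with the same function would select the largest value instead.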
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict [source]
Explicitly calls the method in BaseTemplate to make use of the decorator in this class.
- Parameters
df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.
- Returns
pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.
- Return type
dict
- class greykite.sklearn.estimator.prophet_estimator.ProphetEstimator(score_func=<function mean_squared_error>, coverage=0.8, null_model_params=None, growth='linear', changepoints=None, n_changepoints=25, changepoint_range=0.8, yearly_seasonality='auto', weekly_seasonality='auto', daily_seasonality='auto', holidays=None, seasonality_mode='additive', seasonality_prior_scale=10.0, holidays_prior_scale=10.0, changepoint_prior_scale=0.05, mcmc_samples=0, uncertainty_samples=1000, add_regressor_dict=None, add_seasonality_dict=None)[source]
Wrapper for Facebook Prophet model.
- Parameters
score_func (callable) – see BaseForecastEstimator
coverage (float between [0.0, 1.0]) – see BaseForecastEstimator
null_model_params (dict with arguments to define DummyRegressor null model, optional, default=None) – see BaseForecastEstimator
add_regressor_dict (dictionary of extra regressors to be added to the model, optional, default=None) –
These should be available for training and for the entire prediction interval.
Dictionary format:
    add_regressor_dict={
        # we can add as many regressors as we want, in the following format
        "reg_col1": {
            "prior_scale": 10,
            "standardize": True,
            "mode": 'additive'
        },
        "reg_col2": {
            "prior_scale": 20,
            "standardize": True,
            "mode": 'multiplicative'
        }
    }
add_seasonality_dict (dict of custom seasonality parameters to be added to the model, optional, default=None) –
Parameter details: refer to the add_seasonality() function in https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py. Key is the seasonality component name e.g. ‘monthly’; parameters are specified via dict.
Dictionary format:
    add_seasonality_dict={
        'monthly': {
            'period': 30.5,
            'fourier_order': 5
        },
        'weekly': {
            'period': 7,
            'fourier_order': 20,
            'prior_scale': 0.6,
            'mode': 'additive',
            # takes a bool column in df with True/False values. This means that
            # the seasonality will only be applied to dates where the condition_name column is True.
            'condition_name': 'condition_col'
        },
        'yearly': {
            'period': 365.25,
            'fourier_order': 10,
            'prior_scale': 0.2,
            'mode': 'additive'
        }
    }
Note: If there is a conflict between built-in and custom seasonality, e.g. both have “yearly”, then the custom seasonality is used and the model logs a message such as: “INFO:prophet:Found custom seasonality named “yearly”, disabling built-in yearly seasonality.”
kwargs (additional parameters) –
Other parameters are the same as in the Prophet model, with one exception: interval_width is specified by coverage. See the source code of __init__ for the parameter names, and refer to the Prophet documentation for their descriptions.
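As a hedged illustration of the add_seasonality_dict format, the sketch below turns each entry into one add_seasonality call, with the dictionary key used as the component name. RecordingModel is a hypothetical stand-in that only records calls so the example runs without Prophet installed, and add_custom_seasonalities is an illustrative helper, not the estimator's actual code.

```python
from typing import Any, Dict, List, Optional


class RecordingModel:
    """Hypothetical stand-in for a Prophet model; it only records add_seasonality calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def add_seasonality(self, name, period, fourier_order, prior_scale=None,
                        mode=None, condition_name=None):
        self.calls.append({
            "name": name, "period": period, "fourier_order": fourier_order,
            "prior_scale": prior_scale, "mode": mode, "condition_name": condition_name,
        })
        return self


def add_custom_seasonalities(model, add_seasonality_dict: Optional[Dict[str, Dict[str, Any]]]):
    """Make one add_seasonality call per entry; the dict key is the component name."""
    for name, params in (add_seasonality_dict or {}).items():
        model.add_seasonality(name=name, **params)
    return model


model = add_custom_seasonalities(
    RecordingModel(),
    {
        "monthly": {"period": 30.5, "fourier_order": 5},
        "yearly": {"period": 365.25, "fourier_order": 10,
                   "prior_scale": 0.2, "mode": "additive"},
    },
)
```

Entries that omit optional keys such as prior_scale or mode simply fall back to the defaults of the receiving method.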
- model
Prophet model object
- Type
Prophet object
- forecast
Output of predict method of Prophet.
- Type
pandas.DataFrame
- fit(X, y=None, time_col='ts', value_col='y', **fit_params)[source]
Fits the Prophet model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.Pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – Additional parameters for the null model.
- Returns
self – Fitted model is stored in self.model.
- Return type
self
- predict(X, y=None)[source]
Creates forecast for dates specified in X.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –
- Returns
predictions – Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions, optional
PREDICTED_UPPER_COL: upper bound of predictions, optional
[other columns], optional
PREDICTED_LOWER_COL and PREDICTED_UPPER_COL are present if and only if coverage is not None.
- Return type
pandas.DataFrame
- summary()[source]
Prints input parameters and Prophet model parameters.
- Returns
log_message – log message printed to logging.info()
- Return type
- plot_components(uncertainty=True, plot_cap=True, weekly_start=0, yearly_start=0, figsize=None)[source]
Plot the Prophet forecast components on the dataset passed to predict.
Will plot whichever are available of: trend, holidays, weekly seasonality, and yearly seasonality.
- Parameters
uncertainty (bool, optional, default True) – Boolean to plot uncertainty intervals.
plot_cap (bool, optional, default True) – Boolean indicating if the capacity should be shown in the figure, if available.
weekly_start (int, optional, default 0) – Specifies the start day of the weekly seasonality plot. 0 (default) starts the week on Sunday. 1 shifts by 1 day to Monday, and so on.
yearly_start (int, optional, default 0) – Specifies the start day of the yearly seasonality plot. 0 (default) starts the year on Jan 1. 1 shifts by 1 day to Jan 2, and so on.
figsize (tuple, optional, default None) – Width, height in inches.
- Returns
fig – A matplotlib figure.
- Return type
matplotlib.figure.Figure
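The offset semantics of weekly_start can be illustrated with a small indexing sketch, assuming the week is anchored on Sunday at offset 0 as the parameter description states. weekday_labels is a hypothetical helper that shows only the ordering of day labels; it is not Prophet's plotting code.

```python
from typing import List

# Offset 0 corresponds to Sunday, per the weekly_start description.
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def weekday_labels(weekly_start: int = 0) -> List[str]:
    """Return the seven day names in plot order for a given weekly_start offset."""
    return [DAYS[(weekly_start + i) % 7] for i in range(7)]
```

With weekly_start=1 the first label is Monday and the last is Sunday. The yearly_start parameter shifts the first plotted day of the year in the same way.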
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) – The uncertainty model specification. It should have the following keys:
"uncertainty_method": a string that is in UncertaintyMethodEnum.
"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- get_params(deep=True)
Get parameters for this estimator.
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
pandas.DataFrame
- score(X, y, sample_weight=None)
Default scorer for the estimator (used in GridSearchCV/RandomizedSearchCV if scoring=None).
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – Ignored.
- Returns
score – Comparison of predictions against null predictions, according to specified score function.
- Return type
float or None
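The two scoring branches in the notes above can be sketched as follows. The text does not spell out the formula behind R2_null_model_score, so this sketch assumes the common form 1 - error(model) / error(null model); default_score and the plain-Python mean_squared_error are illustrative helpers, not the library's functions.

```python
from typing import Callable, Optional, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain-Python MSE so the sketch has no dependencies."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def default_score(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    y_pred_null: Optional[Sequence[float]] = None,
    score_func: Callable[[Sequence[float], Sequence[float]], float] = mean_squared_error,
) -> float:
    """Score relative to a null model when one is given, else the raw score_func."""
    if y_pred_null is None:
        return score_func(y_true, y_pred)
    # Assumed form of the relative score: 1 - model error / null-model error.
    return 1.0 - score_func(y_true, y_pred) / score_func(y_true, y_pred_null)
```

Under this assumed form, a value of 1.0 means a perfect fit, 0.0 means no improvement over the null model, and negative values mean the model is worse than the null model.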
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
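The <component>__<parameter> naming used by set_params can be illustrated by splitting keys on the first double underscore. This is a simplified sketch of the naming convention only; split_nested_params is a hypothetical helper, and scikit-learn's own implementation also validates names and delegates to the nested objects.

```python
from typing import Any, Dict, Tuple


def split_nested_params(
    params: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Separate top-level parameters from '<component>__<parameter>' parameters."""
    top_level: Dict[str, Any] = {}
    nested: Dict[str, Dict[str, Any]] = {}
    for key, value in params.items():
        # Split on the first "__" only, so deeper nesting stays in the sub-key.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            top_level[key] = value
    return top_level, nested
```

For example, a key such as estimator__growth would be routed to the component named estimator as the parameter growth.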
ARIMA Template
- class greykite.framework.templates.auto_arima_template.AutoArimaTemplate(estimator: BaseForecastEstimator = AutoArimaEstimator())[source]
A template for AutoArimaEstimator.
Takes input data and optional configuration parameters to customize the model. Returns a set of parameters to call forecast_pipeline.
Notes
The attributes of a ForecastConfig for AutoArimaEstimator are:
- computation_param: ComputationParam or None, default None
How to compute the result. See ComputationParam.
- coverage: float or None, default None
Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Same as coverage in forecast_pipeline.
- evaluation_metric_param: EvaluationMetricParam or None, default None
What metrics to evaluate. See EvaluationMetricParam.
- evaluation_period_param: EvaluationPeriodParam or None, default None
How to split data for evaluation. See EvaluationPeriodParam.
- forecast_horizon: int or None, default None
Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency. Same as forecast_horizon in forecast_pipeline.
- metadata_param: MetadataParam or None, default None
Information about the input data. See MetadataParam.
- model_components_param: ModelComponentsParam or None, default None
Parameters to tune the model. See ModelComponentsParam. The fields are dictionaries with the following items.
- seasonality: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- growth: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- events: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- changepoints: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- regressors: dict [str, any] or None
Ignored. Auto Arima template currently does not support regressors.
- uncertainty: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- hyperparameter_override: dict [str, any] or None or list [dict [str, any] or None]
After the above model components are used to create a hyperparameter grid, the result is updated by this dictionary, to create new keys or override existing ones. Allows for complete customization of the grid search.
Keys should have format {named_step}__{parameter_name} for the named steps of the sklearn.pipeline.Pipeline returned by this function. See sklearn.pipeline.Pipeline.
For example:
    hyperparameter_override={
        "estimator__max_p": [8, 10],
        "estimator__information_criterion": ["bic"],
    }
If a list of dictionaries, grid search will be done for each dictionary in the list. Each dictionary in the list overrides the defaults. This enables grid search over specific combinations of parameters to reduce the search space.
For example, the first dictionary could define combinations of parameters for a “complex” model, and the second dictionary could define combinations of parameters for a “simple” model, to prevent mixed combinations of simple and complex.
Or the first dictionary could grid search over fit algorithm, and the second dictionary could use a single fit algorithm and grid search over seasonality.
The result is passed as the param_distributions parameter to sklearn.model_selection.RandomizedSearchCV.
- autoregression: dict [str, any] or None
Ignored. Pass the relevant Auto Arima arguments via custom.
- custom: dict [str, any] or None
Any parameter in the AutoArimaEstimator can be passed.
- model_template: str
This class only accepts “AUTO_ARIMA”.
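The hyperparameter_override behaviour described above (a single dictionary updates the default grid, while a list of dictionaries yields one grid per dictionary, each overriding the defaults independently) can be sketched as below. apply_override and the default grid values are hypothetical and only illustrate the update rule; they are not the template's internal code.

```python
from typing import Any, Dict, List, Optional, Union

Grid = Dict[str, List[Any]]


def apply_override(
    default_grid: Grid,
    override: Optional[Union[Grid, List[Optional[Grid]]]] = None,
) -> Union[Grid, List[Grid]]:
    """Update a default grid with an override dict, or build one grid per dict in a list."""
    if override is None:
        return dict(default_grid)
    if isinstance(override, list):
        # Each dictionary overrides the defaults independently of the others.
        return [{**default_grid, **(item or {})} for item in override]
    return {**default_grid, **override}


# Illustrative default grid; the values are made up for this sketch.
default_grid = {"estimator__max_p": [5], "estimator__information_criterion": ["aic"]}
grids = apply_override(
    default_grid,
    [
        {"estimator__max_p": [8, 10]},                  # search over a larger order
        {"estimator__information_criterion": ["bic"]},  # keep the order, change the criterion
    ],
)
```

Keeping the two dictionaries separate means the search never combines the larger max_p values with the bic criterion, which is how a list of overrides reduces the search space.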
- DEFAULT_MODEL_TEMPLATE = 'AUTO_ARIMA'
The default model template. See
ModelTemplateEnum
. Uses a string to avoid circular imports.
- property allow_model_template_list
AutoArimaTemplate does not allow config.model_template to be a list.
- property allow_model_components_param_list
AutoArimaTemplate does not allow config.model_components_param to be a list.
- get_regressor_cols()[source]
Returns regressor column names from the model components.
Currently does not implement regressors.
- apply_auto_arima_model_components_defaults(model_components=None)[source]
Sets default values for model_components.
- Parameters
model_components (ModelComponentsParam or None, default None) – Configuration of model growth, seasonality, events, etc. See the docstring of this class for details.
- Returns
model_components – The provided model_components with default values set.
- Return type
ModelComponentsParam
- get_hyperparameter_grid()[source]
Returns hyperparameter grid.
Implements the method in BaseTemplate.
Uses self.time_properties and self.config to generate the hyperparameter grid.
Converts model components into AutoArimaEstimator hyperparameters.
The output dictionary values are lists, combined via grid search in forecast_pipeline.
- Parameters
model_components (ModelComponentsParam or None, default None) – Configuration of parameter space to search the order (p, d, q etc.) of SARIMAX model. See auto_arima_template for details.
coverage (float or None, default=0.95) – Intended coverage of the prediction bands (0.0 to 1.0).
- Returns
hyperparameter_grid – AutoArimaEstimator hyperparameters.
See forecast_pipeline. The output dictionary values are lists, combined in grid search.
- Return type
dict [str, list [any]] or None
- apply_template_for_pipeline_params(df: DataFrame, config: Optional[ForecastConfig] = None) Dict [source]
Explicitly calls the method in BaseTemplate to make use of the decorator in this class.
- Parameters
df (pandas.DataFrame) – The time series dataframe with time_col and value_col and optional regressor columns.
config (ForecastConfig) – The ForecastConfig class that includes model training parameters.
- Returns
pipeline_parameters – The pipeline parameters consumable by forecast_pipeline.
- Return type
dict
- static apply_computation_defaults(computation: Optional[ComputationParam] = None) ComputationParam
Applies the default ComputationParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a ComputationParam object.
- Parameters
computation (ComputationParam or None) – The ComputationParam object.
- Returns
computation – Valid ComputationParam object with the provided attribute values, and default values for any attributes not provided.
- Return type
ComputationParam
- static apply_evaluation_metric_defaults(evaluation: Optional[EvaluationMetricParam] = None) EvaluationMetricParam
Applies the default EvaluationMetricParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationMetricParam object.
- Parameters
evaluation (EvaluationMetricParam or None) – The EvaluationMetricParam object.
- Returns
evaluation – Valid EvaluationMetricParam object with the provided attribute values, and default values for any attributes not provided.
- Return type
EvaluationMetricParam
- static apply_evaluation_period_defaults(evaluation: Optional[EvaluationPeriodParam] = None) EvaluationPeriodParam
Applies the default EvaluationPeriodParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates an EvaluationPeriodParam object.
- Parameters
evaluation (EvaluationPeriodParam or None) – The EvaluationPeriodParam object.
- Returns
evaluation – Valid EvaluationPeriodParam object with the provided attribute values, and default values for any attributes not provided.
- Return type
EvaluationPeriodParam
- apply_forecast_config_defaults(config: Optional[ForecastConfig] = None) ForecastConfig
Applies the default ForecastConfig values to the given config. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input config is None, it creates a ForecastConfig.
- Parameters
config (ForecastConfig or None) – Forecast configuration if available. See ForecastConfig.
- Returns
config – A valid ForecastConfig which contains the provided attribute values, and default values for any attributes not provided.
- Return type
ForecastConfig
- static apply_metadata_defaults(metadata: Optional[MetadataParam] = None) MetadataParam
Applies the default MetadataParam values to the given object. If an expected attribute value is provided, the value is unchanged. Otherwise the default value for it is used. Other attributes are untouched. If the input object is None, it creates a MetadataParam object.
- Parameters
metadata (MetadataParam or None) – The MetadataParam object.
- Returns
metadata – Valid MetadataParam object with the provided attribute values, and default values for any attributes not provided.
- Return type
MetadataParam
- static apply_model_components_defaults(model_components: Optional[Union[ModelComponentsParam, List[Optional[ModelComponentsParam]]]] = None) Union[ModelComponentsParam, List[ModelComponentsParam]]
Applies the default ModelComponentsParam values to the given object.
Converts None to a ModelComponentsParam object. Unpacks a list of a single element to the element itself.
- Parameters
model_components (ModelComponentsParam or None or list of such items) – The ModelComponentsParam object.
- Returns
model_components – Valid ModelComponentsParam object with the provided attribute values, and default values for any attributes not provided.
- Return type
ModelComponentsParam or list of such items
- apply_model_template_defaults(model_template: Optional[Union[str, List[Optional[str]]]] = None) Union[str, List[str]]
Applies the default model template to the given object.
Unpacks a list of a single element to the element itself. Sets default value if None.
- Parameters
model_template (str or None or list [None, str]) – The model template name. See valid names in ModelTemplateEnum.
- Returns
model_template – The model template name, with the default value used if not provided.
- Return type
str or list [str]
- static apply_template_decorator(func)[source]
Decorator for the apply_template_for_pipeline_params function.
Overrides the method in BaseTemplate.
- Raises
ValueError – if config.model_template != "AUTO_ARIMA".
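A validation decorator of this kind can be sketched with functools.wraps. The names require_auto_arima and ToyTemplate are hypothetical. Letting a missing config or missing model_template pass through, on the expectation that defaults are applied elsewhere, is an assumption of this sketch rather than documented behaviour.

```python
import functools
from types import SimpleNamespace


def require_auto_arima(func):
    """Reject configs whose model_template is not "AUTO_ARIMA" before calling func."""
    @functools.wraps(func)
    def wrapper(self, df, config=None):
        # Assumption: a missing config or template is allowed through here.
        template = getattr(config, "model_template", None)
        if template is not None and template != "AUTO_ARIMA":
            raise ValueError(f"Expected model_template 'AUTO_ARIMA', got {template!r}")
        return func(self, df, config)
    return wrapper


class ToyTemplate:
    """Hypothetical template with one decorated method."""

    @require_auto_arima
    def apply_template_for_pipeline_params(self, df, config=None):
        return {"df": df, "config": config}


# SimpleNamespace stands in for a config object with a model_template attribute.
config = SimpleNamespace(model_template="AUTO_ARIMA")
params = ToyTemplate().apply_template_for_pipeline_params(df=None, config=config)
```

Using functools.wraps keeps the decorated method's name and docstring intact, which matters when the method is documented or introspected.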
- property estimator
The estimator instance to use as the final step in the pipeline. An instance of
greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator
.
- get_forecast_time_properties()
Returns forecast time parameters.
Uses self.df, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.lagged_regressor_cols
self.estimator
self.pipeline
- Returns
time_properties – Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
- Return type
dict [str, any] or None, default None
- get_lagged_regressor_info()
Returns lagged regressor column names and minimal/maximal lag order. The lag order can be used to check potential imputation in the computation of lags.
Can be overridden by subclass.
- Returns
lagged_regressor_info – A dictionary that includes the lagged regressor column names and maximal/minimal lag order. The keys are:
lagged_regressor_cols : list [str] or None
See forecast_pipeline.
overall_min_lag_order : int or None
overall_max_lag_order : int or None
- Return type
dict
- get_pipeline()
Returns pipeline.
Implementation may be overridden by subclass if a different pipeline is desired.
Uses self.estimator, self.score_func, self.score_func_greater_is_better, self.config, self.regressor_cols.
Available parameters:
self.df
self.config
self.score_func
self.score_func_greater_is_better
self.regressor_cols
self.estimator
- Returns
pipeline – See forecast_pipeline.
- Return type
sklearn.pipeline.Pipeline
- score_func
Score function used to select optimal model in CV.
- score_func_greater_is_better
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better.
- regressor_cols
A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used.
- lagged_regressor_cols
A list of lagged regressor columns used in the training and prediction DataFrames. If None, no lagged regressor columns are used.
- pipeline
Pipeline to fit. The final named step must be called “estimator”.
- time_properties
Time properties dictionary (likely produced by get_forecast_time_properties).
- hyperparameter_grid
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
- config: Optional[ForecastConfig]
Forecast configuration.
- pipeline_params: Optional[Dict]
Parameters (keyword arguments) to call forecast_pipeline.
- class greykite.sklearn.estimator.auto_arima_estimator.AutoArimaEstimator(score_func: callable = <function mean_squared_error>, coverage: float = 0.9, null_model_params: ~typing.Optional[~typing.Dict] = None, regressor_cols: ~typing.Optional[~typing.List[str]] = None, freq: ~typing.Optional[float] = None, start_p: ~typing.Optional[int] = 2, d: ~typing.Optional[int] = None, start_q: ~typing.Optional[int] = 2, max_p: ~typing.Optional[int] = 5, max_d: ~typing.Optional[int] = 2, max_q: ~typing.Optional[int] = 5, start_P: ~typing.Optional[int] = 1, D: ~typing.Optional[int] = None, start_Q: ~typing.Optional[int] = 1, max_P: ~typing.Optional[int] = 2, max_D: ~typing.Optional[int] = 1, max_Q: ~typing.Optional[int] = 2, max_order: ~typing.Optional[int] = 5, m: ~typing.Optional[int] = 1, seasonal: ~typing.Optional[bool] = True, stationary: ~typing.Optional[bool] = False, information_criterion: ~typing.Optional[str] = 'aic', alpha: ~typing.Optional[int] = 0.05, test: ~typing.Optional[str] = 'kpss', seasonal_test: ~typing.Optional[str] = 'ocsb', stepwise: ~typing.Optional[bool] = True, n_jobs: ~typing.Optional[int] = 1, start_params: ~typing.Optional[~typing.Dict] = None, trend: ~typing.Optional[str] = None, method: ~typing.Optional[str] = 'lbfgs', maxiter: ~typing.Optional[int] = 50, offset_test_args: ~typing.Optional[~typing.Dict] = None, seasonal_test_args: ~typing.Optional[~typing.Dict] = None, suppress_warnings: ~typing.Optional[bool] = True, error_action: ~typing.Optional[str] = 'trace', trace: ~typing.Optional[~typing.Union[int, bool]] = False, random: ~typing.Optional[bool] = False, random_state: ~typing.Optional[~typing.Union[int, callable]] = None, n_fits: ~typing.Optional[int] = 10, out_of_sample_size: ~typing.Optional[int] = 0, scoring: ~typing.Optional[str] = 'mse', scoring_args: ~typing.Optional[~typing.Dict] = None, with_intercept: ~typing.Optional[~typing.Union[bool, str]] = 'auto', return_conf_int: ~typing.Optional[bool] = True, dynamic: 
~typing.Optional[bool] = False)[source]
Wrapper for pmdarima.arima.AutoARIMA. It currently does not handle the regressor issue when there is a gap between the train and predict periods.
- Parameters
score_func (callable) – see BaseForecastEstimator.
coverage (float between [0.0, 1.0]) – see BaseForecastEstimator.
null_model_params (dict with arguments to define DummyRegressor null model, optional, default=None) – see BaseForecastEstimator.
regressor_cols (list [str], optional, default None) – A list of regressor columns used during training and prediction. If None, no regressor columns are used.
See the AutoArima documentation for descriptions of the remaining parameters.
- model
Auto arima model object
- Type
AutoArima object
- fit_df
The training data used to fit the model.
- Type
pandas.DataFrame or None
- forecast
Output of the predict method of AutoArima.
- Type
pandas.DataFrame
Fits the ARIMA forecast model.
- Parameters
X (pandas.DataFrame) – Input timeseries, with timestamp column, value column, and any additional regressors. The value column is the response, included in X to allow transformation by sklearn.pipeline.Pipeline.
y (ignored) – The original timeseries values, ignored. (The y for fitting is included in X.)
time_col (str) – Time column name in X.
value_col (str) – Value column name in X.
fit_params (dict) – Additional parameters for the null model.
- Returns
self – Fitted model is stored in self.model.
- Return type
self
- predict(X, y=None)[source]
Creates forecast for the dates specified in X. Currently does not support the regressor case where there is a gap between the train and predict periods.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Timestamps are the dates for prediction. Value column, if provided in X, is ignored.
y (ignored) –
- Returns
predictions – Forecasted values for the dates in X. Columns:
TIME_COL: dates
PREDICTED_COL: predictions
PREDICTED_LOWER_COL: lower bound of predictions
PREDICTED_UPPER_COL: upper bound of predictions
- Return type
pandas.DataFrame
- summary()[source]
Creates a human-readable string of how the model works, including relevant diagnostics. These details cannot be extracted from the forecast alone. Prints model configuration. Extend this in child class to print the trained model parameters.
Log message is printed to the cst.LOGGER_NAME logger.
- fit_uncertainty(df: DataFrame, uncertainty_dict: dict, fit_params: Optional[dict] = None, **kwargs)
Fits the uncertainty model with a given df and uncertainty_dict.
- Parameters
df (pandas.DataFrame) – A dataframe representing the data to fit the uncertainty model.
uncertainty_dict (dict [str, any]) – The uncertainty model specification. It should have the following keys:
"uncertainty_method": a string that is in UncertaintyMethodEnum.
"params": a dictionary that includes any additional parameters needed by the uncertainty method.
fit_params (dict [str, any] or None, default None) – Parameters to be passed to the fit function.
kwargs (additional parameters to be fed into the uncertainty method) – These parameters are from the estimator attributes, not given by user.
- Return type
The function sets self.uncertainty_model and does not return anything.
- get_params(deep=True)
Get parameters for this estimator.
- predict_uncertainty(df: DataFrame, predict_params: Optional[dict] = None)
Makes predictions of prediction intervals for df based on the predictions and self.uncertainty_model.
- Parameters
df (pandas.DataFrame) – The dataframe to calculate prediction intervals upon. It should have either self.value_col_ or PREDICT_COL which the prediction interval is based on.
predict_params (dict [str, any] or None, default None) – Parameters to be passed to the predict function.
- Returns
result_df – The df with prediction interval columns.
- Return type
pandas.DataFrame
- score(X, y, sample_weight=None)
Default scorer for the estimator (used in GridSearchCV/RandomizedSearchCV if scoring=None).
Notes
If null_model_params is not None, returns R2_null_model_score of model error relative to null model, evaluated by score_func.
If null_model_params is None, returns score_func of the model itself.
By default, grid search (with no scoring parameter) optimizes improvement of score_func against null model.
To optimize a different score function, pass scoring to GridSearchCV/RandomizedSearchCV.
- Parameters
X (pandas.DataFrame) – Input timeseries with timestamp column and any additional regressors. Value column, if provided in X, is ignored.
y (pandas.Series or numpy.array) – Actual value, used to compute error.
sample_weight (pandas.Series or numpy.array) – Ignored.
- Returns
score – Comparison of predictions against null predictions, according to specified score function
- Return type
float or None
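The idea of scoring a model relative to a null model can be sketched as follows. This is an illustration, not the library's implementation: it assumes the score has the form 1 - loss(model) / loss(null model), uses mean absolute error as a stand-in for score_func, and builds the constant null prediction from the evaluated values themselves for brevity. All function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Loss function: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_null_model_sketch(y_true, y_pred, y_null, loss_func=mean_absolute_error):
    """Improvement of the model loss relative to the null model loss.

    Assumed form: 1 - loss(model) / loss(null).
    1.0 means a perfect model, 0.0 means no better than the null model,
    and negative values mean worse than the null model.
    Returns None if the null model loss is zero (ratio undefined).
    """
    null_loss = loss_func(y_true, y_null)
    if null_loss == 0:
        return None
    return 1.0 - loss_func(y_true, y_pred) / null_loss


y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]
# Constant prediction, similar in spirit to a "mean" strategy.
null_value = sum(y_true) / len(y_true)
y_null = [null_value] * len(y_true)

score = r2_null_model_sketch(y_true, y_pred, y_null)
print(score)  # model MAE is 0.5, null MAE is 2.0, so the score is 0.75
```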
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
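The <component>__<parameter> naming convention can be illustrated without any library. The sketch below only shows how such keys decompose into a component name and a parameter name; the helper split_nested_params and the parameter names are hypothetical, and the real routing is done by the estimator's set_params.

```python
def split_nested_params(params):
    """Groups ``<component>__<parameter>`` keys by component.

    Keys without "__" belong to the top-level object and are stored under None.
    Only the first "__" is split, so deeper nesting stays in the parameter name
    and can be resolved by the component itself.
    """
    grouped = {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[key] = value
    return grouped


# Hypothetical parameter names, for illustration only.
params = {
    "estimator__coverage": 0.9,
    "scaler__inner__with_mean": False,
    "verbose": True,
}
print(split_nested_params(params))
```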
Forecast Pipeline
- greykite.framework.pipeline.pipeline.forecast_pipeline(df: DataFrame, time_col='ts', value_col='y', date_format=None, tz=None, freq=None, train_end_date=None, anomaly_info=None, pipeline=None, regressor_cols=None, lagged_regressor_cols=None, estimator=SimpleSilverkiteEstimator(), hyperparameter_grid=None, hyperparameter_budget=None, n_jobs=1, verbose=1, forecast_horizon=None, coverage=0.95, test_horizon=None, periods_between_train_test=None, agg_periods=None, agg_func=None, score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics='ALL', null_model_params=None, relative_error_tolerance=None, cv_horizon=None, cv_min_train_periods=None, cv_expanding_window=False, cv_use_most_recent_splits=False, cv_periods_between_splits=None, cv_periods_between_train_test=None, cv_max_splits=3)[source]
Computation pipeline for end-to-end forecasting.
Trains a forecast model end-to-end:
checks input data
runs cross-validation to select optimal hyperparameters e.g. best model
evaluates best model on test set
provides forecast of best model (re-trained on all data) into the future
Returns forecasts with methods to plot and see diagnostics. Also returns the fitted pipeline and CV results.
Provides a high degree of customization over training and evaluation parameters:
model
cross validation
evaluation
forecast horizon
See test cases for examples.
- Parameters
df (pandas.DataFrame) – Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
time_col (str, default TIME_COL in constants.py) – Name of timestamp column in df.
value_col (str, default VALUE_COL in constants.py) – Name of value column in df (the values to forecast).
date_format (str or None, default None) – strftime format to parse time column, e.g. %m/%d/%Y. Note that %f will parse all the way up to nanoseconds. If None (recommended), inferred by pandas.to_datetime.
tz (str or None, default None) – Passed to pandas.tz_localize to localize the timestamp.
freq (str or None, default None) – Frequency of input data. Used to generate future dates for prediction. Frequency strings can have multiples, e.g. ‘5H’. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a list of frequency aliases. If None, inferred by pandas.infer_freq. Provide this parameter if df has missing timepoints.
train_end_date (datetime.datetime, optional, default None) – Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the last date with a non-null value in value_col of df.
anomaly_info (dict or list [dict] or None, default None) –
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done.
If None, no adjustments are made.
A dictionary containing the parameters to adjust_anomalous_data. See that function for details. The possible keys are:
- "value_col": str
The name of the column in df to adjust. You may adjust the value to forecast as well as any numeric regressors.
- "anomaly_df": pandas.DataFrame
Adjustments to correct the anomalies.
- "start_time_col": str, default START_TIME_COL
Start date column in anomaly_df.
- "end_time_col": str, default END_TIME_COL
End date column in anomaly_df.
- "adjustment_delta_col": str or None, default None
Impact column in anomaly_df.
- "filter_by_dict": dict or None, default None
Used to filter anomaly_df to the relevant anomalies for the value_col in this dictionary. Key specifies the column name, value specifies the filter value.
- "filter_by_value_col": str or None, default None
Adds {filter_by_value_col: value_col} to filter_by_dict if not None, for the value_col in this dictionary.
- "adjustment_method": str (“add” or “subtract”), default “add”
How to make the adjustment, if adjustment_delta_col is provided.
Accepts a list of such dictionaries to adjust multiple columns in df.
pipeline (sklearn.pipeline.Pipeline or None, default None) – Pipeline to fit. The final named step must be called “estimator”. If None, will use the default Pipeline from get_basic_pipeline.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. It should contain only the regressors that are being used in the grid search. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
estimator (instance of an estimator that implements greykite.algo.models.base_forecast_estimator.BaseForecastEstimator) – Estimator to use as the final step in the pipeline. Ignored if pipeline is provided.
forecast_horizon (int or None, default None) – Number of periods to forecast into the future. Must be > 0. If None, default is determined from input data frequency.
coverage (float or None, default=0.95) – Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Ignored if pipeline is provided; the coverage of the pipeline estimator is used instead.
test_horizon (int or None, default None) – Number of periods held back from end of df for test. The rest is used for cross validation. If None, default is forecast_horizon. Set to 0 to skip backtest.
periods_between_train_test (int or None, default None) – Number of periods for the gap between train and test data. If None, default is 0.
agg_periods (int or None, default None) –
Number of periods to aggregate before evaluation.
Model is fit and forecasted on the dataset’s original frequency.
Before evaluation, the actual and forecasted values are aggregated, using rolling windows of size agg_periods and the function agg_func (e.g. if the dataset is hourly, use agg_periods=24, agg_func=np.sum to evaluate performance on the daily totals).
If None, does not aggregate before evaluation.
Currently, this is only used when calculating CV metrics and the R2_null_model_score metric in backtest/forecast. No pre-aggregation is applied for the other backtest/forecast evaluation metrics.
agg_func (callable or None, default None) –
Takes an array and returns a number, e.g. np.max, np.sum.
Defines how to aggregate rolling windows of actual and predicted values before evaluation.
Ignored if agg_periods is None.
Currently, this is only used when calculating CV metrics and the R2_null_model_score metric in backtest/forecast. No pre-aggregation is applied for the other backtest/forecast evaluation metrics.
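The effect of aggregating before evaluation can be sketched with plain Python. The sketch assumes that the windows are consecutive and non-overlapping, and that leftover values which do not fill a window are dropped from the start of the array; both are assumptions about the behaviour, and the helper names are hypothetical.

```python
def aggregate_windows(values, agg_periods, agg_func):
    """Applies ``agg_func`` to consecutive windows of size ``agg_periods``.

    Assumption: windows do not overlap, and leftover values that do not
    fill a complete window are dropped from the start.
    """
    n_drop = len(values) % agg_periods
    kept = values[n_drop:]
    return [agg_func(kept[i:i + agg_periods]) for i in range(0, len(kept), agg_periods)]


def mean_absolute_percent_error(y_true, y_pred):
    """MAPE in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Two days of hourly data: the forecast alternates between too high and too low.
hourly_actual = [1.0] * 48
hourly_forecast = [1.5, 0.5] * 24

hourly_mape = mean_absolute_percent_error(hourly_actual, hourly_forecast)

# Evaluate on daily totals instead (agg_periods=24, agg_func=sum).
daily_actual = aggregate_windows(hourly_actual, 24, sum)
daily_forecast = aggregate_windows(hourly_forecast, 24, sum)
daily_mape = mean_absolute_percent_error(daily_actual, daily_forecast)

print(hourly_mape, daily_mape)  # hourly errors cancel within each day
```

In this constructed example the hourly errors cancel within each day, so the error on daily totals is much smaller than the hourly error. This is why the choice of agg_periods should match the granularity at which the forecast is consumed.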
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
cv_report_metrics (str, or list [str], or None, default CV_REPORT_METRICS_ALL) –
Additional metrics to compute during CV, besides the one specified by score_func.
If the string constant greykite.framework.constants.CV_REPORT_METRICS_ALL, computes all metrics in EvaluationMetricEnum. Also computes FRACTION_OUTSIDE_TOLERANCE if relative_error_tolerance is not None. The results are reported by the short name (.get_metric_name()) for EvaluationMetricEnum members and FRACTION_OUTSIDE_TOLERANCE_NAME for FRACTION_OUTSIDE_TOLERANCE. These names appear in the keys of forecast_result.grid_search.cv_results_ returned by this function.
If a list of strings, each of the listed metrics is computed. Valid strings are EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
For example:
["MeanSquaredError", "MeanAbsoluteError", "MeanAbsolutePercentError", "MedianAbsolutePercentError", "FractionOutsideTolerance2"]
If None, no additional metrics are computed.
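The FRACTION_OUTSIDE_TOLERANCE metric mentioned above is defined under relative_error_tolerance as the fraction of forecasted values whose relative error is strictly greater than the tolerance. A minimal sketch of that definition follows. The function name is hypothetical, and relative error is assumed here to mean |forecast - actual| / |actual|.

```python
def fraction_outside_tolerance(y_true, y_pred, rtol=0.05):
    """Fraction of forecasts whose relative error is strictly greater than ``rtol``.

    Assumes relative error is |pred - true| / |true| and that no actual value is zero.
    """
    outside = [abs(p - t) / abs(t) > rtol for t, p in zip(y_true, y_pred)]
    return sum(outside) / len(outside)


y_true = [100.0, 200.0, 50.0, 80.0]
y_pred = [104.0, 215.0, 50.0, 70.0]
# Relative errors are 0.04, 0.075, 0.0, 0.125, so two of four exceed 0.05.
print(fraction_outside_tolerance(y_true, y_pred, rtol=0.05))
```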
null_model_params (dict or None, default None) –
Defines baseline model to compute R2_null_model_score evaluation metric. R2_null_model_score is the improvement in the loss function relative to a null model. It can be used to evaluate model quality with respect to a simple baseline. For details, see r2_null_model_score.
The null model is a DummyRegressor, which returns constant predictions.
Valid keys are “strategy”, “constant”, “quantile”. See DummyRegressor. For example:
    null_model_params = {
        "strategy": "mean",
    }
    null_model_params = {
        "strategy": "median",
    }
    null_model_params = {
        "strategy": "quantile",
        "quantile": 0.8,
    }
    null_model_params = {
        "strategy": "constant",
        "constant": 2.0,
    }
If None, R2_null_model_score is not calculated.
Note: CV model selection always optimizes score_func, not the R2_null_model_score.
relative_error_tolerance (float or None, default None) – Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.
hyperparameter_grid (dict, list [dict] or None, default None) –
Sets properties of the steps in the pipeline, and specifies combinations to search over. Should be valid input to sklearn.model_selection.GridSearchCV (param_grid) or sklearn.model_selection.RandomizedSearchCV (param_distributions).
Prefix transform/estimator attributes by the name of the step in the pipeline. See details at: https://scikit-learn.org/stable/modules/compose.html#nested-parameters
If None, uses the default pipeline parameters.
hyperparameter_budget (int or None, default None) –
Max number of hyperparameter sets to try within the hyperparameter_grid search space.
Runs a full grid search if hyperparameter_budget is sufficient to exhaust the full hyperparameter_grid, otherwise samples uniformly at random from the space.
If None, uses defaults:
full grid search if all values are constant
10 if any value is a distribution to sample from
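The budget rules above can be written out as a small decision function. This is a sketch of the documented rules only, not the library's code: it assumes a single grid dictionary whose values are either lists of constants or distribution-like objects exposing an rvs method, and the names plan_search and FakeDistribution are hypothetical.

```python
from math import prod


def plan_search(hyperparameter_grid, hyperparameter_budget=None):
    """Returns (mode, n_candidates) according to the documented budget rules."""
    has_distribution = any(hasattr(v, "rvs") for v in hyperparameter_grid.values())
    if has_distribution:
        # The space cannot be enumerated, so candidates are sampled; default is 10.
        budget = 10 if hyperparameter_budget is None else hyperparameter_budget
        return ("random", budget)
    grid_size = prod(len(v) for v in hyperparameter_grid.values())
    if hyperparameter_budget is None or hyperparameter_budget >= grid_size:
        # Budget is sufficient to exhaust the grid.
        return ("full_grid", grid_size)
    return ("random", hyperparameter_budget)


class FakeDistribution:
    """Stand-in for a distribution object with an ``rvs`` sampling method."""

    def rvs(self, *args, **kwargs):
        return 0.5


# Hypothetical parameter names, prefixed by the pipeline step name.
grid = {"estimator__a": [1, 2, 3], "estimator__b": ["x", "y"]}
print(plan_search(grid))                       # all constants, no budget
print(plan_search(grid, 4))                    # budget smaller than the grid
print(plan_search({"estimator__a": FakeDistribution()}))
```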
n_jobs (int or None, default COMPUTATION_N_JOBS) – Number of jobs to run in parallel (the maximum number of concurrently running workers). -1 uses all CPUs. -2 uses all CPUs but one. None is treated as 1 unless in a joblib.Parallel backend context that specifies otherwise.
verbose (int, default 1) – Verbosity level during CV. If > 0, prints number of fits. If > 1, prints fit parameters, total score, and fit time. If > 2, prints train/test scores.
cv_horizon (int or None, default None) – Number of periods in each CV test set. If None, default is forecast_horizon. Set either cv_horizon or cv_max_splits to 0 to skip CV.
cv_min_train_periods (int or None, default None) – Minimum number of periods for training each CV fold. If cv_expanding_window is False, every training period is this size. If None, default is 2 * cv_horizon.
cv_expanding_window (bool, default False) – If True, the start of the training window for each CV split is fixed to the first available date. Otherwise, train start date is sliding, determined by cv_min_train_periods.
cv_use_most_recent_splits (bool, default False) – If True, splits from the end of the dataset are used. Else a sampling strategy is applied. Check _sample_splits for details.
cv_periods_between_splits (int or None, default None) – Number of periods to slide the test window between CV splits. If None, default is cv_horizon.
cv_periods_between_train_test (int or None, default None) – Number of periods for the gap between train and test in a CV split. If None, default is periods_between_train_test.
cv_max_splits (int or None, default 3) – Maximum number of CV splits. Given the above configuration, samples up to max_splits train/test splits, preferring splits toward the end of available data. If None, uses all splits. Set either cv_horizon or cv_max_splits to 0 to skip CV.
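To make the interaction of the cv_* parameters concrete, the sketch below builds rolling train/test index ranges from the end of the data backward. It is a simplified illustration, not the splitter used by the library: it always keeps the most recent splits (no sampling), takes the train/test gap default as 0, and returns half-open (start, end) index pairs. The function name is hypothetical.

```python
def rolling_cv_splits(n_periods, cv_horizon, cv_min_train_periods=None,
                      cv_expanding_window=False, cv_periods_between_splits=None,
                      cv_periods_between_train_test=0, cv_max_splits=3):
    """Returns a chronological list of ((train_start, train_end), (test_start, test_end))."""
    if cv_horizon == 0 or cv_max_splits == 0:
        return []  # CV is skipped
    min_train = 2 * cv_horizon if cv_min_train_periods is None else cv_min_train_periods
    step = cv_horizon if cv_periods_between_splits is None else cv_periods_between_splits
    if step <= 0:
        raise ValueError("cv_periods_between_splits must be positive in this sketch")
    splits = []
    test_end = n_periods
    while cv_max_splits is None or len(splits) < cv_max_splits:
        test_start = test_end - cv_horizon
        train_end = test_start - cv_periods_between_train_test
        # Expanding window: train from the first period. Sliding: fixed-size window.
        train_start = 0 if cv_expanding_window else train_end - min_train
        if train_start < 0 or train_end - train_start < min_train:
            break  # not enough history left for another split
        splits.append(((train_start, train_end), (test_start, test_end)))
        test_end -= step  # slide the test window toward the past
    return splits[::-1]


sliding = rolling_cv_splits(n_periods=20, cv_horizon=4)
expanding = rolling_cv_splits(n_periods=20, cv_horizon=4, cv_expanding_window=True)
print(sliding)
print(expanding)
```

With 20 periods and cv_horizon=4, the sliding variant trains on 8 periods per split, while the expanding variant always trains from the first period.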
- Returns
forecast_result – Forecast result. See ForecastResult for details.
If cv_horizon=0, forecast_result.grid_search.best_estimator_ and forecast_result.grid_search.best_params_ attributes are defined according to the provided single set of parameters. There must be a single set of parameters to skip cross-validation.
If test_horizon=0, forecast_result.backtest is None.
- Return type
ForecastResult
- class greykite.framework.pipeline.pipeline.ForecastResult(timeseries: Optional[UnivariateTimeSeries] = None, grid_search: Optional[RandomizedSearchCV] = None, model: Optional[Pipeline] = None, backtest: Optional[UnivariateForecast] = None, forecast: Optional[UnivariateForecast] = None)[source]
Forecast results. Contains results from cross-validation, backtest, and forecast, the trained model, and the original input data.
- timeseries: UnivariateTimeSeries = None
Input time series in standard format with stats and convenient plot functions.
- grid_search: RandomizedSearchCV = None
Result of cross-validation grid search on training dataset. The relevant attributes are:
- cv_results_: cross-validation scores
- best_estimator_: the model used for backtesting
- best_params_: the optimal parameters used for backtesting
Also see summarize_grid_search_results. We recommend using this function to extract results, rather than accessing cv_results_ directly.
- model: Pipeline = None
Model fitted on full dataset, using the best parameters selected via cross-validation. Has .fit(), .predict(), and diagnostic functions depending on the model.
- backtest: UnivariateForecast = None
Forecast on backtest period. Backtest period is a holdout test set to check forecast quality against the most recent actual values available. The best model from cross validation is refit on data prior to this period. The timestamps in backtest.df are sorted in ascending order. Has a .plot() method and attributes to get forecast vs actuals, evaluation results.
- forecast: UnivariateForecast = None
Forecast on future period. Future dates are after the train end date, following the holdout test set. The best model from cross validation is refit on data prior to this period. The timestamps in forecast.df are sorted in ascending order. Has a .plot() method and attributes to get forecast vs actuals, evaluation results.
Template Output
- class greykite.framework.input.univariate_time_series.UnivariateTimeSeries[source]
Defines univariate time series input. The dataset can include regressors, but only one metric is designated as the target metric to forecast.
Loads time series into a standard format. Provides statistics, plotting functions, and ability to generate future dataframe for prediction.
- df
Data frame containing timestamp and value, with standardized column names for internal use (TIME_COL, VALUE_COL). Rows are sorted by time index, and missing gaps between dates are filled in so that dates are spaced at regular intervals. Values are adjusted for anomalies according to anomaly_info. The index can be timezone aware (but TIME_COL is not).
- Type
pandas.DataFrame
- y
Value of time series to forecast.
- Type
pandas.Series, dtype float64
- time_stats
Summary statistics about the timestamp column.
- Type
dict
- value_stats
Summary statistics about the value column.
- Type
dict
- original_time_col
Name of time column in original input data.
- Type
str
- original_value_col
Name of value column in original input data.
- Type
str
- regressor_cols
A list of regressor columns in the training and prediction DataFrames.
- Type
list [str]
- lagged_regressor_cols
A list of additional columns needed for lagged regressors in the training and prediction DataFrames.
- Type
list [str]
- last_date_for_val
Date or timestamp corresponding to last non-null value in df[original_value_col].
- Type
datetime.datetime or None, default None
- last_date_for_reg
Date or timestamp corresponding to last non-null value in df[regressor_cols]. If regressor_cols is None, last_date_for_reg is None.
- Type
datetime.datetime or None, default None
- last_date_for_lag_reg
Date or timestamp corresponding to last non-null value in df[lagged_regressor_cols]. If lagged_regressor_cols is None, last_date_for_lag_reg is None.
- Type
datetime.datetime or None, default None
- train_end_date
Last date or timestamp in fit_df. It is always less than or equal to the minimum of the non-null values of last_date_for_val and last_date_for_reg.
- Type
datetime.datetime
- fit_cols
A list of columns used in the training and prediction DataFrames.
- Type
list [str]
- fit_df
Data frame containing timestamp and value, with standardized column names for internal use. Will be used for fitting (train, cv, backtest).
- Type
pandas.DataFrame
- fit_y
Value of time series for fit_df.
- Type
pandas.Series, dtype float64
- freq
Timeseries frequency, DateOffset alias, e.g. {‘T’ (minute), ‘H’, ‘D’, ‘W’, ‘M’ (month end), ‘MS’ (month start), ‘Y’ (year end), ‘YS’ (year start)}. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
- Type
str
- anomaly_info
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done. See self.load_data().
- Type
dict or list [dict] or None, default None
- df_before_adjustment
self.df before adjustment by anomaly_info. Used by self.plot() to show the adjustment.
- Type
pandas.DataFrame or None, default None
- load_data(df: DataFrame, time_col: str = 'ts', value_col: str = 'y', freq: Optional[str] = None, date_format: Optional[str] = None, tz: Optional[str] = None, train_end_date: Optional[Union[str, datetime]] = None, regressor_cols: Optional[List[str]] = None, lagged_regressor_cols: Optional[List[str]] = None, anomaly_info: Optional[Union[Dict, List[Dict]]] = None)[source]
Loads data to internal representation. Parses date column, sets timezone aware index. Checks for irregularities and raises an error if input is invalid. Adjusts for anomalies according to anomaly_info.
- Parameters
df (pandas.DataFrame) – Input timeseries. A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
freq (str or None, default None) – Timeseries frequency, DateOffset alias. If None, automatically inferred.
date_format (str or None, default None) – strftime format to parse time column, e.g. %m/%d/%Y. Note that %f will parse all the way up to nanoseconds. If None (recommended), inferred by pandas.to_datetime.
tz (str or pytz.timezone object or None, default None) – Passed to pandas.tz_localize to localize the timestamp.
train_end_date (str or datetime.datetime or None, default None) – Last date to use for fitting the model. Forecasts are generated after this date. If None, it is set to the minimum of self.last_date_for_val and self.last_date_for_reg.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
anomaly_info (dict or list [dict] or None, default None) –
Anomaly adjustment info. Anomalies in df are corrected before any forecasting is done.
If None, no adjustments are made.
A dictionary containing the parameters to adjust_anomalous_data. See that function for details. The possible keys are:
- "value_col": str
The name of the column in df to adjust. You may adjust the value to forecast as well as any numeric regressors.
- "anomaly_df": pandas.DataFrame
Adjustments to correct the anomalies.
- "start_time_col": str, default START_TIME_COL
Start date column in anomaly_df.
- "end_time_col": str, default END_TIME_COL
End date column in anomaly_df.
- "adjustment_delta_col": str or None, default None
Impact column in anomaly_df.
- "filter_by_dict": dict or None, default None
Used to filter anomaly_df to the relevant anomalies for the value_col in this dictionary. Key specifies the column name, value specifies the filter value.
- "filter_by_value_col": str or None, default None
Adds {filter_by_value_col: value_col} to filter_by_dict if not None, for the value_col in this dictionary.
- "adjustment_method": str (“add” or “subtract”), default “add”
How to make the adjustment, if adjustment_delta_col is provided.
Accepts a list of such dictionaries to adjust multiple columns in df.
- Returns
self – Sets self.df with standard column names, value adjusted for anomalies, and time gaps filled in, sorted by time index.
- Return type
Returns self.
- describe_time_col()[source]
Basic descriptive stats on the timeseries time column.
- Returns
time_stats –
Dictionary with descriptive stats on the timeseries time column.
- data_points: int
number of time points
- mean_increment_secs: float
mean frequency
- min_timestamp: datetime64
start date
- max_timestamp: datetime64
end date
- Return type
dict
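A minimal sketch of how statistics with these keys could be computed is shown below. It uses standard-library datetime objects rather than datetime64, assumes at least two timestamps, and assumes mean_increment_secs is the average gap between consecutive timestamps. The function name is hypothetical.

```python
from datetime import datetime


def describe_timestamps(timestamps):
    """Returns a dict with the four documented keys for a list of datetimes."""
    ordered = sorted(timestamps)
    total_secs = (ordered[-1] - ordered[0]).total_seconds()
    return {
        "data_points": len(ordered),
        # Average gap between consecutive timestamps (assumed definition).
        "mean_increment_secs": total_secs / (len(ordered) - 1),
        "min_timestamp": ordered[0],
        "max_timestamp": ordered[-1],
    }


daily = [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3)]
time_stats = describe_timestamps(daily)
print(time_stats["mean_increment_secs"])  # 86400.0 seconds for daily data
```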
- describe_value_col()[source]
Basic descriptive stats on the timeseries value column.
- Returns
value_stats – Dict with keys: count, mean, std, min, 25%, 50%, 75%, max
- Return type
dict [str, float]
- make_future_dataframe(periods: Optional[int] = None, include_history=True)[source]
Extends the input data for prediction into the future.
Includes the historical values (VALUE_COL) so this can be fed into a Pipeline that transforms input data for fitting, and for use in evaluation.
- Parameters
- Returns
future_df – Dataframe with future timestamps for prediction. Contains columns for:
prediction dates (TIME_COL),
values (VALUE_COL),
optional regressors
- Return type
pandas.DataFrame
- plot(color='rgb(32, 149, 212)', show_anomaly_adjustment=False, **kwargs)[source]
Returns interactive plotly graph of the value against time.
If anomaly info is provided, there is an option to show the anomaly adjustment.
- Parameters
color (str, default “rgb(32, 149, 212)” (light blue)) – Color of the value line (after adjustment, if applicable).
show_anomaly_adjustment (bool, default False) – Whether to show the anomaly adjustment.
kwargs (additional parameters) – Additional parameters to pass to plot_univariate such as title and color.
- Returns
fig – Interactive plotly graph of the value against time.
See plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
plotly.graph_objects.Figure
- get_grouping_evaluation(aggregation_func=<function nanmean>, aggregation_func_name='mean', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None)[source]
Group-wise computation of aggregated timeseries value. Can be used to evaluate error/aggregated value by a time feature, over time, or by a user-provided column.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.
- Parameters
aggregation_func (callable, optional, default numpy.nanmean) – Function that aggregates an array to a number. Signature (y: array) -> aggregated value: float.
aggregation_func_name (str or None, optional, default “mean”) – Name of grouping function, used to report results. If None, defaults to “aggregation”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
- Returns
grouped_df – pandas.DataFrame with two columns:
grouping_func_name: evaluation metric for aggregation of timeseries.
group name: group name depends on the grouping method: groupby_time_feature for groupby_time_feature, cst.TIME_COL for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.
- Return type
pandas.DataFrame
- plot_grouping_evaluation(aggregation_func=<function nanmean>, aggregation_func_name='mean', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, xlabel=None, ylabel=None, title=None)[source]
Computes aggregated timeseries by group and plots the result. Can be used to plot aggregated timeseries by a time feature, over time, or by a user-provided column.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.
- Parameters
aggregation_func (callable, optional, default numpy.nanmean) – Function that aggregates an array to a number. Signature (y: array) -> aggregated value: float.
aggregation_func_name (str or None, optional, default “mean”) – Name of grouping function, used to report results. If None, defaults to “aggregation”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot.
title (str or None, optional) – Plot title. If None, default is based on axis labels.
- Returns
fig – plotly graph object showing aggregated timeseries by group. x-axis label depends on the grouping method: groupby_time_feature for groupby_time_feature, TIME_COL for groupby_sliding_window_size, groupby_custom_column.name for groupby_custom_column.
- Return type
plotly.graph_objects.Figure
- get_quantiles_and_overlays(groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, show_mean=False, show_quantiles=False, show_overlays=False, overlay_label_time_feature=None, overlay_label_sliding_window_size=None, overlay_label_custom_column=None, center_values=False, value_col='y', mean_col_name='mean', quantile_col_prefix='Q', **overlay_pivot_table_kwargs)[source]
Computes mean, quantiles, and overlays by the requested grouping dimension.
Overlays are best explained in the plotting context. The grouping dimension goes on the x-axis, and one line is shown for each level of the overlay dimension. This function returns a column for each line to plot (e.g. mean, each quantile, each overlay value).
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided as the grouping dimension.
If show_overlays is True, exactly one of: overlay_label_time_feature, overlay_label_sliding_window_size, overlay_label_custom_column can be provided to specify the label_col (overlay dimension). Internally, the function calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col to get the overlay values for plotting. You can pass additional parameters to pandas.DataFrame.pivot_table via overlay_pivot_table_kwargs, e.g. to change the aggregation method. If an explicit label is not provided, the records are labeled by their position within the group.
For example, to show yearly seasonality mean, quantiles, and overlay plots for each individual year, use:
    self.get_quantiles_and_overlays(
        groupby_time_feature="doy",          # Rows: a row for each day of year (1, 2, ..., 366)
        show_mean=True,                      # mean value on that day
        show_quantiles=[0.1, 0.9],           # quantiles of the observed distribution on that day
        show_overlays=True,                  # Include overlays defined by ``overlay_label_time_feature``
        overlay_label_time_feature="year")   # One column for each observed "year" (2016, 2017, 2018, ...)
To show weekly seasonality over time, use:
    self.get_quantiles_and_overlays(
        groupby_time_feature="dow",            # Rows: a row for each day of week (1, 2, ..., 7)
        show_mean=True,                        # mean value on that day
        show_quantiles=[0.1, 0.5, 0.9],        # quantiles of the observed distribution on that day
        show_overlays=True,                    # Include overlays defined by ``overlay_label_sliding_window_size``
        overlay_label_sliding_window_size=90,  # One column for each 90 period sliding window in the dataset,
        aggfunc="median")                      # overlay value is the median value for the dow over the period (default="mean").
It may be difficult to assess the weekly seasonality from the previous result, because overlays shift up/down over time due to trend/yearly seasonality. Use center_values=True to adjust each overlay so its average value is centered at 0. Mean and quantiles are shifted by a single constant to center the mean at 0, while preserving their relative values:
    self.get_quantiles_and_overlays(
        groupby_time_feature="dow",
        show_mean=True,
        show_quantiles=[0.1, 0.5, 0.9],
        show_overlays=True,
        overlay_label_sliding_window_size=90,
        aggfunc="median",
        center_values=True)  # Centers the output
Centering reduces the variability in the overlays to make it easier to isolate the effect by the groupby column. As a result, centered overlays have smaller variability than that reported by the quantiles, which operate on the original, uncentered data points. Similarly, if overlays are aggregates of individual values (i.e. aggfunc is needed in the call to pandas.DataFrame.pivot_table), the quantiles of overlays will be less extreme than those of the original data.
To assess variability conditioned on the groupby value, check the quantiles.
To assess variability conditioned on both the groupby and overlay value, after any necessary aggregation, check the variability of the overlay values. Compute quantiles of overlays from the return value if desired.
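The centering described for center_values=True can be sketched in plain Python. This illustrates only the arithmetic stated above (each overlay is shifted by its own average, while mean and quantiles share one shift taken from the mean column); it is not the library's implementation, and the function names and numbers are made up for the example.

```python
def center_overlay(values):
    """Shifts one overlay so that its average value is 0."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]


def center_mean_and_quantiles(mean_col, quantile_cols):
    """Shifts the mean and all quantile columns by one constant.

    The constant is the average of the mean column, so the mean is centered
    at 0 and the distance between mean and quantiles is preserved.
    """
    shift = sum(mean_col) / len(mean_col)
    centered_mean = [m - shift for m in mean_col]
    centered_quantiles = {
        name: [q - shift for q in col] for name, col in quantile_cols.items()
    }
    return centered_mean, centered_quantiles


# Three groups (e.g. three days of week) and two overlays at different levels.
overlays = {"window_0": [10.0, 12.0, 14.0], "window_1": [20.0, 22.0, 24.0]}
centered_overlays = {name: center_overlay(col) for name, col in overlays.items()}

mean_col = [15.0, 17.0, 19.0]
quantile_cols = {"Q0.1": [11.0, 13.0, 15.0], "Q0.9": [19.0, 21.0, 23.0]}
centered_mean, centered_quantiles = center_mean_and_quantiles(mean_col, quantile_cols)

print(centered_overlays)   # both overlays now follow the same centered pattern
print(centered_mean, centered_quantiles)
```

After centering, the two overlays coincide even though their original levels differ, which is what makes the groupby effect easier to see.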
- Parameters
groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.
show_mean (bool, default False) – Whether to return the mean value by the groupby column.
show_quantiles (bool or list [float] or numpy.array, default False) – Whether to return the quantiles of the value by the groupby column. If False, does not return quantiles. If True, returns default quantiles (0.1 and 0.9). If array-like, a list of quantiles to compute (e.g. (0.1, 0.25, 0.75, 0.9)).
show_overlays (bool or int or array-like [int or str] or None, default False) –
Whether to return overlays of the value by the groupby column.
If False, no overlays are shown.
If True and label_col is defined, calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col. label_col is defined by one of overlay_label_time_feature, overlay_label_sliding_window_size, or overlay_label_custom_column. Returns one column for each value of the label_col.
If True and the label_col is not defined, returns the raw values within each group. Values across groups are put into columns by their position in the group (1st element in group, 2nd, 3rd, etc.). Positional order in a group is not guaranteed to correspond to anything meaningful, so the items within a column may not have anything in common. It is better to specify one of overlay_* to explicitly define the overlay labels.
If an integer, the number of overlays to randomly sample. The same as True, then randomly samples up to int columns. This is useful if there are too many values.
If a list [int], a list of column indices (int type). The same as True, then selects the specified columns by index.
If a list [str], a list of column names. Column names are matched by their string representation to the names in this list. The same as True, then selects the specified columns by name.
overlay_label_time_feature (str or None, default None) –
If
show_overlays
is True, can be used to definelabel_col
, i.e. which dimension to show separately as overlays.If provided, uses a column generated by
build_time_features_df
. See that function for valid values.overlay_label_sliding_window_size (int or None, default None) –
If
show_overlays
is True, can be used to definelabel_col
, i.e. which dimension to show separately as overlays.If provided, uses a column that sequentially partitions data into groups of size
groupby_sliding_window_size
.overlay_label_custom_column (
pandas.Series
or None, default None) –If
show_overlays
is True, can be used to definelabel_col
, i.e. which dimension to show separately as overlays.If provided, uses this column value. Should be same length as the DataFrame.
value_col (str, default VALUE_COL) – The column name for the value column. By default, shows the univariate time series value, but it can be any other column in
self.df
.mean_col_name (str, default “mean”) – The name to use for the mean column in the output. Applies if
show_mean=True
.quantile_col_prefix (str, default “Q”) – The prefix to use for quantile column names in the output. Columns are named with this prefix followed by the quantile, rounded to 2 decimal places.
center_values (bool, default False) –
Whether to center the return values. If True, shifts each overlay so its average value is centered at 0. Shifts mean and quantiles by a constant to center the mean at 0, while preserving their relative values.
If False, values are not centered.
overlay_pivot_table_kwargs (additional parameters) – Additional keyword parameters to pass to
pandas.DataFrame.pivot_table
, used in generating the overlays. See above description for details.
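As a rough illustration of the pivot step named in the show_overlays description, the sketch below calls pandas.DataFrame.pivot_table directly on a toy frame. The column names `dow`, `window` and `y`, the data, and the integer-division window labels are assumptions made for this example; how the library labels sliding windows internally is not shown in this reference.

```python
import pandas as pd

# Hypothetical toy data: 12 observations, with a 3-level grouping column for brevity.
df = pd.DataFrame({
    "dow": [1, 2, 3] * 4,
    "y": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 12.0, 22.0, 32.0],
})

# Assumed labeling: sequentially partition rows into windows of 6 observations.
window_size = 6
df["window"] = [i // window_size for i in range(len(df))]

# One row per groupby value, one column per overlay label.
# aggfunc is needed because each (dow, window) cell holds two observations.
overlays = df.pivot_table(
    index="dow",
    columns="window",
    values="y",
    aggfunc="median")
```

Here `aggfunc` plays the role of an entry in overlay_pivot_table_kwargs: it decides how multiple observations in the same cell are reduced to a single overlay value.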
- Returns
grouped_df – Dataframe with mean, quantiles, and overlays by the grouping column. Overlays are defined by the grouping column and overlay dimension.
ColumnIndex is a multiindex with first level as the “category”, a subset of [MEAN_COL_GROUP, QUANTILE_COL_GROUP, OVERLAY_COL_GROUP] depending on what is requested.
grouped_df[MEAN_COL_GROUP] = df with single column, named mean_col_name.
grouped_df[QUANTILE_COL_GROUP] = df with a column for each quantile, named f"{quantile_col_prefix}{round(q, 2)}", where q is the quantile.
grouped_df[OVERLAY_COL_GROUP] = df with one column per overlay value, named by the overlay value.
For example, it might look like:
category    mean    quantile        overlay
name        mean    Q0.1    Q0.9    2007    2008    2009
doy
1           8.42    7.72    9.08    8.29    7.75    8.33
2           8.82    8.20    9.56    8.43    8.80    8.53
3           8.95    8.25    9.88    8.26    9.12    8.70
4           9.07    8.60    9.49    8.10    9.99    8.73
5           8.73    8.29    9.24    7.95    9.26    8.37
...         ...     ...     ...     ...     ...     ...
- Return type
- plot_quantiles_and_overlays(groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, show_mean=False, show_quantiles=False, show_overlays=False, overlay_label_time_feature=None, overlay_label_sliding_window_size=None, overlay_label_custom_column=None, center_values=False, value_col='y', mean_col_name='mean', quantile_col_prefix='Q', mean_style=None, quantile_style=None, overlay_style=None, xlabel=None, ylabel=None, title=None, showlegend=True, **overlay_pivot_table_kwargs)[source]
Plots mean, quantiles, and overlays by the requested grouping dimension.
The grouping dimension goes on the x-axis, and one line is shown for the mean, each quantile, and each level of the overlay dimension, as requested. By default, shading is applied between the quantiles.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided as the grouping dimension.
If show_overlays is True, exactly one of: overlay_label_time_feature, overlay_label_sliding_window_size, overlay_label_custom_column can be provided to specify the label_col (overlay dimension). Internally, the function calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col to get the overlay values for plotting. You can pass additional parameters to pandas.DataFrame.pivot_table via overlay_pivot_table_kwargs, e.g. to change the aggregation method. If an explicit label is not provided, the records are labeled by their position within the group.
For example, to show yearly seasonality mean, quantiles, and overlay plots for each individual year, use:
self.plot_quantiles_and_overlays(
    groupby_time_feature="doy",         # Rows: a row for each day of year (1, 2, ..., 366)
    show_mean=True,                     # mean value on that day
    show_quantiles=[0.1, 0.9],          # quantiles of the observed distribution on that day
    show_overlays=True,                 # Include overlays defined by ``overlay_label_time_feature``
    overlay_label_time_feature="year")  # One column for each observed "year" (2016, 2017, 2018, ...)
To show weekly seasonality over time, use:
self.plot_quantiles_and_overlays(
    groupby_time_feature="dow",            # Rows: a row for each day of week (1, 2, ..., 7)
    show_mean=True,                        # mean value on that day
    show_quantiles=[0.1, 0.5, 0.9],        # quantiles of the observed distribution on that day
    show_overlays=True,                    # Include overlays defined by ``overlay_label_sliding_window_size``
    overlay_label_sliding_window_size=90,  # One column for each 90-period sliding window in the dataset
    aggfunc="median")                      # overlay value is the median value for the dow over the period (default="mean")
It may be difficult to assess the weekly seasonality from the previous result, because overlays shift up/down over time due to trend/yearly seasonality. Use center_values=True to adjust each overlay so its average value is centered at 0. Mean and quantiles are shifted by a single constant to center the mean at 0, while preserving their relative values:
self.plot_quantiles_and_overlays(
    groupby_time_feature="dow",
    show_mean=True,
    show_quantiles=[0.1, 0.5, 0.9],
    show_overlays=True,
    overlay_label_sliding_window_size=90,
    aggfunc="median",
    center_values=True)  # Centers the output
Centering reduces the variability in the overlays to make it easier to isolate the effect by the groupby column. As a result, centered overlays have smaller variability than that reported by the quantiles, which operate on the original, uncentered data points. Similarly, if overlays are aggregates of individual values (i.e. aggfunc is needed in the call to pandas.DataFrame.pivot_table), the quantiles of overlays will be less extreme than those of the original data.
To assess variability conditioned on the groupby value, check the quantiles.
To assess variability conditioned on both the groupby and overlay value, after any necessary aggregation, check the variability of the overlay values. Compute quantiles of overlays from the return value if desired.
- Parameters
groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.
show_mean (bool, default False) – Whether to return the mean value by the groupby column.
show_quantiles (bool or list [float] or numpy.array, default False) – Whether to return the quantiles of the value by the groupby column. If False, does not return quantiles. If True, returns default quantiles (0.1 and 0.9). If array-like, a list of quantiles to compute (e.g. (0.1, 0.25, 0.75, 0.9)).
show_overlays (bool or int or array-like [int or str], default False) –
Whether to return overlays of the value by the groupby column.
If False, no overlays are shown.
If True and label_col is defined, calls pandas.DataFrame.pivot_table with index=groupby_col, columns=label_col, values=value_col. label_col is defined by one of overlay_label_time_feature, overlay_label_sliding_window_size, or overlay_label_custom_column. Returns one column for each value of the label_col.
If True and the label_col is not defined, returns the raw values within each group. Values across groups are put into columns by their position in the group (1st element in group, 2nd, 3rd, etc.). Positional order in a group is not guaranteed to correspond to anything meaningful, so the items within a column may not have anything in common. It is better to specify one of overlay_* to explicitly define the overlay labels.
If an integer, the number of overlays to randomly sample. Behaves the same as True, then randomly samples up to that many columns. This is useful if there are too many values.
If a list [int], a list of column indices (int type). Behaves the same as True, then selects the specified columns by index.
If a list [str], a list of column names. Column names are matched by their string representation to the names in this list. Behaves the same as True, then selects the specified columns by name.
overlay_label_time_feature (str or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.
If provided, uses a column generated by build_time_features_df. See that function for valid values.
overlay_label_sliding_window_size (int or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.
If provided, uses a column that sequentially partitions data into groups of size overlay_label_sliding_window_size.
overlay_label_custom_column (pandas.Series or None, default None) –
If show_overlays is True, can be used to define label_col, i.e. which dimension to show separately as overlays.
If provided, uses this column value. Should be same length as the DataFrame.
value_col (str, default VALUE_COL) – The column name for the value column. By default, shows the univariate time series value, but it can be any other column in self.df.
mean_col_name (str, default “mean”) – The name to use for the mean column in the output. Applies if show_mean=True.
quantile_col_prefix (str, default “Q”) – The prefix to use for quantile column names in the output. Columns are named with this prefix followed by the quantile, rounded to 2 decimal places.
center_values (bool, default False) –
Whether to center the return values. If True, shifts each overlay so its average value is centered at 0. Shifts mean and quantiles by a constant to center the mean at 0, while preserving their relative values.
If False, values are not centered.
mean_style (dict or None, default None) –
How to style the mean line, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:
mean_style = {
    "line": dict(
        width=2,
        color="#595959"),  # gray
    "legendgroup": MEAN_COL_GROUP}
quantile_style (dict or None, default None) –
How to style the quantile lines, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:
quantile_style = {
    "line": dict(
        width=2,
        color="#1F9AFF",  # blue
        dash="solid"),
    "legendgroup": QUANTILE_COL_GROUP,  # show/hide them together
    "fill": "tonexty"}
Note that fill style is removed from the first quantile line, to fill only between items in the same category.
overlay_style (dict or None, default None) –
How to style the overlay lines, passed as keyword arguments to plotly.graph_objects.Scatter. If None, the default is:
overlay_style = {
    "opacity": 0.5,  # makes it easier to see density
    "line": dict(
        width=1,
        color="#B3B3B3",  # light gray
        dash="solid"),
    "legendgroup": OVERLAY_COL_GROUP}
xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot. If None, uses value_col.
title (str or None, default None) – Plot title. If None, default is based on axis labels.
showlegend (bool, default True) – Whether to show the legend.
overlay_pivot_table_kwargs (additional parameters) – Additional keyword parameters to pass to pandas.DataFrame.pivot_table, used in generating the overlays. See get_quantiles_and_overlays description for details.
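Since the style parameters are plain dictionaries of plotly.graph_objects.Scatter keyword arguments, a custom style can be built the same way as the defaults listed above. The sketch below is an assumption-laden example: the colors are arbitrary, the string "quantile" is a placeholder for the value of QUANTILE_COL_GROUP (not shown in this reference), and `ts` in the commented call is a hypothetical object exposing this method.

```python
# Hypothetical override: thin dashed orange quantile lines without shading.
quantile_style = {
    "line": dict(
        width=1,
        color="#FF7F0E",  # orange
        dash="dash"),
    "legendgroup": "quantile",  # placeholder for QUANTILE_COL_GROUP
    "fill": "none"}             # Scatter fill mode that disables shading

# Keyword arguments for the plotting call, using parameters documented above.
plot_kwargs = dict(
    groupby_time_feature="dow",
    show_mean=True,
    show_quantiles=[0.1, 0.9],
    quantile_style=quantile_style)

# fig = ts.plot_quantiles_and_overlays(**plot_kwargs)  # `ts` is assumed, not defined here
```

A style passed this way is described as being used in place of the default, so any default key that should be kept (for example "legendgroup") needs to be repeated in the custom dictionary.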
- Returns
fig – plotly graph object showing the mean, quantiles, and overlays.
- Return type
See also
get_quantiles_and_overlays
To get the mean, quantiles, and overlays as a pandas.DataFrame without plotting.
- class greykite.framework.output.univariate_forecast.UnivariateForecast(df, time_col='ts', actual_col='actual', predicted_col='forecast', predicted_lower_col='forecast_lower', predicted_upper_col='forecast_upper', null_model_predicted_col='forecast_null', ylabel='y', train_end_date=None, test_start_date=None, forecast_horizon=None, coverage=0.95, r2_loss_function=<function mean_squared_error>, estimator=None, relative_error_tolerance=None)[source]
Stores predicted and actual values. Provides functionality to evaluate a forecast:
plots predicted against actual with prediction bands.
evaluates model performance.
Input should be one of two kinds of forecast results:
model fit to train data, forecast on test set (actuals available).
model fit to all data, forecast on future dates (actuals not available).
The input df is a concatenation of fitted and forecasted values.
- df
Timestamp, predicted, and actual values.
- Type
- time_col
Column in df with timestamp (default “ts”).
- Type
str
- actual_col
Column in df with actual values (default “actual”).
- Type
str
- predicted_col
Column in df with predicted values (default “forecast”).
- Type
str
- predicted_lower_col
Column in df with predicted lower bound (default “forecast_lower”, optional).
- Type
str or None
- predicted_upper_col
Column in df with predicted upper bound (default “forecast_upper”, optional).
- Type
str or None
- null_model_predicted_col
Column in df with predicted value of null model (default “forecast_null”, optional).
- Type
str or None
- ylabel
Unit of measurement (default “y”)
- Type
str
- train_end_date
End date for train period. If None, assumes all data were used for training.
- Type
str or
datetime
or None, default None
- test_start_date
Start date of test period. If None, set to the time_col value immediately after train_end_date. This assumes that all data not used in training were used for testing.
- Type
str or
datetime
or None, default None
- forecast_horizon
Number of periods forecasted into the future. Must be > 0.
- Type
int or None, default None
- coverage
Intended coverage of the prediction bands (0.0 to 1.0).
- Type
float or None
- r2_loss_function
Loss function to calculate cst.R2_null_model_score, with signature loss_func(y_true, y_pred) (default mean_squared_error)
- Type
function
- estimator
The fitted estimator, the last step in the forecast pipeline.
- Type
An instance of an estimator that implements greykite.models.base_forecast_estimator.BaseForecastEstimator.
- relative_error_tolerance
Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.
- Type
float or None, default None
- df_train
Subset of df where df[time_col] <= train_end_date.
- Type
- df_test
Subset of df where df[time_col] > train_end_date.
- Type
- train_evaluation
Evaluation metrics on training set.
- Type
dict [str, float]
- test_evaluation
Evaluation metrics on test set (if actual values provided after train_end_date).
- Type
dict [str, float]
- test_na_count
Count of NA values in test data.
- Type
int
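The Outside Tolerance metric described under relative_error_tolerance can be sketched as follows. This is a minimal illustration of the stated definition, assuming relative error is |forecast - actual| / |actual| and that no actual value is zero; the function name `outside_tolerance` and the data are hypothetical.

```python
def outside_tolerance(actuals, forecasts, relative_error_tolerance):
    """Fraction of forecasts whose relative error is strictly greater than the tolerance."""
    relative_errors = [
        abs(forecast - actual) / abs(actual)  # assumes actual != 0
        for actual, forecast in zip(actuals, forecasts)
    ]
    n_outside = sum(error > relative_error_tolerance for error in relative_errors)
    return n_outside / len(relative_errors)


actuals = [100.0, 200.0, 50.0, 80.0]
forecasts = [104.0, 230.0, 50.0, 100.0]

# Relative errors are 0.04, 0.15, 0.0 and 0.25, so two of four exceed 0.05.
fraction = outside_tolerance(actuals, forecasts, relative_error_tolerance=0.05)
```

Because the comparison is strict, a forecast whose relative error equals the tolerance exactly is counted as within tolerance.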
- compute_evaluation_metrics_split()[source]
Computes __evaluation_metrics for train and test set separately.
- Returns
dictionary with train and test evaluation metrics
- plot(**kwargs)[source]
Plots predicted against actual.
- Parameters
kwargs (additional parameters) – Additional parameters to pass to plot_forecast_vs_actual such as title, colors, and line styling.
- Returns
fig – Plotly figure of forecast against actuals, with prediction intervals if available.
See plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
- get_grouping_evaluation(score_func=<function add_finite_filter_to_scorer.<locals>.score_func_finite>, score_func_name='MAPE', which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None)[source]
Group-wise computation of forecasting error. Can be used to evaluate error or aggregated value by a time feature, over time, or by a user-provided column.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.
- Parameters
score_func (callable, optional) – Function that maps two arrays to a number. Signature (y_true: array, y_pred: array) -> error: float
score_func_name (str or None, optional) – Name of the score function used to report results. If None, defaults to “metric”.
which (str) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
- Returns
grouped_df – pandas.DataFrame with two columns:
grouping_func_name: evaluation metric computing forecasting error of timeseries.
group name: group name depends on the grouping method:
groupby_time_feature for groupby_time_feature
cst.TIME_COL for groupby_sliding_window_size
groupby_custom_column.name for groupby_custom_column.
- Return type
pandas.DataFrame
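To make the group-wise computation concrete, here is a plain-Python sketch of scoring by sequential windows, in the spirit of groupby_sliding_window_size. It is not the library's code: the helper names, the integer window labels, and the choice to start windows at the first observation are assumptions for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percent error, in percent. Assumes no true value is zero."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def grouped_score(actuals, forecasts, window_size, score_func):
    """Applies score_func(y_true, y_pred) to each sequential window of observations."""
    scores = {}
    for start in range(0, len(actuals), window_size):
        stop = start + window_size
        group_label = start // window_size  # assumed label: 0, 1, 2, ...
        scores[group_label] = score_func(actuals[start:stop], forecasts[start:stop])
    return scores


actuals = [100.0, 100.0, 200.0, 200.0]
forecasts = [110.0, 90.0, 200.0, 220.0]
scores = grouped_score(actuals, forecasts, window_size=2, score_func=mape)
```

Any function with the documented signature (y_true, y_pred) -> float can be swapped in for `mape`.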
- plot_grouping_evaluation(score_func=<function add_finite_filter_to_scorer.<locals>.score_func_finite>, score_func_name='MAPE', which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, xlabel=None, ylabel=None, title=None)[source]
Computes error by group and plots the result. Can be used to plot error by a time feature, over time, or by a user-provided column.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.
- Parameters
score_func (callable, optional) – Function that maps two arrays to a number. Signature (y_true: array, y_pred: array) -> error: float
score_func_name (str or None, optional) – Name of the score function used to report results. If None, defaults to “metric”.
which (str, optional, default “train”) – Which dataset to evaluate, “train” or “test”.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str, optional, default None) – X-axis label of the plot.
ylabel (str, optional, default None) – Y-axis label of the plot.
title (str or None, optional) – Plot title. If None, this function creates a suitable title.
- Returns
fig – plotly graph object showing forecasting error by group. x-axis label depends on the grouping method:
groupby_time_feature for groupby_time_feature
time_col for groupby_sliding_window_size
groupby_custom_column.name for groupby_custom_column.
- Return type
- autocomplete_map_func_dict(map_func_dict)[source]
Sweeps through map_func_dict, converting values that are ElementwiseEvaluationMetricEnum member names to their corresponding row-wise evaluation function with appropriate column names for this UnivariateForecast instance.
For example:
map_func_dict = {
    "squared_error": ElementwiseEvaluationMetricEnum.SquaredError.name,
    "coverage": ElementwiseEvaluationMetricEnum.Coverage.name,
    "custom_metric": custom_function
}
is converted to
map_func_dict = {
    "squared_error": lambda row: ElementwiseEvaluationMetricEnum.SquaredError.get_metric_func()(
        row[self.actual_col],
        row[self.predicted_col]),
    "coverage": lambda row: ElementwiseEvaluationMetricEnum.Coverage.get_metric_func()(
        row[self.actual_col],
        row[self.predicted_lower_col],
        row[self.predicted_upper_col]),
    "custom_metric": custom_function
}
- Parameters
map_func_dict (dict or None) – Same as flexible_grouping_evaluation, with one exception: values may be an ElementwiseEvaluationMetricEnum member name. These are converted to a callable for flexible_grouping_evaluation.
- Returns
map_func_dict – Can be passed to flexible_grouping_evaluation.
- Return type
dict
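The name-to-function expansion described above can be mimicked with a small stand-in. In the sketch below, `METRICS` is a hypothetical registry playing the role of ElementwiseEvaluationMetricEnum, rows are plain dictionaries, and `autocomplete` is an illustrative helper rather than the method itself.

```python
# Hypothetical registry: metric name -> (elementwise function, row columns it consumes).
METRICS = {
    "Residual": (lambda actual, forecast: actual - forecast, ("actual", "forecast")),
    "SquaredError": (lambda actual, forecast: (actual - forecast) ** 2, ("actual", "forecast")),
}


def autocomplete(map_func_dict):
    """Replaces known metric names with row-wise callables; leaves other values as-is."""
    if map_func_dict is None:
        return None
    completed = {}
    for new_col, value in map_func_dict.items():
        if isinstance(value, str) and value in METRICS:
            func, cols = METRICS[value]
            # Bind func and cols as defaults so each lambda keeps its own metric.
            completed[new_col] = lambda row, func=func, cols=cols: func(*(row[c] for c in cols))
        else:
            completed[new_col] = value  # already a callable, pass through
    return completed


row = {"actual": 10.0, "forecast": 7.0}
completed = autocomplete({
    "residual": "Residual",
    "squared_error": "SquaredError",
    "custom_metric": lambda row: row["actual"] * 2,
})
results = {name: func(row) for name, func in completed.items()}
```

The default-argument binding matters: without it, every generated lambda would refer to the last metric seen in the loop.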
- get_flexible_grouping_evaluation(which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, map_func_dict=None, agg_kwargs=None, extend_col_names=False)[source]
Group-wise computation of evaluation metrics. Whereas self.get_grouping_evaluation computes one metric, this allows computation of any number of custom metrics.
For example:
Mean and quantiles of squared error by group.
Mean and quantiles of residuals by group.
Mean and quantiles of actual and forecast by group.
% of actuals outside prediction intervals by group
any combination of the above metrics by the same group
First adds a groupby column by passing groupby_ parameters to add_groupby_column. Then computes grouped evaluation metrics by passing map_func_dict, agg_kwargs and extend_col_names to flexible_grouping_evaluation.
Exactly one of: groupby_time_feature, groupby_sliding_window_size, groupby_custom_column must be provided.
- which: str
“train” or “test”. Which dataset to evaluate.
- groupby_time_feature: str or None, optional
If provided, groups by a column generated by build_time_features_df. See that function for valid values.
- groupby_sliding_window_size: int or None, optional
If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
- groupby_custom_column: pandas.Series or None, optional
If provided, groups by this column value. Should be same length as the DataFrame.
- map_func_dict: dict [str, callable] or None, default None
Row-wise transformation functions to create new columns. If None, no new columns are added.
key: new column name
value: row-wise function to apply to df to generate the column value. Signature (row: pandas.DataFrame) -> transformed value: float.
For example:
map_func_dict = {
    "residual": lambda row: row["actual"] - row["forecast"],
    "squared_error": lambda row: (row["actual"] - row["forecast"])**2
}
Some predefined functions are available in ElementwiseEvaluationMetricEnum. For example:
map_func_dict = {
    "residual": lambda row: ElementwiseEvaluationMetricEnum.Residual.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "squared_error": lambda row: ElementwiseEvaluationMetricEnum.SquaredError.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "q90_loss": lambda row: ElementwiseEvaluationMetricEnum.Quantile90.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "abs_percent_error": lambda row: ElementwiseEvaluationMetricEnum.AbsolutePercentError.get_metric_func()(
        row["actual"],
        row["forecast"]),
    "coverage": lambda row: ElementwiseEvaluationMetricEnum.Coverage.get_metric_func()(
        row["actual"],
        row["forecast_lower"],
        row["forecast_upper"]),
}
As shorthand, it is sufficient to provide the enum member name. These are auto-expanded into the appropriate function. So the following is equivalent:
map_func_dict = {
    "residual": ElementwiseEvaluationMetricEnum.Residual.name,
    "squared_error": ElementwiseEvaluationMetricEnum.SquaredError.name,
    "q90_loss": ElementwiseEvaluationMetricEnum.Quantile90.name,
    "abs_percent_error": ElementwiseEvaluationMetricEnum.AbsolutePercentError.name,
    "coverage": ElementwiseEvaluationMetricEnum.Coverage.name,
}
- agg_kwargs: dict or None, default None
Passed as keyword args to pandas.core.groupby.DataFrameGroupBy.aggregate after creating new columns and grouping by groupby_col.
See pandas.core.groupby.DataFrameGroupBy.aggregate or flexible_grouping_evaluation for details.
- extend_col_names: bool or None, default False
How to flatten index after aggregation. In some cases, the column index after aggregation is a multi-index. This parameter controls how to flatten an index with 2 levels to 1 level.
If None, the index is not flattened.
If True, column name is a composite: {index0}_{index1}. Use this option if index1 is not unique.
If False, column name is simply {index1}.
Ignored if the ColumnIndex after aggregation has only one level (e.g. if named aggregation is used in agg_kwargs).
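The three steps described by these parameters (row-wise mapping, grouped aggregation, column flattening) can be sketched directly in pandas. This is an illustration under stated assumptions, not the library's implementation: the toy frame, the `dow` grouping column, and the helper `flatten_columns` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with an existing groupby column.
df = pd.DataFrame({
    "dow": [1, 1, 2, 2],
    "actual": [10.0, 12.0, 20.0, 22.0],
    "forecast": [9.0, 13.0, 18.0, 25.0],
})

# Step 1: row-wise functions create new columns (role of map_func_dict).
map_func_dict = {
    "residual": lambda row: row["actual"] - row["forecast"],
    "squared_error": lambda row: (row["actual"] - row["forecast"]) ** 2,
}
for new_col, func in map_func_dict.items():
    df[new_col] = df.apply(func, axis=1)

# Step 2: grouped aggregation (role of agg_kwargs); yields a 2-level column index.
agg_kwargs = {"func": {"residual": ["mean"], "squared_error": ["mean", "max"]}}
grouped = df.groupby("dow").agg(**agg_kwargs)


# Step 3: flatten the 2-level column index (role of extend_col_names).
def flatten_columns(frame, extend_col_names):
    if extend_col_names is None or frame.columns.nlevels == 1:
        return frame
    flat = frame.copy()
    if extend_col_names:
        flat.columns = [f"{index0}_{index1}" for index0, index1 in frame.columns]
    else:
        flat.columns = [index1 for _, index1 in frame.columns]
    return flat


flat_true = flatten_columns(grouped, extend_col_names=True)
flat_false = flatten_columns(grouped, extend_col_names=False)
```

With extend_col_names=False the two "mean" columns collide, which is the situation the True option is meant for.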
- Returns
df_transformed – df after transformation and optional aggregation.
If groupby_col is None, returns df with additional columns as the keys in map_func_dict. Otherwise, df is grouped by groupby_col and this becomes the index. Columns are determined by agg_kwargs and extend_col_names.
- Return type
- plot_flexible_grouping_evaluation(which='train', groupby_time_feature=None, groupby_sliding_window_size=None, groupby_custom_column=None, map_func_dict=None, agg_kwargs=None, extend_col_names=False, y_col_style_dict='auto-fill', default_color='rgba(0, 145, 202, 1.0)', xlabel=None, ylabel=None, title=None, showlegend=True)[source]
Plots group-wise evaluation metrics. Whereas plot_grouping_evaluation shows one metric, this can show any number of custom metrics.
For example:
Mean and quantiles of squared error by group.
Mean and quantiles of residuals by group.
Mean and quantiles of actual and forecast by group.
% of actuals outside prediction intervals by group
any combination of the above metrics by the same group
See get_flexible_grouping_evaluation for details.
- which: str
“train” or “test”. Which dataset to evaluate.
- groupby_time_feature: str or None, optional
If provided, groups by a column generated by build_time_features_df. See that function for valid values.
- groupby_sliding_window_size: int or None, optional
If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
- groupby_custom_column: pandas.Series or None, optional
If provided, groups by this column value. Should be same length as the DataFrame.
- map_func_dict: dict [str, callable] or None, default None
Grouping evaluation metric specification, along with agg_kwargs. See get_flexible_grouping_evaluation.
- agg_kwargs: dict or None, default None
Grouping evaluation metric specification, along with map_func_dict. See get_flexible_grouping_evaluation.
- extend_col_names: bool or None, default False
How to name the grouping metrics. See get_flexible_grouping_evaluation.
- y_col_style_dict: dict [str, dict or None] or “plotly” or “auto” or “auto-fill”, default “auto-fill”
The column(s) to plot on the y-axis, and how to style them. The names should match those generated by agg_kwargs and extend_col_names. The function get_flexible_grouping_evaluation can be used to check the column names.
For convenience, start with “auto-fill” or “plotly”, then adjust styling as needed.
See plot_multivariate for details.
- default_color: str, default “rgba(0, 145, 202, 1.0)” (blue)
Default line color when y_col_style_dict is one of “auto”, “auto-fill”.
- xlabel: str or None, default None
x-axis label. If None, default is x_col.
- ylabel: str or None, default None
y-axis label. If None, y-axis is not labeled.
- title: str or None, default None
Plot title. If None and ylabel is provided, a default title is used.
- showlegend: bool, default True
Whether to show the legend.
- Returns
fig – Interactive plotly graph showing the evaluation metrics.
See plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
- make_univariate_time_series()[source]
Converts prediction into a UnivariateTimeSeries. Useful to convert a forecast into the input regressor for a subsequent forecast.
- Returns
UnivariateTimeSeries
- plot_components(**kwargs)[source]
Class method to plot the components of a UnivariateForecast object.
Silverkite calculates component plots based on fit dataset. Prophet calculates component plots based on predict dataset.
For estimator specific component plots with advanced plotting options call self.estimator.plot_components().
- Returns
fig – Figure plotting components against appropriate time scale.
- Return type
plotly.graph_objects.Figure for Silverkite, matplotlib.figure.Figure for Prophet
- class greykite.algo.common.model_summary.ModelSummary(x, y, pred_cols, pred_category, fit_algorithm, ml_model, max_colwidth=20)[source]
A class to store regression model summary statistics.
The class can be printed to get a well formatted model summary.
- x
The design matrix.
- Type
- beta
The estimated coefficients.
- Type
- y
The response.
- Type
- pred_cols
List of predictor names.
- Type
list [ str ]
- pred_category
Predictor category, returned by
create_pred_category
.- Type
dict
- fit_algorithm
The name of algorithm to fit the regression.
- Type
str
- ml_model
The trained machine learning model class.
- Type
class
- max_colwidth
The maximum length for predictors to be shown in their original name. If the maximum length of predictors exceeds this parameter, all predictor names will be suppressed and only indices are shown.
- Type
int
- info_dict
The model summary dictionary, output of
_get_summary
- Type
dict
- html_str
An html formatting of the string representation of the model summary.
- Type
str
- _get_summary()[source]
Gets the model summary from input. This function is called during initialization.
- Returns
info_dict – Includes direct and derived metrics about the trained model. For detailed keys, refer to
get_info_dict_lm
orget_info_dict_tree
.- Return type
dict
- get_coef_summary(is_intercept=None, is_time_feature=None, is_event=None, is_trend=None, is_seasonality=None, is_lag=None, is_regressor=None, is_interaction=None, return_df=False)[source]
Gets the coefficient summary filtered by conditions.
- Parameters
is_intercept (bool or None, default None) – Intercept or not.
is_time_feature (bool or None, default None) – Time features or not. Time features belong to TimeFeaturesEnum.
is_event (bool or None, default None) – Event features or not. Event features have EVENT_PREFIX.
is_trend (bool or None, default None) – Trend features or not. Trend features have CHANGEPOINT_COL_PREFIX or “cpd”.
is_seasonality (bool or None, default None) – Seasonality feature or not. Seasonality features have SEASONALITY_REGEX.
is_lag (bool or None, default None) – Lagged features or not. Lagged features have “lag”.
is_regressor (bool or None, default None) – Extra regressor features or not. These are the extra features provided by users through extra_pred_cols in the fit function.
is_interaction (bool or None, default None) – Interaction feature or not. Interaction features have “:”.
return_df (bool, default False) – If True, the filtered coefficient summary df is also returned. Otherwise, the filtered coefficient summary df is printed only.
- Returns
filtered_coef_summary – If return_df is set to True, returns the coefficient summary df filtered by the given conditions.
- Return type
pandas.DataFrame or None
Constants
- class greykite.common.aggregation_function_enum.AggregationFunctionEnum(value)[source]
Defines some common aggregation functions that can be retrieved by their names.
Every function is wrapped with partial because Enum handles functions differently from values. Wrapping with partial allows us to extract the function with variable keys.
- class greykite.common.evaluation.EvaluationMetricEnum(value)[source]
Valid evaluation metrics. The values tuple is
(score_func: callable, greater_is_better: boolean, short_name: str)
add_finite_filter_to_scorer is added to the metrics that are directly imported from sklearn.metrics (e.g. mean_squared_error) to ensure that the metric gets calculated even when inputs have missing values.
- Correlation = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, True, 'CORR')
Pearson correlation coefficient between forecast and actuals. Higher is better.
- CoefficientOfDetermination = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, True, 'R2')
Coefficient of determination. See
sklearn.metrics.r2_score
. Higher is better. Equals 1.0 - mean_squared_error / variance(actuals).
- MeanSquaredError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MSE')
Mean squared error, the average of squared differences, see
sklearn.metrics.mean_squared_error
.
- RootMeanSquaredError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'RMSE')
Root mean squared error, the square root of sklearn.metrics.mean_squared_error.
- MeanAbsoluteError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MAE')
Mean absolute error, average of absolute differences, see
sklearn.metrics.mean_absolute_error
.
- MedianAbsoluteError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MedAE')
Median absolute error, median of absolute differences, see
sklearn.metrics.median_absolute_error
.
- MeanAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MAPE')
Mean absolute percent error, error relative to actuals expressed as a %, see wikipedia MAPE.
- MedianAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'MedAPE')
Median absolute percent error, median of error relative to actuals expressed as a %, a median version of the MeanAbsolutePercentError, less affected by extreme values.
- SymmetricMeanAbsolutePercentError = (<function add_finite_filter_to_scorer.<locals>.score_func_finite>, False, 'sMAPE')
Symmetric mean absolute percent error, error relative to (actuals+forecast) expressed as a %. Note that we do not include a factor of 2 in the denominator, so the range is 0% to 100%, see wikipedia sMAPE.
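The sMAPE definition above can be illustrated with a minimal sketch. The helper name smape_percent is hypothetical and this is not greykite's implementation; taking absolute values in the denominator and skipping pairs where both values are zero are assumptions made here. Without the factor of 2, a forecast of 0 against a nonzero actual gives the maximum of 100%.

```python
def smape_percent(y_true, y_pred):
    """Mean of |actual - forecast| / (|actual| + |forecast|), as a percent.

    Hypothetical helper for illustration only. Pairs where both values
    are zero are skipped (an assumption, since 0/0 is undefined).
    """
    terms = []
    for actual, forecast in zip(y_true, y_pred):
        denom = abs(actual) + abs(forecast)
        if denom == 0:
            continue  # assumption: ignore undefined terms
        terms.append(abs(actual - forecast) / denom)
    if not terms:
        raise ValueError("no valid (actual, forecast) pairs")
    return 100.0 * sum(terms) / len(terms)


perfect = smape_percent([100.0], [100.0])  # forecast equals actual
worst = smape_percent([100.0], [0.0])      # forecast of zero hits the upper bound
halfway = smape_percent([100.0], [50.0])   # 50 / 150, about 33.3%
```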
- Quantile80 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q80')
Quantile loss with q=0.80:
np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()
- Quantile95 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q95')
Quantile loss with q=0.95:
np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()
- Quantile99 = (<function quantile_loss_q.<locals>.quantile_loss_wrapper>, False, 'Q99')
Quantile loss with q=0.99:
np.where(y_true < y_pred, (1 - q) * (y_pred - y_true), q * (y_true - y_pred)).mean()
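The quantile loss expression shown for Q80, Q95 and Q99 can be written without NumPy as a small sketch. The function name quantile_loss is hypothetical; the branch logic follows the np.where expression above term by term. For a high q, forecasting below the actual is penalized much more heavily than forecasting above it.

```python
def quantile_loss(y_true, y_pred, q):
    """Plain-Python equivalent of the np.where(...) expression above.

    Hypothetical helper for illustration only.
    """
    losses = [
        (1 - q) * (pred - true) if true < pred else q * (true - pred)
        for true, pred in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)


# Same absolute error of 2, opposite directions, q=0.95.
over_forecast = quantile_loss([10.0], [12.0], q=0.95)   # (1 - 0.95) * 2
under_forecast = quantile_loss([10.0], [8.0], q=0.95)   # 0.95 * 2
```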
- FractionOutsideTolerance1 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.01), False, 'OutsideTolerance1p')
Fraction of forecasted values that deviate more than 1% from the actual
- FractionOutsideTolerance2 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.02), False, 'OutsideTolerance2p')
Fraction of forecasted values that deviate more than 2% from the actual
- FractionOutsideTolerance3 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.03), False, 'OutsideTolerance3p')
Fraction of forecasted values that deviate more than 3% from the actual
- FractionOutsideTolerance4 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.04), False, 'OutsideTolerance4p')
Fraction of forecasted values that deviate more than 4% from the actual
- FractionOutsideTolerance5 = (functools.partial(<function fraction_outside_tolerance>, rtol=0.05), False, 'OutsideTolerance5p')
Fraction of forecasted values that deviate more than 5% from the actual
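The (score_func, greater_is_better, short_name) layout and the finite filter described for this enum can be sketched in plain Python. Everything below is a hypothetical stand-in: MetricDemo, with_finite_filter and fraction_outside_tolerance_demo are not greykite names, and the relative-deviation rule is an assumption about how "deviate more than 5% from the actual" could be computed.

```python
import math
from enum import Enum
from functools import partial


def fraction_outside_tolerance_demo(y_true, y_pred, rtol=0.05):
    """Fraction of forecasts whose relative deviation from the actual exceeds rtol."""
    outside = [abs(p - t) > rtol * abs(t) for t, p in zip(y_true, y_pred)]
    return sum(outside) / len(outside)


def with_finite_filter(score_func):
    """Wrap score_func so that pairs containing NaN or inf are dropped first."""
    def score_func_finite(y_true, y_pred):
        pairs = [(t, p) for t, p in zip(y_true, y_pred)
                 if math.isfinite(t) and math.isfinite(p)]
        return score_func([t for t, _ in pairs], [p for _, p in pairs])
    return score_func_finite


class MetricDemo(Enum):
    # (score_func, greater_is_better, short_name), mirroring the tuple layout above
    OutsideTolerance5 = (
        with_finite_filter(partial(fraction_outside_tolerance_demo, rtol=0.05)),
        False,
        "OutsideTolerance5p",
    )


score_func, greater_is_better, short_name = MetricDemo.OutsideTolerance5.value
actuals = [100.0, 100.0, 100.0, float("nan")]
forecasts = [104.0, 106.0, 100.0, 50.0]
# The NaN pair is ignored; one of the three remaining forecasts is off by more than 5%.
result = score_func(actuals, forecasts)
```

Keeping greater_is_better next to the function lets calling code decide whether to minimize or maximize a metric without special-casing its name.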
Constants used by code in common or in multiple places: algo, sklearn, and/or framework.
- greykite.common.constants.TIME_COL = 'ts'
The default name for the column with the timestamps of the time series.
- greykite.common.constants.VALUE_COL = 'y'
The default name for the column with the values of the time series.
- greykite.common.constants.ACTUAL_COL = 'actual'
The column name representing actual (observed) values.
- greykite.common.constants.PREDICTED_COL = 'forecast'
The column name representing the predicted values.
- greykite.common.constants.RESIDUAL_COL = 'residual'
The column name representing the forecast residuals.
- greykite.common.constants.PREDICTED_LOWER_COL = 'forecast_lower'
The column name representing lower bounds of prediction interval.
- greykite.common.constants.PREDICTED_UPPER_COL = 'forecast_upper'
The column name representing upper bounds of prediction interval.
- greykite.common.constants.NULL_PREDICTED_COL = 'forecast_null'
The column name representing predicted values from null model.
- greykite.common.constants.ERR_STD_COL = 'err_std'
The column name representing the error standard deviation from models.
- greykite.common.constants.QUANTILE_SUMMARY_COL = 'quantile_summary'
The column name representing the quantile summary from models.
- greykite.common.constants.R2_null_model_score = 'R2_null_model_score'
Evaluation metric. Improvement in the specified loss function compared to the predictions of a null model.
- greykite.common.constants.FRACTION_OUTSIDE_TOLERANCE = 'Outside Tolerance (fraction)'
Evaluation metric. The fraction of predictions outside the specified tolerance level.
- greykite.common.constants.PREDICTION_BAND_WIDTH = 'Prediction Band Width (%)'
Evaluation metric. Relative size of prediction bands vs actual, as a percent.
- greykite.common.constants.PREDICTION_BAND_COVERAGE = 'Prediction Band Coverage (fraction)'
Evaluation metric. Fraction of observations within the bands.
- greykite.common.constants.LOWER_BAND_COVERAGE = 'Coverage: Lower Band'
Evaluation metric. Fraction of observations within the lower band.
- greykite.common.constants.UPPER_BAND_COVERAGE = 'Coverage: Upper Band'
Evaluation metric. Fraction of observations within the upper band.
- greykite.common.constants.COVERAGE_VS_INTENDED_DIFF = 'Coverage Diff: Actual_Coverage - Intended_Coverage'
Evaluation metric. Difference between actual and intended coverage.
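As a rough illustration of the coverage metrics named above, the sketch below computes the fraction of observations inside the prediction band and its difference from an intended coverage. The column-name constants are copied from this listing; the helper functions, the sample rows and the choice of inclusive bounds are assumptions, not greykite's implementation.

```python
# Column names copied from greykite.common.constants as listed above.
ACTUAL_COL = "actual"
PREDICTED_LOWER_COL = "forecast_lower"
PREDICTED_UPPER_COL = "forecast_upper"


def band_coverage(rows):
    """Fraction of rows whose actual lies within [lower, upper] (inclusive, by assumption)."""
    inside = [
        row[PREDICTED_LOWER_COL] <= row[ACTUAL_COL] <= row[PREDICTED_UPPER_COL]
        for row in rows
    ]
    return sum(inside) / len(inside)


def coverage_diff(rows, intended_coverage):
    """Actual coverage minus intended coverage, as in the constant's name."""
    return band_coverage(rows) - intended_coverage


# Hypothetical data: the second row falls above its upper bound.
rows = [
    {ACTUAL_COL: 10.0, PREDICTED_LOWER_COL: 8.0, PREDICTED_UPPER_COL: 12.0},
    {ACTUAL_COL: 15.0, PREDICTED_LOWER_COL: 9.0, PREDICTED_UPPER_COL: 13.0},
    {ACTUAL_COL: 11.0, PREDICTED_LOWER_COL: 9.5, PREDICTED_UPPER_COL: 12.5},
    {ACTUAL_COL: 7.0, PREDICTED_LOWER_COL: 6.0, PREDICTED_UPPER_COL: 9.0},
]
coverage = band_coverage(rows)
diff = coverage_diff(rows, intended_coverage=0.95)  # negative: bands are too narrow
```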
- greykite.common.constants.EVENT_DF_DATE_COL = 'date'
Name of date column for the DataFrames passed to silverkite custom_daily_event_df_dict.
- greykite.common.constants.EVENT_DF_LABEL_COL = 'event_name'
Name of event column for the DataFrames passed to silverkite custom_daily_event_df_dict.
- greykite.common.constants.EVENT_PREFIX = 'events'
Prefix for naming event features.
- greykite.common.constants.EVENT_DEFAULT = ''
Label used for days without an event.
- greykite.common.constants.EVENT_INDICATOR = 'event'
Binary indicator for an event.
- greykite.common.constants.IS_EVENT_COL = 'is_event'
Indicator column in feature matrix, 1 if the day is an event or its neighboring days.
- greykite.common.constants.IS_EVENT_ADJACENT_COL = 'is_event_adjacent'
Indicator column in feature matrix, 1 if the day is adjacent to an event.
- greykite.common.constants.IS_EVENT_EXACT_COL = 'is_event_exact'
Indicator column in feature matrix, 1 if the day is an event but not its neighboring days.
- greykite.common.constants.EVENT_SHIFTED_SUFFIX_BEFORE = '_before'
The suffix for neighboring events before the events added to the event names.
- greykite.common.constants.EVENT_SHIFTED_SUFFIX_AFTER = '_after'
The suffix for neighboring events after the events added to the event names.
- greykite.common.constants.CHANGEPOINT_COL_PREFIX = 'changepoint'
Prefix for naming changepoint features.
- greykite.common.constants.CHANGEPOINT_COL_PREFIX_SHORT = 'cp'
Short prefix for naming changepoint features.
- greykite.common.constants.START_TIME_COL = 'start_time'
Default column name for anomaly start time in the anomaly dataframe.
- greykite.common.constants.END_TIME_COL = 'end_time'
Default column name for anomaly end time in the anomaly dataframe.
- greykite.common.constants.ADJUSTMENT_DELTA_COL = 'adjustment_delta'
Default column name for anomaly adjustment in the anomaly dataframe.
- greykite.common.constants.METRIC_COL = 'metric'
Column to denote metric of interest.
- greykite.common.constants.DIMENSION_COL = 'dimension'
Dimension column.
- greykite.common.constants.ANOMALY_COL = 'is_anomaly'
Default column name for anomaly labels (boolean) in the time series.
- greykite.common.constants.PREDICTED_ANOMALY_COL = 'is_anomaly_predicted'
Default column name for predicted anomaly labels (boolean) in the time series.
- greykite.common.constants.SEVERITY_SCORE_COL = 'severity_score'
Default column name for anomaly severity score in the anomaly dataframe.
- greykite.common.constants.USER_REVIEWED_COL = 'is_user_reviewed'
Default column name for whether an anomaly is reviewed by the user (boolean) in the anomaly dataframe.
- greykite.common.constants.NEW_PATTERN_ANOMALY_COL = 'new_pattern_anomaly'
Default column name for whether an anomaly is a new pattern (boolean) in the anomaly dataframe.
- class greykite.common.constants.TimeFeaturesEnum(value)[source]
Time features generated by build_time_features_df.
The item names are lower-case letters (kept the same as the values) to make existence checks easier. To check whether a string s is in this Enum, use
s in TimeFeaturesEnum.__dict__["_member_names_"]
. Checking membership directly against TimeFeaturesEnum is deprecated as of Python 3.8.
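The membership check described above can be tried on a small stand-in enum. TimeFeaturesDemo and its three members are hypothetical and only mimic the "names equal values" layout; the second helper uses the standard __members__ mapping, which gives the same answer for names.

```python
from enum import Enum


class TimeFeaturesDemo(Enum):
    """Hypothetical miniature of an enum whose names equal its values."""
    dow = "dow"
    hour = "hour"
    is_weekend = "is_weekend"


def is_time_feature(name):
    # The check recommended in the text above.
    return name in TimeFeaturesDemo.__dict__["_member_names_"]


def is_time_feature_via_members(name):
    # Equivalent check through the public __members__ mapping.
    return name in TimeFeaturesDemo.__members__


known = is_time_feature("dow")
unknown = is_time_feature("not_a_feature")
```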
- class greykite.common.constants.GrowthColEnum(value)[source]
Human-readable names for the growth columns generated by build_time_features_df.
The names are the human-readable names, and the values are the corresponding column names generated by build_time_features_df.
- greykite.common.constants.LAG_INFIX = '_lag'
Infix for lagged feature names.
- greykite.common.constants.AGG_LAG_INFIX = 'avglag'
Infix for aggregated lag feature names.
- greykite.common.constants.TREND_REGEX = 'changepoint\\d|ct\\d|ct_|cp\\d'
Growth terms, including changepoints.
- greykite.common.constants.SEASONALITY_REGEX = 'sin\\d|cos\\d'
Seasonality terms modeled by fourier series.
- greykite.common.constants.EVENT_REGEX = 'events_'
Event terms.
- greykite.common.constants.LAG_REGEX = '_lag\\d|_avglag_\\d'
Lag terms.
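A short sketch of how the four regular expressions above could be used to group feature names. The patterns are copied from this listing (the doubled backslashes in the listing are the repr of a single backslash); the feature names and the use of re.search are assumptions for illustration. A name can match more than one pattern, for example a yearly Fourier term built on ct1.

```python
import re

# Patterns copied from greykite.common.constants as listed above.
PATTERNS = {
    "trend": r"changepoint\d|ct\d|ct_|cp\d",
    "seasonality": r"sin\d|cos\d",
    "event": r"events_",
    "lag": r"_lag\d|_avglag_\d",
}


def classify_feature(name):
    """Return the sorted labels of every pattern found anywhere in the name."""
    return sorted(
        label for label, pattern in PATTERNS.items() if re.search(pattern, name)
    )


# Hypothetical feature names.
labels = {
    name: classify_feature(name)
    for name in ["ct1", "sin1_tow_weekly", "events_Christmas Day",
                 "y_lag7", "y_avglag_7_14_21", "sin1_ct1_yearly", "dow"]
}
```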
- greykite.common.constants.LOGGER_NAME = 'Greykite'
Name used by the logger.
Constants used by `~greykite.framework.
- greykite.framework.constants.EVALUATION_PERIOD_CV_MAX_SPLITS = 3
Default value for EvaluationPeriodParam().cv_max_splits
- greykite.framework.constants.COMPUTATION_N_JOBS = 1
Default value for ComputationParam.n_jobs
- greykite.framework.constants.COMPUTATION_VERBOSE = 1
Default value for ComputationParam.verbose
- greykite.framework.constants.CV_REPORT_METRICS_ALL = 'ALL'
Set cv_report_metrics to this value to compute all metrics during CV
- greykite.framework.constants.FRACTION_OUTSIDE_TOLERANCE_NAME = 'OutsideTolerance'
Short name used to report the result of FRACTION_OUTSIDE_TOLERANCE in CV
- greykite.framework.constants.CUSTOM_SCORE_FUNC_NAME = 'score'
Short name used to report the result of custom score_func in CV
- greykite.framework.constants.MEAN_COL_GROUP = 'mean'
Columns with mean.
- greykite.framework.constants.QUANTILE_COL_GROUP = 'quantile'
Columns with quantile.
- greykite.framework.constants.OVERLAY_COL_GROUP = 'overlay'
Columns with overlay.
- greykite.framework.constants.FORECAST_STEP_COL = 'forecast_step'
The column name for forecast step in benchmarking
- class greykite.algo.forecast.silverkite.constants.silverkite_constant.SilverkiteConstant[source]
Uses the appropriate constant mixins to provide all the constants that will be used by Silverkite.
- get_silverkite_column() Type[SilverkiteColumn]
Return the SilverkiteColumn constants
- get_silverkite_components_enum() Type[SilverkiteComponentsEnum]
Return the SilverkiteComponentsEnum constants
- get_silverkite_holiday() Type[SilverkiteHoliday]
Return the SilverkiteHoliday constants
- get_silverkite_seasonality_enum() Type[SilverkiteSeasonalityEnum]
Return the SilverkiteSeasonalityEnum constants
- get_silverkite_time_frequency_enum() Type[SilverkiteTimeFrequencyEnum]
Return the SilverkiteTimeFrequencyEnum constants
- class greykite.algo.forecast.silverkite.constants.silverkite_column.SilverkiteColumn[source]
Silverkite feature sets for sub-daily data.
- COLS_HOUR_OF_WEEK: str = 'hour_of_week'
Silverkite feature_sets_enabled key. constant hour of week effect
- COLS_WEEKEND_SEAS: str = 'is_weekend:daily_seas'
Silverkite feature_sets_enabled key. daily seasonality interaction with is_weekend
- COLS_DAY_OF_WEEK_SEAS: str = 'day_of_week:daily_seas'
Silverkite feature_sets_enabled key. daily seasonality interaction with day of week
- COLS_TREND_DAILY_SEAS: str = 'trend:is_weekend:daily_seas'
Silverkite feature_sets_enabled key. allow daily seasonality to change over time, depending on is_weekend
- COLS_EVENT_SEAS: str = 'event:daily_seas'
Silverkite feature_sets_enabled key. allow sub-daily event effects
- COLS_EVENT_WEEKEND_SEAS: str = 'event:is_weekend:daily_seas'
Silverkite feature_sets_enabled key. allow sub-daily event effect to interact with is_weekend
- COLS_DAY_OF_WEEK: str = 'day_of_week'
Silverkite feature_sets_enabled key. constant day of week effect
- COLS_TREND_WEEKEND: str = 'trend:is_weekend'
Silverkite feature_sets_enabled key. allow trend (growth, changepoints) to interact with is_weekend
- class greykite.algo.forecast.silverkite.constants.silverkite_component.SilverkiteComponentsEnum(value)[source]
Defines groupby time feature, xlabel and ylabel for Silverkite Component Plots.
- class greykite.algo.forecast.silverkite.constants.silverkite_holiday.SilverkiteHoliday[source]
Holiday constants to be used by Silverkite
- HOLIDAY_LOOKUP_COUNTRIES_AUTO = ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China')
Auto setting for the countries that contain the holidays to include in the model
- HOLIDAYS_TO_MODEL_SEPARATELY_AUTO = ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day')
Auto setting for the holidays to include in the model
- ALL_HOLIDAYS_IN_COUNTRIES = 'ALL_HOLIDAYS_IN_COUNTRIES'
Value for holidays_to_model_separately to request all holidays in the lookup countries
- HOLIDAYS_TO_INTERACT = ('Christmas Day', 'Christmas Day_minus_1', 'Christmas Day_minus_2', 'Christmas Day_plus_1', 'Christmas Day_plus_2', 'New Years Day', 'New Years Day_minus_1', 'New Years Day_minus_2', 'New Years Day_plus_1', 'New Years Day_plus_2', 'Thanksgiving', 'Thanksgiving_plus_1', 'Independence Day')
Significant holidays that may have a different daily seasonality pattern
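The entries in HOLIDAYS_TO_INTERACT follow a visible pattern: the holiday name, optionally followed by _minus_N or _plus_N for neighboring days. The sketch below generates names in that pattern; neighbor_names is a hypothetical helper and the pattern is inferred from the listed values only.

```python
def neighbor_names(holiday, pre_days, post_days):
    """Names for a holiday and its neighboring days, earliest first.

    Hypothetical helper; the suffix pattern is inferred from the
    HOLIDAYS_TO_INTERACT values listed above.
    """
    names = [f"{holiday}_minus_{d}" for d in range(pre_days, 0, -1)]
    names.append(holiday)
    names += [f"{holiday}_plus_{d}" for d in range(1, post_days + 1)]
    return names


christmas = neighbor_names("Christmas Day", pre_days=2, post_days=2)
thanksgiving = neighbor_names("Thanksgiving", pre_days=0, post_days=1)
```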
- class greykite.algo.forecast.silverkite.constants.silverkite_seasonality.SilverkiteSeasonalityEnum(value)[source]
Defines default seasonalities for Silverkite estimator. Names should match those in SeasonalityEnum. The default order for various seasonalities is stored in this enum.
- DAILY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tod', period=24.0, order=12, seas_names='daily', default_min_days=2)
tod is 0-24 time of day (tod granularity based on input data, up to second level). Requires at least two full cycles to add the seasonal term (default_min_days=2).
- WEEKLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tow', period=7.0, order=4, seas_names='weekly', default_min_days=14)
tow is 0-7 time of week (tow granularity based on input data, up to second level). order=4 for full flexibility to model daily input.
- MONTHLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tom', period=1.0, order=2, seas_names='monthly', default_min_days=60)
tom is 0-1 time of month (tom granularity based on input data, up to daily level).
- QUARTERLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='toq', period=1.0, order=5, seas_names='quarterly', default_min_days=180)
toq (continuous time of quarter) with natural period. Each day is mapped to a value in [0.0, 1.0) based on its position in the calendar quarter: (Jan1-Mar31, Apr1-Jun30, Jul1-Sep30, Oct1-Dec31). The start of each quarter is 0.0.
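One way to realize the mapping of a day to [0.0, 1.0) within its calendar quarter is sketched below. time_of_quarter is a hypothetical helper working at daily resolution; dividing the day offset by the number of days in the quarter is an assumption and may differ from the values greykite produces.

```python
from datetime import date


def time_of_quarter(day):
    """Position of a date within its calendar quarter, in [0.0, 1.0).

    Hypothetical helper, daily resolution only.
    """
    first_month = 3 * ((day.month - 1) // 3) + 1  # 1, 4, 7 or 10
    start = date(day.year, first_month, 1)
    if first_month == 10:
        next_start = date(day.year + 1, 1, 1)
    else:
        next_start = date(day.year, first_month + 3, 1)
    return (day - start).days / (next_start - start).days


quarter_start = time_of_quarter(date(2023, 4, 1))   # first day of Q2
mid_q1 = time_of_quarter(date(2023, 2, 15))         # day 45 of a 90-day quarter
quarter_end = time_of_quarter(date(2023, 3, 31))    # last day of Q1, still below 1.0
```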
- YEARLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='ct1', period=1.0, order=15, seas_names='yearly', default_min_days=548)
ct1 (continuous year) with natural period.
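The name, period and order fields above describe a Fourier series: for each k from 1 to order, one sine and one cosine term with frequency k / period. A minimal sketch follows; fourier_terms is a hypothetical helper, and the column naming scheme (sin{k}_{name}_{seas_names}) is an assumption chosen to be consistent with SEASONALITY_REGEX.

```python
import math


def fourier_terms(x, period, order, col_name, seas_name):
    """Sine and cosine terms of a Fourier series evaluated at x.

    Hypothetical helper; the column naming is an assumption.
    """
    terms = {}
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * x / period
        terms[f"sin{k}_{col_name}_{seas_name}"] = math.sin(angle)
        terms[f"cos{k}_{col_name}_{seas_name}"] = math.cos(angle)
    return terms


# Weekly seasonality as defined above: name='tow', period=7.0, order=4.
week_start = fourier_terms(0.0, period=7.0, order=4, col_name="tow", seas_name="weekly")
mid_week = fourier_terms(3.5, period=7.0, order=4, col_name="tow", seas_name="weekly")
```

An order of 4 yields 8 columns, so a larger order lets the seasonal curve take more complex shapes at the cost of more coefficients.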
- class greykite.algo.forecast.silverkite.constants.silverkite_time_frequency.SilverkiteTimeFrequencyEnum(value)[source]
Provides properties for modeling for various time frequencies in Silverkite. The enum names are the time frequencies, corresponding to the simple time frequencies in SimpleTimeFrequencyEnum.
Provides templates for SimpleSilverkiteEstimator that are pre-tuned to fit specific use cases.
A subset of these templates are recognized by ModelTemplateEnum. simple_silverkite_template also accepts any model_template name that follows the naming convention in this file. For details, see the model_template parameter in SimpleSilverkiteTemplate.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_FREQ(value)[source]
Valid values for simple silverkite template string name frequency.
- greykite.framework.templates.simple_silverkite_template_config.VALID_FREQ = ['HOURLY', 'DAILY', 'WEEKLY']
Valid non-default values for simple silverkite template string name frequency. These are the non-default frequencies recognized by
SimpleSilverkiteTemplateOptions
.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_SEAS(value)[source]
Valid values for simple silverkite template string name seasonality.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_GR(value)[source]
Valid values for simple silverkite template string name growth_term.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_CP(value)[source]
Valid values for simple silverkite template string name changepoints_dict.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_HOL(value)[source]
Valid values for simple silverkite template string name events.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_FEASET(value)[source]
Valid values for simple silverkite template string name feature_sets_enabled.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_ALGO(value)[source]
Valid values for simple silverkite template string name fit_algorithm.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_AR(value)[source]
Valid values for simple silverkite template string name autoregression.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_DSI(value)[source]
Valid values for simple silverkite template string name daily seasonality max interaction order.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_WSI(value)[source]
Valid values for simple silverkite template string name weekly seasonality max interaction order.
- class greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_COMPONENT_KEYWORDS(value)[source]
Valid values for simple silverkite template string name keywords. The names are the keywords and the values are the corresponding value enum. Can be used to create an instance of
SimpleSilverkiteTemplateOptions
.
- class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)[source]
Defines generic simple silverkite template options.
Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.
freq represents data frequency.
The other attributes stand for seasonality, growth, changepoints_dict, events, feature_sets_enabled, fit_algorithm and autoregression in ModelComponentsParam, which are used in SimpleSilverkiteTemplate.
- freq: SILVERKITE_FREQ = 'DAILY'
Valid values for simple silverkite template string name frequency. See
SILVERKITE_FREQ
.
- seas: SILVERKITE_SEAS = 'LT'
Valid values for simple silverkite template string name seasonality. See
SILVERKITE_SEAS
.
- gr: SILVERKITE_GR = 'LINEAR'
Valid values for simple silverkite template string name growth. See
SILVERKITE_GR
.
- cp: SILVERKITE_CP = 'NONE'
Valid values for simple silverkite template string name changepoints. See
SILVERKITE_CP
.
- hol: SILVERKITE_HOL = 'NONE'
Valid values for simple silverkite template string name holiday. See
SILVERKITE_HOL
.
- feaset: SILVERKITE_FEASET = 'OFF'
Valid values for simple silverkite template string name feature sets enabled. See
SILVERKITE_FEASET
.
- algo: SILVERKITE_ALGO = 'LINEAR'
Valid values for simple silverkite template string name fit algorithm. See
SILVERKITE_ALGO
.
- ar: SILVERKITE_AR = 'OFF'
Valid values for simple silverkite template string name autoregression. See
SILVERKITE_AR
.
- dsi: SILVERKITE_DSI = 'AUTO'
Valid values for simple silverkite template string name max daily seasonality interaction order. See
SILVERKITE_DSI
.
- wsi: SILVERKITE_WSI = 'AUTO'
Valid values for simple silverkite template string name max weekly seasonality interaction order. See
SILVERKITE_WSI
.
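To make the relationship between a template string name and these options concrete, the sketch below parses a name into an options object. It assumes a naming convention of the form FREQ followed by KEYWORD_VALUE pairs (for example DAILY_SEAS_NM_CP_LT), with omitted keywords keeping the defaults listed above; see the model_template parameter in SimpleSilverkiteTemplate for the authoritative format. OptionsDemo and parse_template_name are hypothetical and do not validate the option values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptionsDemo:
    """Hypothetical mirror of the option defaults listed above."""
    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"


KEYWORDS = ("SEAS", "GR", "CP", "HOL", "FEASET", "ALGO", "AR", "DSI", "WSI")


def parse_template_name(name):
    """Parse 'FREQ_KEY_VALUE_KEY_VALUE...' into OptionsDemo (assumed format)."""
    tokens = name.split("_")
    updates = {"freq": tokens[0]}
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"expected KEYWORD_VALUE pairs in {name!r}")
    for key, value in zip(rest[::2], rest[1::2]):
        if key not in KEYWORDS:
            raise ValueError(f"unknown keyword {key!r} in {name!r}")
        updates[key.lower()] = value
    return replace(OptionsDemo(), **updates)


parsed = parse_template_name("DAILY_SEAS_NM_CP_LT")
hourly_defaults = parse_template_name("HOURLY")
```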
- greykite.framework.templates.simple_silverkite_template_config.COMMON_MODELCOMPONENTPARAM_PARAMETERS = {'ALGO': {'LASSO': {'fit_algorithm': 'lasso', 'fit_algorithm_params': None}, 'LINEAR': {'fit_algorithm': 'linear', 'fit_algorithm_params': None}, 'RIDGE': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'SGD': {'fit_algorithm': 'sgd', 'fit_algorithm_params': None}}, 'AR': {'AUTO': {'autoreg_dict': 'auto', 'fast_simulation': False, 'simulation_num': 10}, 'OFF': {'autoreg_dict': None, 'fast_simulation': False, 'simulation_num': 10}}, 'CP': {'DAILY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.3, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '90D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.6, 'resample_freq': '7D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.5, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'NONE': None}, 'HOURLY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.3, 'resample_freq': 'D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '7D', 'regularization_strength': 0.6, 'resample_freq': 'D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '30D', 'potential_changepoint_distance': '15D', 'regularization_strength': 0.5, 'resample_freq': 'D', 'yearly_seasonality_change_freq': '365D', 
'yearly_seasonality_order': 15}, 'NONE': None}, 'WEEKLY': {'HV': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.3, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'LT': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.6, 'resample_freq': '7D', 'yearly_seasonality_change_freq': None, 'yearly_seasonality_order': 15}, 'NM': {'method': 'auto', 'no_changepoint_distance_from_end': '180D', 'potential_changepoint_distance': '14D', 'regularization_strength': 0.5, 'resample_freq': '7D', 'yearly_seasonality_change_freq': '365D', 'yearly_seasonality_order': 15}, 'NONE': None}}, 'DSI': {'DAILY': {'AUTO': 0, 'OFF': 0}, 'HOURLY': {'AUTO': 5, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}, 'FEASET': {'AUTO': 'auto', 'OFF': False, 'ON': True}, 'GR': {'LINEAR': {'growth_term': 'linear'}, 'NONE': {'growth_term': None}}, 'HOL': {'NONE': {'auto_holiday': False, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': [], 'holiday_post_num_days': 0, 'holiday_pre_num_days': 0, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': []}, 'SP1': {'auto_holiday': False, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 1, 'holiday_pre_num_days': 1, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': 'auto'}, 'SP2': {'auto_holiday': False, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 2, 'holiday_pre_num_days': 2, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': 'auto'}, 'SP4': {'auto_holiday': False, 'daily_event_df_dict': None, 
'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 4, 'holiday_pre_num_days': 4, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': 'auto'}, 'TG': {'auto_holiday': False, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None, 'holiday_lookup_countries': 'auto', 'holiday_post_num_days': 3, 'holiday_pre_num_days': 3, 'holiday_pre_post_num_dict': None, 'holidays_to_model_separately': []}}, 'SEAS': {'DAILY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 4, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 6, 'weekly_seasonality': 4, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 2, 'quarterly_seasonality': 3, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 'weekly_seasonality': 3, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 0}}, 'HOURLY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 12, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 6, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 12, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 
'weekly_seasonality': 6, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 5, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 5, 'monthly_seasonality': 2, 'quarterly_seasonality': 2, 'weekly_seasonality': 3, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 8, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 4, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 8, 'monthly_seasonality': 3, 'quarterly_seasonality': 3, 'weekly_seasonality': 4, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 0}}, 'WEEKLY': {'HV': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 25}, 'HVQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 4, 'quarterly_seasonality': 4, 'weekly_seasonality': 0, 'yearly_seasonality': 25}, 'LT': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 8}, 'LTQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 2, 'quarterly_seasonality': 2, 'weekly_seasonality': 0, 'yearly_seasonality': 8}, 'NM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 'weekly_seasonality': 0, 'yearly_seasonality': 15}, 'NMQM': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 3, 'quarterly_seasonality': 3, 'weekly_seasonality': 0, 'yearly_seasonality': 15}, 'NONE': {'auto_seasonality': False, 'daily_seasonality': 0, 'monthly_seasonality': 0, 'quarterly_seasonality': 0, 
'weekly_seasonality': 0, 'yearly_seasonality': 0}}}, 'WSI': {'DAILY': {'AUTO': 2, 'OFF': 0}, 'HOURLY': {'AUTO': 2, 'OFF': 0}, 'WEEKLY': {'AUTO': 0, 'OFF': 0}}}
Defines the default component values for SimpleSilverkiteTemplate. The components include seasonality, growth, holiday, trend changepoints, feature sets, autoregression, fit algorithm, etc. These are used when config.model_template provides the SimpleSilverkiteTemplateOptions.
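The nested layout of this dictionary can be read as a two- or three-level lookup: component keyword, then frequency for the frequency-dependent components (SEAS, CP, DSI, WSI in the listing above), then the option value. The sketch below works on a small excerpt copied from the listing; resolve and COMPONENTS_EXCERPT are hypothetical names, and this is not how greykite itself performs the lookup.

```python
# Small excerpt copied from COMMON_MODELCOMPONENTPARAM_PARAMETERS above.
COMPONENTS_EXCERPT = {
    "SEAS": {
        "DAILY": {"LT": {"auto_seasonality": False, "daily_seasonality": 0,
                         "monthly_seasonality": 0, "quarterly_seasonality": 0,
                         "weekly_seasonality": 3, "yearly_seasonality": 8}},
        "HOURLY": {"LT": {"auto_seasonality": False, "daily_seasonality": 5,
                          "monthly_seasonality": 0, "quarterly_seasonality": 0,
                          "weekly_seasonality": 3, "yearly_seasonality": 8}},
    },
    "DSI": {"DAILY": {"AUTO": 0, "OFF": 0}, "HOURLY": {"AUTO": 5, "OFF": 0}},
    "ALGO": {"LINEAR": {"fit_algorithm": "linear", "fit_algorithm_params": None},
             "RIDGE": {"fit_algorithm": "ridge", "fit_algorithm_params": None}},
}

# Components whose values are keyed by frequency in the listing above.
FREQ_DEPENDENT = {"SEAS", "CP", "DSI", "WSI"}


def resolve(component, option, freq):
    """Look up the component values for an option, using freq where needed."""
    table = COMPONENTS_EXCERPT[component]
    if component in FREQ_DEPENDENT:
        table = table[freq]
    return table[option]


hourly_lt = resolve("SEAS", "LT", freq="HOURLY")
daily_dsi = resolve("DSI", "AUTO", freq="DAILY")
ridge = resolve("ALGO", "RIDGE", freq="DAILY")  # freq is ignored for ALGO
```

The same option name can therefore resolve to different settings depending on frequency: the LT seasonality uses a daily order of 5 for hourly data but 0 for daily data.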
- greykite.framework.templates.simple_silverkite_template_config.SILVERKITE = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None})
Defines the SILVERKITE template. Contains automatic growth, seasonality, holidays, autoregression and interactions. Uses the “zero_to_one” normalization method. Best for hourly and daily frequencies. Uses SimpleSilverkiteEstimator.
- greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_MONTHLY = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None})
Defines the SILVERKITE_MONTHLY template. Contains automatic growth. Seasonality is modeled via the categorical variable “month”. Includes aggregated autoregression. Simulation is needed when the forecast horizon is greater than 1. Uses the “zero_to_one” normalization method, as shown in the value above. Uses SimpleSilverkiteEstimator.
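The aggregated autoregression term in this template appears as `agg_lag_dict={'orders_list': [[1, 2, 3]]}` and as the predictor `y_avglag_1_2_3`. A minimal sketch of what such a feature could look like, assuming it is the plain average of the values 1, 2 and 3 periods back (the helper `average_of_lags` is hypothetical and not part of greykite):

```python
from typing import List, Optional, Sequence


def average_of_lags(values: Sequence[float], orders: Sequence[int]) -> List[Optional[float]]:
    """Hypothetical helper: average of the values at the given lag orders.

    Positions without enough history are returned as None.
    """
    max_order = max(orders)
    result: List[Optional[float]] = []
    for t in range(len(values)):
        if t < max_order:
            # Not enough history to look back `max_order` periods.
            result.append(None)
        else:
            result.append(sum(values[t - k] for k in orders) / len(orders))
    return result


monthly_values = [10.0, 12.0, 14.0, 16.0, 18.0]
avglag_1_2_3 = average_of_lags(monthly_values, [1, 2, 3])
print(avglag_1_2_3)
```

Because the feature at time t depends on the observed values at t-1, t-2 and t-3, forecasting more than one step ahead requires values that are not yet observed, which is consistent with the note that simulation is needed for horizons greater than 1.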
- greykite.framework.templates.simple_silverkite_template_config.SILVERKITE_DAILY_1 = ['SILVERKITE_DAILY_1_CONFIG_1', 'SILVERKITE_DAILY_1_CONFIG_2', 'SILVERKITE_DAILY_1_CONFIG_3']
Defines the SILVERKITE_DAILY_1 template, which contains 3 candidate configs for grid search, optimized for the seasonality and changepoint parameters. Best for 1-day forecasts of daily time series. Uses SimpleSilverkiteEstimator.
- greykite.framework.templates.simple_silverkite_template_config.MULTI_TEMPLATES = {'SILVERKITE_DAILY_1': ['SILVERKITE_DAILY_1_CONFIG_1', 'SILVERKITE_DAILY_1_CONFIG_2', 'SILVERKITE_DAILY_1_CONFIG_3'], 'SILVERKITE_DAILY_90': ['DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_NONE_HOL_SP2_FEASET_AUTO_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_LTQM_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'DAILY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO'], 'SILVERKITE_HOURLY_1': ['SILVERKITE', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_OFF_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_168': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_RIDGE_AR_OFF'], 'SILVERKITE_HOURLY_24': ['HOURLY_SEAS_LT_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_LT_GR_LINEAR_CP_NONE_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_OFF_ALGO_LINEAR_AR_AUTO', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_AUTO'], 'SILVERKITE_HOURLY_336': ['HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP2_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_LT_GR_LINEAR_CP_LT_HOL_SP4_FEASET_AUTO_ALGO_RIDGE_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_OFF', 'HOURLY_SEAS_NM_GR_LINEAR_CP_NM_HOL_SP1_FEASET_AUTO_ALGO_LINEAR_AR_AUTO'], 'SILVERKITE_WEEKLY': ['WEEKLY_SEAS_NM_GR_LINEAR_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_NM_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_AUTO_WSI_AUTO', 
'WEEKLY_SEAS_HV_GR_LINEAR_CP_NM_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO', 'WEEKLY_SEAS_HV_GR_LINEAR_CP_LT_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO']}
A dictionary of multi templates.
Keys are the available multi template names (valid strings for config.model_template).
Values correspond to a list of ModelComponentsParam.
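Many of the entries listed in MULTI_TEMPLATES are strings that follow a visible pattern: a frequency, then keyword/value pairs such as `SEAS_NM` or `ALGO_RIDGE`. The sketch below splits such a name into its parts. It is a hypothetical helper written for illustration, not a greykite function, and it assumes that values never contain underscores, which holds for the names shown above.

```python
from typing import Dict

# Frequencies and keywords inferred from the names listed above.
FREQUENCIES = ("HOURLY", "DAILY", "WEEKLY")
KEYWORDS = ("SEAS", "GR", "CP", "HOL", "FEASET", "ALGO", "AR", "DSI", "WSI")


def parse_template_name(name: str) -> Dict[str, str]:
    """Hypothetical helper (not part of greykite): split a template name into parts."""
    tokens = name.split("_")
    if tokens[0] not in FREQUENCIES:
        raise ValueError(f"{name!r} does not start with a known frequency")
    if len(tokens) % 2 == 0:
        raise ValueError(f"{name!r} has a keyword without a value")
    parsed = {"FREQ": tokens[0]}
    # After the frequency, tokens alternate between keyword and value.
    for keyword, value in zip(tokens[1::2], tokens[2::2]):
        if keyword not in KEYWORDS:
            raise ValueError(f"unexpected keyword {keyword!r} in {name!r}")
        parsed[keyword] = value
    return parsed


parsed_weekly = parse_template_name(
    "WEEKLY_SEAS_HV_GR_LINEAR_CP_NM_HOL_NONE_FEASET_OFF_ALGO_RIDGE_AR_OFF_DSI_AUTO_WSI_AUTO"
)
print(parsed_weekly["SEAS"], parsed_weekly["ALGO"])
```

Named templates such as `SILVERKITE` or `SILVERKITE_DAILY_1_CONFIG_1` do not follow this pattern, so the sketch rejects them with a `ValueError`.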
- greykite.framework.templates.simple_silverkite_template_config.SINGLE_MODEL_TEMPLATE_TYPE
Types accepted by SimpleSilverkiteTemplate for config.model_template for a single template.
Alias of Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions].
- class greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateConstants(COMMON_MODELCOMPONENTPARAM_PARAMETERS: ~typing.Dict = <factory>, MULTI_TEMPLATES: ~typing.Dict = <factory>, SILVERKITE: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None}), SILVERKITE_MONTHLY: 
~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_1: ~typing.Union[str, 
~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.809, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '7D', 'yearly_seasonality_order': 8, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 7, 'weekly_seasonality': 1, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_2: ~typing.Union[str, 
~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.624, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '17D', 'yearly_seasonality_order': 1, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 1, 'quarterly_seasonality': 0, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), SILVERKITE_DAILY_1_CONFIG_3: 
~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.59, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '8D', 'yearly_seasonality_order': 40, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 40, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 2, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None}), 
SILVERKITE_COMPONENT_KEYWORDS: ~typing.Type[~enum.Enum] = <enum 'SILVERKITE_COMPONENT_KEYWORDS'>, SILVERKITE_EMPTY: ~typing.Union[str, ~greykite.framework.templates.autogen.forecast_config.ModelComponentsParam, ~greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions] = 'DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF', VALID_FREQ: ~typing.List = <factory>, SimpleSilverkiteTemplateOptions: ~dataclasses.dataclass = <class 'greykite.framework.templates.simple_silverkite_template_config.SimpleSilverkiteTemplateOptions'>)[source]
Constants used by SimpleSilverkiteTemplate. Includes the model templates and their default values.
mutable_field is used when the default value is a mutable type like dict and list. Dataclass requires mutable default values to be wrapped in ‘default_factory’, so that instances of this dataclass cannot accidentally modify the default value. mutable_field wraps the constant accordingly.
- COMMON_MODELCOMPONENTPARAM_PARAMETERS: Dict
Defines the default component values for SimpleSilverkiteTemplate. The components include seasonality, growth, holiday, trend changepoints, feature sets, autoregression, fit algorithm, etc. These are used when config.model_template provides the SimpleSilverkiteTemplateOptions.
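Dict-valued constants such as COMMON_MODELCOMPONENTPARAM_PARAMETERS show `<factory>` as their default in the class signature because of the `default_factory` wrapping described in the class notes. The sketch below illustrates the idea with the standard library only. The helper `copied_default` is a stand-in written for this example; the actual implementation of `mutable_field` is not shown in this reference and may differ.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List

# Example data only; not the real template constants.
DEFAULT_TEMPLATES = {"MY_MULTI": ["CONFIG_A", "CONFIG_B"]}


def copied_default(value):
    """Stand-in helper: each instance receives its own deep copy of `value`."""
    return field(default_factory=lambda: copy.deepcopy(value))


@dataclass
class ExampleConstants:
    # Writing `= DEFAULT_TEMPLATES` directly would raise ValueError at class
    # creation, because dataclasses reject mutable defaults such as dict.
    MULTI_TEMPLATES: Dict[str, List[str]] = copied_default(DEFAULT_TEMPLATES)


first = ExampleConstants()
second = ExampleConstants()
first.MULTI_TEMPLATES["MY_MULTI"].append("CONFIG_C")

# The change stays local to `first`.
print(second.MULTI_TEMPLATES)
print(DEFAULT_TEMPLATES)
```

The deep copy is a design choice in this sketch: it protects nested lists inside the dictionary as well as the dictionary itself.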
- MULTI_TEMPLATES: Dict
A dictionary of multi templates.
Keys are the available multi template names (valid strings for config.model_template).
Values correspond to a list of ModelComponentsParam.
- SILVERKITE: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'yearly_seasonality_order': 15, 'resample_freq': '3D', 'regularization_strength': 0.6, 'actual_changepoint_min_distance': '30D', 'potential_changepoint_distance': '15D', 'no_changepoint_distance_from_end': '90D'}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': 'auto', 'holiday_lookup_countries': 'auto', 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 'auto', 'quarterly_seasonality': 'auto', 'monthly_seasonality': 'auto', 'weekly_seasonality': 'auto', 'daily_seasonality': 'auto'}, uncertainty={'uncertainty_dict': None})
Defines the "SILVERKITE" template. Contains automatic growth, seasonality, holidays, autoregression and interactions. Uses the “zero_to_one” normalization method. Best for hourly and daily frequencies. Uses SimpleSilverkiteEstimator.
- SILVERKITE_MONTHLY: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': {'lag_dict': None, 'agg_lag_dict': {'orders_list': [[1, 2, 3]]}}, 'simulation_num': 50, 'fast_simulation': True}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'regularization_strength': 0.6, 'resample_freq': '28D', 'potential_changepoint_distance': '180D', 'potential_changepoint_n_max': 100, 'actual_changepoint_min_distance': '730D', 'no_changepoint_distance_from_end': '180D', 'yearly_seasonality_order': 6}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': False, 'max_daily_seas_interaction_order': 0, 'max_weekly_seas_interaction_order': 0, 'extra_pred_cols': ['y_avglag_1_2_3*C(month, levels=list(range(1, 13)))', 'C(month, levels=list(range(1, 13)))'], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': [], 'holiday_lookup_countries': [], 'holiday_pre_num_days': 0, 'holiday_post_num_days': 0, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': False, 'quarterly_seasonality': False, 'monthly_seasonality': False, 'weekly_seasonality': False, 'daily_seasonality': False}, uncertainty={'uncertainty_dict': None})
Defines the SILVERKITE_MONTHLY template. Best for monthly forecasts. Uses SimpleSilverkiteEstimator.
- SILVERKITE_DAILY_1_CONFIG_1: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.809, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '7D', 'yearly_seasonality_order': 8, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 8, 'quarterly_seasonality': 0, 'monthly_seasonality': 7, 'weekly_seasonality': 1, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None})
Config 1 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecasts.
- SILVERKITE_DAILY_1_CONFIG_2: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.624, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '17D', 'yearly_seasonality_order': 1, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 1, 'quarterly_seasonality': 0, 'monthly_seasonality': 4, 'weekly_seasonality': 6, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None})
Config 2 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecasts.
- SILVERKITE_DAILY_1_CONFIG_3: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = ModelComponentsParam(autoregression={'autoreg_dict': 'auto', 'simulation_num': 10, 'fast_simulation': False}, changepoints={'auto_growth': False, 'changepoints_dict': {'method': 'auto', 'resample_freq': '7D', 'regularization_strength': 0.59, 'potential_changepoint_distance': '7D', 'no_changepoint_distance_from_end': '8D', 'yearly_seasonality_order': 40, 'yearly_seasonality_change_freq': None}, 'seasonality_changepoints_dict': None}, custom={'fit_algorithm_dict': {'fit_algorithm': 'ridge', 'fit_algorithm_params': None}, 'feature_sets_enabled': 'auto', 'max_daily_seas_interaction_order': 5, 'max_weekly_seas_interaction_order': 2, 'extra_pred_cols': [], 'drop_pred_cols': None, 'explicit_pred_cols': None, 'min_admissible_value': None, 'max_admissible_value': None, 'regression_weight_col': None, 'normalize_method': 'zero_to_one', 'remove_intercept': False}, events={'auto_holiday': False, 'holidays_to_model_separately': ("New Year's Day", 'Chinese New Year', 'Christmas Day', 'Independence Day', 'Thanksgiving', 'Labor Day', 'Good Friday', 'Easter Monday [England, Wales, Northern Ireland]', 'Memorial Day', 'Veterans Day'), 'holiday_lookup_countries': ('UnitedStates', 'UnitedKingdom', 'India', 'France', 'China'), 'holiday_pre_num_days': 2, 'holiday_post_num_days': 2, 'holiday_pre_post_num_dict': None, 'daily_event_df_dict': None, 'daily_event_neighbor_impact': None, 'daily_event_shifted_effect': None}, growth={'growth_term': 'linear'}, hyperparameter_override=None, regressors={'regressor_cols': []}, lagged_regressors={'lagged_regressor_dict': None}, seasonality={'auto_seasonality': False, 'yearly_seasonality': 40, 'quarterly_seasonality': 0, 'monthly_seasonality': 0, 'weekly_seasonality': 2, 'daily_seasonality': 0}, uncertainty={'uncertainty_dict': None})
Config 3 in template SILVERKITE_DAILY_1. Compared to SILVERKITE, it adds change points and uses parameters specifically tuned for daily data and 1-day forecasts.
- class SILVERKITE_COMPONENT_KEYWORDS(value)
Valid values for simple silverkite template string name keywords. The names are the keywords and the values are the corresponding value enum. Can be used to create an instance of SimpleSilverkiteTemplateOptions.
- SILVERKITE_EMPTY: Union[str, ModelComponentsParam, SimpleSilverkiteTemplateOptions] = 'DAILY_SEAS_NONE_GR_NONE_CP_NONE_HOL_NONE_FEASET_OFF_ALGO_LINEAR_AR_OFF_DSI_OFF_WSI_OFF'
Defines the "SILVERKITE_EMPTY" template. Everything here is None or off.
- VALID_FREQ: List
Valid non-default values for simple silverkite template string name frequency. See SimpleSilverkiteTemplateOptions.
- class SimpleSilverkiteTemplateOptions(freq: SILVERKITE_FREQ = SILVERKITE_FREQ.DAILY, seas: SILVERKITE_SEAS = SILVERKITE_SEAS.LT, gr: SILVERKITE_GR = SILVERKITE_GR.LINEAR, cp: SILVERKITE_CP = SILVERKITE_CP.NONE, hol: SILVERKITE_HOL = SILVERKITE_HOL.NONE, feaset: SILVERKITE_FEASET = SILVERKITE_FEASET.OFF, algo: SILVERKITE_ALGO = SILVERKITE_ALGO.LINEAR, ar: SILVERKITE_AR = SILVERKITE_AR.OFF, dsi: SILVERKITE_DSI = SILVERKITE_DSI.AUTO, wsi: SILVERKITE_WSI = SILVERKITE_WSI.AUTO)
Defines generic simple silverkite template options. Attributes can be set to different values using SILVERKITE_COMPONENT_KEYWORDS for high level tuning.
- algo: SILVERKITE_ALGO = 'LINEAR'
Valid values for simple silverkite template string name fit algorithm. See SILVERKITE_ALGO.
- ar: SILVERKITE_AR = 'OFF'
Valid values for simple silverkite template string name autoregression. See SILVERKITE_AR.
- cp: SILVERKITE_CP = 'NONE'
Valid values for simple silverkite template string name changepoints. See SILVERKITE_CP.
- dsi: SILVERKITE_DSI = 'AUTO'
Valid values for simple silverkite template string name max daily seasonality interaction order. See SILVERKITE_DSI.
- feaset: SILVERKITE_FEASET = 'OFF'
Valid values for simple silverkite template string name feature sets enabled. See SILVERKITE_FEASET.
- freq: SILVERKITE_FREQ = 'DAILY'
Valid values for simple silverkite template string name frequency. See SILVERKITE_FREQ.
- gr: SILVERKITE_GR = 'LINEAR'
Valid values for simple silverkite template string name growth. See SILVERKITE_GR.
- hol: SILVERKITE_HOL = 'NONE'
Valid values for simple silverkite template string name holiday. See SILVERKITE_HOL.
- seas: SILVERKITE_SEAS = 'LT'
Valid values for simple silverkite template string name seasonality. See SILVERKITE_SEAS.
- wsi: SILVERKITE_WSI = 'AUTO'
Valid values for simple silverkite template string name max weekly seasonality interaction order. See SILVERKITE_WSI.
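The option fields above line up with the keyword/value pairs that appear in template string names such as SILVERKITE_EMPTY. The sketch below mirrors the fields and their defaults with plain strings and joins them into a name. `TemplateOptionsSketch` and `to_template_name` are hypothetical stand-ins written for illustration; the real class uses enum members, and whether the library renders names exactly this way is an assumption based on the names listed in this reference.

```python
from dataclasses import dataclass, fields


@dataclass
class TemplateOptionsSketch:
    """Stand-in for SimpleSilverkiteTemplateOptions, using plain strings."""
    freq: str = "DAILY"
    seas: str = "LT"
    gr: str = "LINEAR"
    cp: str = "NONE"
    hol: str = "NONE"
    feaset: str = "OFF"
    algo: str = "LINEAR"
    ar: str = "OFF"
    dsi: str = "AUTO"
    wsi: str = "AUTO"


def to_template_name(options: TemplateOptionsSketch) -> str:
    """Join the frequency and the KEYWORD_VALUE pairs in declaration order."""
    parts = [options.freq]
    for f in fields(options):
        if f.name == "freq":
            continue
        parts.extend([f.name.upper(), getattr(options, f.name)])
    return "_".join(parts)


default_name = to_template_name(TemplateOptionsSketch())
empty_name = to_template_name(
    TemplateOptionsSketch(seas="NONE", gr="NONE", dsi="OFF", wsi="OFF")
)
print(default_name)
print(empty_name)
```

With seasonality, growth and both interaction orders switched off, the sketch reproduces the string shown for SILVERKITE_EMPTY.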
EasyConfig
- class greykite.algo.common.seasonality_inferrer.SeasonalityInferrer[source]
A class to infer appropriate Fourier series orders in different seasonality components.
The method allows users to:
optionally remove the trend with different methods (available methods are in TrendAdjustMethodEnum);
optionally aggregate the data;
fit the seasonality component with different Fourier series orders;
calculate the AIC/BIC of the fits;
choose the most appropriate order with AIC or BIC and an optional tolerance;
plot the investigations.
- df
The input timeseries.
- Type
pandas.DataFrame or None
- time_col
The column name for timestamps in df.
- Type
str or None
- value_col
The column name for values in df.
- Type
str or None
- fourier_series_orders
The inferred Fourier series orders. The keys are the seasonality component names. The values are the inferred best orders according to the config.
- Type
dict or None
- df_features
The cached dataframe with time features. Building this dataframe is slow for large datasets, so it is cached the first time it is built and reused afterwards.
- Type
pandas.DataFrame or None
- infer_fourier_series_order(df: DataFrame, configs: List[SeasonalityInferConfig], time_col: str = 'ts', value_col: str = 'y', adjust_trend_method: Optional[str] = None, adjust_trend_param: Optional[dict] = None, fit_algorithm: Optional[str] = None, tolerance: Optional[float] = None, plotting: Optional[bool] = None, aggregation_period: Optional[str] = None, offset: Optional[int] = None, criterion: Optional[str] = None) dict [source]
Infers the most appropriate Fourier series order. Can infer multiple seasonality components with multiple configs at the same time. The configurations for each component are passed as a list of SeasonalityInferConfig objects. To override a parameter for all configs, pass it via this function’s parameter.
For each seasonality component, the method first does an optional trend removal via grouped average or spline fit. For example, for yearly seasonality, one option is to remove the average of each year from the time series. The seasonality pattern is clearer and dominates after the trend removal.
Next it does an optional aggregation to emphasize the current seasonality. For example, for yearly seasonality, it can do a weekly aggregation so that the weekly seasonality won’t be mixed when modeling yearly seasonality.
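The two preprocessing steps described above can be sketched with the standard library. This is an illustration only: the function names are hypothetical, the library's own implementation is not shown in this reference, and using the mean as the weekly aggregate is an assumption.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import List, Tuple

Row = Tuple[date, float]


def remove_yearly_mean(rows: List[Row]) -> List[Row]:
    """Grouped-average trend removal: subtract each year's mean from its values."""
    by_year = defaultdict(list)
    for day, value in rows:
        by_year[day.year].append(value)
    means = {year: sum(vals) / len(vals) for year, vals in by_year.items()}
    return [(day, value - means[day.year]) for day, value in rows]


def aggregate_weekly(rows: List[Row]) -> List[Row]:
    """Average the values within each week, keyed by the week's Monday."""
    buckets = defaultdict(list)
    for day, value in rows:
        week_start = day - timedelta(days=day.weekday())
        buckets[week_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


rows = [
    (date(2020, 1, 6), 1.0),
    (date(2020, 1, 7), 3.0),
    (date(2020, 1, 13), 8.0),
    (date(2021, 1, 4), 10.0),
    (date(2021, 1, 5), 14.0),
]
detrended = remove_yearly_mean(rows)
weekly = aggregate_weekly(detrended)
print(weekly)
```

After the yearly means are removed, the level difference between 2020 and 2021 no longer dominates, and the weekly averages hide within-week variation so that it does not mix with the yearly pattern.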
Then it fits a seasonality model using Fourier series with orders up to a certain max_order, and computes the AIC/BIC of the models.
The final order will be selected based on the criterion with a tolerance adjustment. A pre-specified offset can also be added to the selected order for adjustment.
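The selection step can be sketched as follows, assuming the criterion values for each candidate order have already been computed. `select_order` is a hypothetical helper, not the library's code. Two details are assumptions made for this sketch: the threshold is the minimum plus `tolerance` times its absolute value, and a negative result after the offset is clipped to zero.

```python
from typing import Dict


def select_order(criterion_by_order: Dict[int, float], tolerance: float = 0.0, offset: int = 0) -> int:
    """Pick the smallest order whose criterion is within `tolerance` of the minimum."""
    best = min(criterion_by_order.values())
    # Assumption: relative tolerance measured against the magnitude of the minimum.
    threshold = best + abs(best) * tolerance
    selected = min(
        order for order, value in criterion_by_order.items() if value <= threshold
    )
    # Assumption: orders cannot be negative, so clip after applying the offset.
    return max(0, selected + offset)


# Example AIC values per Fourier series order (made up for illustration).
aic = {1: 150.0, 2: 109.0, 3: 101.0, 4: 100.0, 5: 103.0}

strict_order = select_order(aic)
tolerant_order = select_order(aic, tolerance=0.1)
shifted_order = select_order(aic, tolerance=0.1, offset=1)
print(strict_order, tolerant_order, shifted_order)
```

With a minimum AIC of 100 and a tolerance of 0.1, any order with AIC up to 110 qualifies, so the smaller order 2 is preferred over the minimizing order 4.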
- Parameters
df (pandas.DataFrame) – The input timeseries.
configs (list [SeasonalityInferConfig]) – A list of SeasonalityInferConfig objects. Each element corresponds to the config for a seasonality component. For example, if you would like to infer seasonality orders for yearly seasonality and weekly seasonality, you need to provide a list of two configs.
time_col (str) – The column name for timestamps in df.
value_col (str) – The column name for values in df.
adjust_trend_method (str or None, default None) – The method used to adjust trend. Supported methods are in TrendAdjustMethodEnum. If not None, value is used to override all configs.
adjust_trend_param (dict or None, default None) – Additional parameters for adjusting trend. For valid options, see _adjust_trend. If not None, value is used to override all configs.
fit_algorithm (str or None, default None) – The algorithm used to fit the seasonality. Supported algorithms are “linear”, “ridge” and “sgd”. If not None, value is used to override all configs.
plotting (bool or None, default None) – Whether to generate plots. If True, the returned dictionary will have the plot under the “fig” key. Can be turned off to speed up the process. If not None, value is used to override all configs.
tolerance (float or None, default None) – A tolerance on the criterion to allow a smaller order. For example, if AIC’s minimum is 100 and tolerance is 0.1, then the function will find the smallest order that has AIC less than or equal to 110. If not None, value is used to override all configs.
aggregation_period (str or None, default None) – The aggregation period applied before fitting the Fourier series. Aggregating to eliminate shorter seasonal periods may help get more accurate orders, but make sure the number of observations after aggregation is sufficient (at least 2 * max_order + 1 to have a unique solution for the regression problem). If not None, value is used to override all configs.
offset (int or None, default None) – The offset order to be added to the inferred orders. The orders after applying offsets cannot be negative. If not None, value is used to override all configs.
criterion (str or None, default None) – The criterion used to pick the most appropriate orders. If not None, value is used to override all configs.
- Returns
result –
The result dictionary with the following keys:
“result”: a list of result dictionaries from the inferring methods. The keys are:
“seas_name”: the seasonality name.
“orders”: the Fourier series orders fitted.
“aics”: the fitted AICs.
“bics”: the fitted BICs.
“best_aic_order”: the order corresponding to the best feasible AIC.
“best_bic_order”: the order corresponding to the best feasible BIC.
“fig”: the diagnostic figure.
“best_orders”: a dictionary of seasonality component names and their inferred Fourier series orders.
- Return type
dict
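The selection rule described above (smallest order within the tolerance of the best criterion, then the offset) can be sketched in plain Python. This is an illustration, not the library's implementation: the function name `select_order`, the threshold formula `minimum + tolerance * |minimum|`, and the clamping of the adjusted order at zero are assumptions chosen to match the AIC 100 / tolerance 0.1 / threshold 110 example.

```python
def select_order(orders, criteria, tolerance=0.0, offset=0):
    """Pick the smallest order whose criterion is within tolerance of the minimum."""
    best = min(criteria)
    # Assumed threshold rule: the minimum plus tolerance times its magnitude.
    threshold = best + tolerance * abs(best)
    feasible = [order for order, value in zip(orders, criteria) if value <= threshold]
    # Assumption: clamp at zero so the adjusted order is never negative.
    return max(min(feasible) + offset, 0)


orders = [1, 2, 3, 4, 5]
aics = [150.0, 109.0, 104.0, 100.0, 101.0]  # made-up values for illustration

print(select_order(orders, aics))                            # 4
print(select_order(orders, aics, tolerance=0.1))             # 2
print(select_order(orders, aics, tolerance=0.1, offset=1))   # 3
```

With a tolerance of 0.1 the threshold is 110, so order 2 (AIC 109) is preferred over the strict minimum at order 4.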
- class greykite.algo.common.seasonality_inferrer.TrendAdjustMethodEnum(value)[source]
The methods that are available for adjusting trend in infer_fourier_series_order.
- seasonal_average = 'seasonal_average'
Calculates the average within each seasonal period and removes it.
- overall_average = 'overall_average'
Calculates the average of the whole timeseries and removes it.
- spline_fit = 'spline_fit'
Fits a spline with no knots (polynomial) with a certain degree and removes it.
- none = 'none'
Does not adjust trend.
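As a rough illustration of the two averaging methods, the sketch below removes a per-period average and an overall average from a short list of values. The helper names and the sample data are hypothetical, and the library's own implementation works on dataframes and accepts further parameters.

```python
from collections import defaultdict
from statistics import mean


def remove_seasonal_average(values, period_labels):
    """Subtract from each value the mean of its own seasonal period (e.g. its year)."""
    groups = defaultdict(list)
    for value, label in zip(values, period_labels):
        groups[label].append(value)
    averages = {label: mean(group) for label, group in groups.items()}
    return [value - averages[label] for value, label in zip(values, period_labels)]


def remove_overall_average(values):
    """Subtract the mean of the whole series from every value."""
    overall = mean(values)
    return [value - overall for value in values]


values = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]   # made-up observations
years = [2020, 2020, 2020, 2021, 2021, 2021]    # seasonal period of each observation

print(remove_seasonal_average(values, years))  # [-2.0, 0.0, 2.0, -2.0, 0.0, 2.0]
print(remove_overall_average(values))          # [-7.0, -5.0, -3.0, 3.0, 5.0, 7.0]
```

Removing the yearly average leaves the same within-year pattern in both years, which is the sense in which the seasonality "dominates" after trend removal.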
- class greykite.algo.common.seasonality_inferrer.SeasonalityInferConfig(seas_name: str, col_name: str, period: float, max_order: int, adjust_trend_method: str = 'seasonal_average', adjust_trend_param: Optional[dict] = None, fit_algorithm: str = 'ridge', tolerance: float = 0.0, plotting: bool = False, aggregation_period: Optional[str] = None, offset: int = 0, criterion: str = 'bic')[source]
A dataclass to pass the parameters for infer_fourier_series_order.
- seas_name
Required. The seasonality component name. Will be used to distinguish the results.
- Type
str
- col_name
Required. The column name used to generate seasonality Fourier series. Must be in df or can be generated by build_time_features_df. See fourier_series_multi_func.
- Type
str
- period
Required. The period corresponding to col_name. See fourier_series_multi_func.
- Type
float
- max_order
Required. The maximum Fourier series order to fit.
- Type
int
- adjust_trend_method
The method used to adjust trend. Supported methods are in TrendAdjustMethodEnum. None defaults to “seasonal_average”, which subtracts the yearly average by default.
- Type
str or None, default “seasonal_average”
- adjust_trend_param
Additional parameters for adjusting the trend. For valid options, see _adjust_trend.
- Type
dict or None, default None
- fit_algorithm
The algorithm used to fit the seasonality. Supported algorithms are “linear”, “ridge” and “sgd”. None defaults to “ridge”.
- Type
str or None, default “ridge”
- plotting
Whether to generate plots. If True, the returned dictionary will have the plot under the “fig” key. Turn this off to speed up the process. None defaults to False.
- Type
bool or None, default False
- tolerance
A tolerance on the criterion to allow a smaller order. For example, if AIC’s minimum is 100 and tolerance is 0.1, then the function will find the smallest order that has AIC less than or equal to 110. None defaults to 0.0.
- Type
float or None, default 0.0
- aggregation_period
The aggregation period applied before fitting the Fourier series. Aggregating to eliminate shorter seasonal periods may help get more accurate orders, but make sure the number of observations after aggregation is sufficient. None corresponds to no aggregation.
- Type
str or None, default None
- offset
The offset order to be added to the inferred orders. The order after adding the offset cannot be negative.
- Type
int or None, default 0
- criterion
The criterion used to pick the most appropriate orders. Supported criteria are “aic” and “bic”. None defaults to “bic”.
- Type
str or None, default “bic”
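The override behavior ("if not None, the value is used to override all configs") can be pictured with a local stand-in dataclass. `InferConfigSketch` and `apply_overrides` are hypothetical names that only mirror the fields and defaults listed above; the column names "toy" and "tow" and the other values are illustrative. The real SeasonalityInferConfig is not imported here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class InferConfigSketch:
    """Hypothetical stand-in mirroring the SeasonalityInferConfig fields."""
    seas_name: str
    col_name: str
    period: float
    max_order: int
    adjust_trend_method: str = "seasonal_average"
    adjust_trend_param: Optional[dict] = None
    fit_algorithm: str = "ridge"
    tolerance: float = 0.0
    plotting: bool = False
    aggregation_period: Optional[str] = None
    offset: int = 0
    criterion: str = "bic"


def apply_overrides(configs, **overrides):
    """Apply every non-None override to all configs; None leaves each config as-is."""
    active = {key: value for key, value in overrides.items() if value is not None}
    return [replace(config, **active) for config in configs]


configs = [
    InferConfigSketch(seas_name="yearly", col_name="toy", period=1.0, max_order=30,
                      aggregation_period="W"),
    InferConfigSketch(seas_name="weekly", col_name="tow", period=7.0, max_order=5,
                      criterion="aic"),
]
updated = apply_overrides(configs, criterion="bic", tolerance=0.01, offset=None)
print([(c.seas_name, c.criterion, c.tolerance, c.offset) for c in updated])
```

Here `criterion` and `tolerance` replace the per-config values, while `offset=None` keeps each config's own offset.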
- class greykite.algo.common.holiday_inferrer.HolidayInferrer[source]
Implements methods to automatically infer holiday effects.
The class works for daily and sub-daily data. Sub-daily data is aggregated into daily data. It pulls holiday candidates from pypi:holidays-ext, and adds a pre-specified number of days before/after the holiday candidates as the whole holiday candidates pool. Every day in the candidate pool is compared with a pre-defined baseline imputed from surrounding days (e.g. the average of -7 and +7 days) and a score is generated to indicate deviation. The score is averaged if a holiday has multiple occurrences through the timeseries period. The holidays are ranked according to the magnitudes of the scores. Holidays are classified into:
model independently
model together
do not model
according to their score magnitudes. For example, if the sum of the absolute scores is 1000, and the threshold for independent holidays is 0.8, the method keeps adding holidays to the independent modeling list from the largest magnitude until the sum reaches 1000 x 0.8 = 800. Then it continues to count the together modeling list.
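The baseline comparison can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `holiday_score` is hypothetical, the relative score is taken to be the difference divided by the baseline, and missing baseline days are simply skipped. The library's handling of baseline days that are themselves holidays is not reproduced.

```python
from datetime import date, timedelta


def holiday_score(series, day, baseline_offsets=(-7, 7), use_relative_score=False):
    """Score one date against the average of its available baseline days."""
    baseline_values = [
        series[day + timedelta(days=offset)]
        for offset in baseline_offsets
        if day + timedelta(days=offset) in series
    ]
    if not baseline_values:
        return None  # no baseline available for this date
    baseline = sum(baseline_values) / len(baseline_values)
    difference = series[day] - baseline
    # Assumption: the relative score is the difference as a fraction of the baseline.
    return difference / baseline if use_relative_score else difference


# Made-up daily values around a holiday on 2021-07-04.
series = {
    date(2021, 6, 27): 100.0,
    date(2021, 7, 4): 84.0,
    date(2021, 7, 11): 110.0,
}
print(holiday_score(series, date(2021, 7, 4)))                           # -21.0
print(holiday_score(series, date(2021, 7, 4), use_relative_score=True))  # -0.2
```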
- baseline_offsets
The offsets in days to calculate baselines.
- Type
list [int] or None
- post_search_days
The number of days after each holiday to be counted as candidates.
- Type
int or None
- pre_search_days
The number of days before each holiday to be counted as candidates.
- Type
int or None
- independent_holiday_thres
A certain proportion of the total holiday effects that are allocated for holidays that are modeled independently. For example, 0.8 means the holidays that contribute to the first 80% of the holiday effects are modeled independently.
- Type
float or None
- together_holiday_thres
A certain proportion of the total holiday effects that are allocated for holidays that are modeled together. For example, if independent_holiday_thres is 0.8 and together_holiday_thres is 0.9, then after the first 80% of the holiday effects are counted, the rest starts to be allocated for the holidays that are modeled together until the cumulative sum exceeds 0.9.
- Type
float or None
- extra_years
Extra years after self.year_end to pull holidays in self.country_holiday_df. This can be used to cover the forecast periods.
- Type
int, default 2
- df
The timeseries after daily aggregation.
- Type
pandas.DataFrame
or None
- time_col
The column name for timestamps in df.
- Type
str or None
- value_col
The column name for values in df.
- Type
str or None
- year_start
The year of the first timeseries observation in df.
- Type
int or None
- year_end
The year of the last timeseries observation in df.
- Type
int or None
- country_holiday_df
The holidays between year_start and year_end. This is the output from pypi:holidays-ext. Duplicates are dropped. Observed holidays are merged.
- Type
pandas.DataFrame or None
- holidays
A list of holidays in country_holiday_df.
- Type
list [str] or None
- score_result
The scores from comparing holidays and their baselines. The keys are holidays. The values are a list of the scores for each occurrence.
- Type
dict [str, list [float]] or None
- score_result_avg
The scores from score_result where the values are averaged.
- Type
dict [str, float] or None
- result
The output of the model. Includes:
- “scores”: dict [str, list [float]]
The score_result from self._get_scores_for_holidays.
- “country_holiday_df”: pandas.DataFrame
The country_holiday_df from pypi:holidays_ext.
- “independent_holidays”: list [tuple [str, str]]
The holidays to be modeled independently. Each item is in (country, holiday) format.
- “together_holidays_positive”: list [tuple [str, str]]
The holidays with positive effects to be modeled together. Each item is in (country, holiday) format.
- “together_holidays_negative”: list [tuple [str, str]]
The holidays with negative effects to be modeled together. Each item is in (country, holiday) format.
- “fig”: plotly.graph_objs.Figure
The visualization if activated.
- Type
dict [str, any]
- infer_holidays(df: DataFrame, time_col: str = 'ts', value_col: str = 'y', countries: List[str] = ('US',), pre_search_days: int = 2, post_search_days: int = 2, baseline_offsets: List[int] = (-7, 7), plot: bool = False, independent_holiday_thres: float = 0.8, together_holiday_thres: float = 0.99, extra_years: int = 2, use_relative_score: bool = False) Optional[Dict[str, any]] [source]
Infers significant holidays and holiday configurations.
The class works for daily and sub-daily data. Sub-daily data is aggregated into daily data. It pulls holiday candidates from pypi:holidays-ext, and adds a pre-specified number of days before/after the holiday candidates as the whole holiday candidates pool. Every day in the candidate pool is compared with a pre-defined baseline imputed from surrounding days (e.g. the average of -7 and +7 days) and a score is generated to indicate deviation. The score is averaged if a holiday has multiple occurrences through the timeseries period. The holidays are ranked according to the magnitudes of the scores. Holidays are classified into:
model independently
model together
do not model
according to their score magnitudes. For example, if the sum of the absolute scores is 1000, and the threshold for independent holidays is 0.8, the method keeps adding holidays to the independent modeling list from the largest magnitude until the sum reaches 1000 x 0.8 = 800. Then it continues to count the together modeling list.
- Parameters
df (pd.DataFrame) – The input timeseries.
time_col (str, default TIME_COL) – The column name for timestamps in df.
value_col (str, default VALUE_COL) – The column name for values in df.
countries (list [str], default (“US”,)) – A list of countries to look up holiday candidates. Available countries can be listed with holidays_ext.get_holidays.get_available_holiday_lookup_countries(). Two-character country names are preferred.
pre_search_days (int, default 2) – The number of days to include as holiday candidates before each holiday.
post_search_days (int, default 2) – The number of days to include as holiday candidates after each holiday.
baseline_offsets (list [int], default (-7, 7)) – The offsets in days as a baseline to compare with each holiday.
plot (bool, default False) – Whether to generate visualization.
independent_holiday_thres (float, default 0.8) – A certain proportion of the total holiday effects that are allocated for holidays that are modeled independently. For example, 0.8 means the holidays that contribute to the first 80% of the holiday effects are modeled independently.
together_holiday_thres (float, default 0.99) – A certain proportion of the total holiday effects that are allocated for holidays that are modeled together. For example, if independent_holiday_thres is 0.8 and together_holiday_thres is 0.9, then after the first 80% of the holiday effects are counted, the rest starts to be allocated for the holidays that are modeled together until the cumulative sum exceeds 0.9.
extra_years (int, default 2) – Extra years after self.year_end to pull holidays in self.country_holiday_df. This can be used to cover the forecast periods.
use_relative_score (bool, default False) – Whether the holiday effect is calculated as a relative ratio. If False, _get_score_for_dates uses the absolute difference compared to the baseline as the score. If True, it uses the relative ratio compared to the baseline as the score.
- Returns
result –
A dictionary with the following keys:
- “scores”: dict [str, list [float]]
The score_result from self._get_scores_for_holidays.
- “country_holiday_df”: pandas.DataFrame
The country_holiday_df from pypi:holidays_ext.
- “independent_holidays”: list [tuple [str, str]]
The holidays to be modeled independently. Each item is in (country, holiday) format.
- “together_holidays_positive”: list [tuple [str, str]]
The holidays with positive effects to be modeled together. Each item is in (country, holiday) format.
- “together_holidays_negative”: list [tuple [str, str]]
The holidays with negative effects to be modeled together. Each item is in (country, holiday) format.
- “fig”: plotly.graph_objs.Figure
The visualization if activated.
- Return type
dict [str, any] or None
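The threshold-based classification can be sketched with the worked numbers from the description (absolute scores summing to 1000, independent threshold 0.8). The function `classify_holidays` is hypothetical, and one detail is an assumption: a holiday is assigned according to the cumulative share reached before it is added. The library may treat the holiday that crosses a threshold differently. Holiday names are used as plain strings here rather than (country, holiday) tuples.

```python
def classify_holidays(avg_scores, independent_thres=0.8, together_thres=0.99):
    """Split holidays into modeling groups by cumulative share of absolute score."""
    total = sum(abs(score) for score in avg_scores.values())
    ranked = sorted(avg_scores.items(), key=lambda item: abs(item[1]), reverse=True)
    result = {
        "independent_holidays": [],
        "together_holidays_positive": [],
        "together_holidays_negative": [],
        "not_modeled": [],
    }
    cumulative = 0.0
    for name, score in ranked:
        # Assumption: compare the cumulative sum *before* adding this holiday.
        if cumulative < independent_thres * total:
            bucket = "independent_holidays"
        elif cumulative < together_thres * total:
            bucket = "together_holidays_positive" if score > 0 else "together_holidays_negative"
        else:
            bucket = "not_modeled"
        result[bucket].append(name)
        cumulative += abs(score)
    return result


# Made-up average scores whose absolute values sum to 1000.
avg_scores = {"A": -500, "B": 300, "C": 100, "D": -60, "E": 30, "F": -10}
print(classify_holidays(avg_scores))
```

With these numbers, A and B fill the first 800 and are modeled independently, C and E (positive) and D (negative) are modeled together, and F is not modeled.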
- generate_daily_event_dict(country_holiday_df: Optional[DataFrame] = None, holiday_result: Optional[Dict[str, List[Tuple[str, str]]]] = None) Dict[str, DataFrame] [source]
Generates daily event dict for all holidays inferred. The daily event dict will contain:
Single events for every holiday or holiday neighboring day that is to be modeled independently.
A single event for all holiday or holiday neighboring days with positive effects that are modeled together.
A single event for all holiday or holiday neighboring days with negative effects that are modeled together.
- Parameters
country_holiday_df (pandas.DataFrame or None, default None) – The dataframe that contains the country/holiday/dates information for holidays. Must cover the periods needed in training/forecasting for all holidays. This has the same format as self.country_holiday_df. If None, it pulls from self.country_holiday_df.
holiday_result (dict [str, list [tuple [str, str]]] or None, default None) –
A dictionary with the following keys:
INFERRED_INDEPENDENT_HOLIDAYS_KEY
INFERRED_GROUPED_POSITIVE_HOLIDAYS_KEY
INFERRED_GROUPED_NEGATIVE_HOLIDAYS_KEY
Each key’s value is a list of length-2 tuples of the format (country, holiday). This format is the output of self.infer_holidays. If None, it pulls from self.result.
- Returns
daily_event_dict – The daily event dict that is consumable by SimpleSilverkiteForecast or SilverkiteForecast. The keys are the event names. The values are dataframes with the event dates.
- Return type
dict
- class greykite.algo.common.holiday_grouper.HolidayGrouper(df: DataFrame, time_col: str, value_col: str, holiday_df: DataFrame, holiday_date_col: str, holiday_name_col: str, holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, get_suffix_func: Optional[Union[str, Callable]] = 'wd_we')[source]
This class estimates the impact of holidays and their neighboring days given a raw holiday dataframe holiday_df and a time series containing the observed values to construct the baselines. It groups events with similar effects into several groups using kernel density estimation (KDE) and generates the grouped events in a dictionary of dataframes that is recognizable by SilverkiteForecast.
- Parameters
df (pandas.DataFrame) – Input time series that contains time_col and value_col. The values will be used to construct baselines to estimate the holiday impact.
time_col (str) – Name of the time column in df.
value_col (str) – Name of the value column in df.
holiday_df (pandas.DataFrame) – Input holiday dataframe that contains the dates and names of the holidays.
holiday_date_col (str) – Name of the holiday date column in holiday_df.
holiday_name_col (str) – Name of the holiday name column in holiday_df.
holiday_impact_dict (Dict [str, Any] or None, default None) –
A dictionary containing the neighboring impacting days of a certain holiday. The key is the name of the holiday matching those in the provided holiday_df. The value is a tuple of two values indicating the number of neighboring days before and after the holiday. For example, a valid dictionary may look like:
holiday_impact_dict = {
    "Christmas Day": [3, 3],
    "Memorial Day": [0, 0]
}
get_suffix_func (Callable or str or None, default “wd_we”) –
A function that generates a suffix (usually a time feature, e.g. “_WD” for weekday, “_WE” for weekend) given an input date. This can be used to estimate the interaction between floating holidays and the day of week on which they are observed. We currently support two defaults:
“wd_we” to generate suffixes based on whether the day falls on a weekday or a weekend.
“dow_grouped” to generate three categories: [“_WD”, “_Sat”, “_Sun”].
If None, no suffix is added.
- expanded_holiday_df
An expansion of holiday_df after adding the neighboring dates provided in holiday_impact_dict and the suffix generated by get_suffix_func. For example, if "Christmas Day": [3, 3] and “wd_we” are used, events such as “Christmas Day_WD_plus_1_WE” or “Christmas Day_WD_minus_3_WD” will be generated for a Christmas that falls on Friday.
- Type
pandas.DataFrame
- baseline_offsets
The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.
- Type
Tuple[int] or None
- use_relative_score
Whether to use relative or absolute score when estimating the holiday impact.
- Type
bool or None
- clustering_method
Clustering method used to group the holidays. Since we are doing 1-D clustering, current supported methods include (1) “kde” for kernel density estimation, and (2) “kmeans” for k-means clustering.
- Type
str or None
- bandwidth
The bandwidth used in the kernel density estimation. A higher bandwidth results in fewer clusters. If None, it is automatically inferred with the bandwidth_multiplier factor.
- Type
float or None
- bandwidth_multiplier
Multiplier applied to the kernel density estimation’s default bandwidth, which is calculated with the rule-of-thumb bandwidth estimator (https://en.wikipedia.org/wiki/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator). This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. Only used when bandwidth is not specified.
- Type
float or None
- kde
The KernelDensity object if clustering_method == "kde".
- Type
KernelDensity
or None
- n_clusters
Number of clusters in the k-means algorithm.
- Type
int or None
- kmeans
The KMeans object if clustering_method == "kmeans".
- Type
KMeans
or None
- include_diagnostics
Whether to include kmeans_diagnostics and kmeans_plot in the output result_dict.
- Type
bool or None
- result_dict
A dictionary that stores the scores and clustering results, with the following keys.
- “holiday_inferrer”: the HolidayInferrer instance used for calculating the scores.
- “score_result_original”: a dictionary with keys being the names of all holiday events after expansion (i.e. the keys in expanded_holiday_df), values being a list of scores of all dates corresponding to this event.
- “score_result_avg_original”: a dictionary with the same keys as in result_dict["score_result_original"], but the values are the average scores of each event across all occurrences.
- “score_result”: same as result_dict["score_result_original"], but after removing holidays with inconsistent / negligible scores.
- “score_result_avg”: same as result_dict["score_result_avg_original"], but after removing holidays with inconsistent / negligible scores.
- “daily_event_df_dict_with_score”: a dictionary of dataframes. Key is the group name "holiday_group_{k}". Value is a dataframe of all holiday events in this group, containing 4 columns: “date” (EVENT_DF_DATE_COL), “event_name” (EVENT_DF_LABEL_COL), “original_name”, “avg_score”.
- “daily_event_df_dict”: a dictionary of dataframes that is ready to use in SilverkiteForecast. Contains 2 keys: EVENT_DF_DATE_COL and EVENT_DF_LABEL_COL.
- “kde_cutoffs”: a list of float, the cutoffs returned by the kernel density clustering.
- “kde_res”: a dataframe that contains “score” and “density” from the kernel density estimation.
- “kde_plot”: a plot of the kernel density estimation.
- “kmeans_diagnostics”: a dataframe containing metrics for different numbers of clusters. Columns are:
“k”: number of clusters;
“wsse”: within-cluster sum of squared error (lower is better);
“sil_score”: Silhouette coefficient, a value in [-1, 1] that describes the separation of clusters (higher is better).
Only generated when include_diagnostics is True. See group_holidays for details.
- “kmeans_plot”: a plot visualizing how the diagnostic metrics change over K. Only generated when include_diagnostics is True. See group_holidays for details.
- Type
Dict[str, Any] or None
- group_holidays(baseline_offsets: Tuple[int, int] = (-7, 7), use_relative_score: bool = True, min_n_days: int = 1, min_same_sign_ratio: float = 0.66, min_abs_avg_score: float = 0.05, clustering_method: str = 'kde', bandwidth: Optional[float] = None, bandwidth_multiplier: Optional[float] = 0.2, n_clusters: Optional[int] = 5, include_diagnostics: bool = False) None [source]
Estimates the impact of holidays and their neighboring days and groups events with similar effects into several groups using kernel density estimation (KDE). Then generates the grouped events and stores the results in self.result_dict.
- Parameters
baseline_offsets (Tuple[int], default (-7, 7)) – The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.
use_relative_score (bool, default True) – Whether to use relative or absolute score when estimating the holiday impact.
min_n_days (int, default 1) – Minimal number of occurrences for a holiday event to be kept before grouping.
min_same_sign_ratio (float, default 0.66) – Threshold of the ratio of the same-sign scores for an event’s occurrences. For example, if an event has two occurrences, they both need to have positive or negative scores for the ratio to achieve 0.66. Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. This parameter is intended to rule out holidays that have indefinite effects.
min_abs_avg_score (float, default 0.05) – The minimal average score of an event (across all its occurrences) to be kept before grouping. When use_relative_score = True, 0.05 means the effect must be greater than 5%.
clustering_method (str, default “kde”) – Clustering method used to group the holidays. Since we are doing 1-D clustering, currently supported methods include (1) “kde” for kernel density estimation, and (2) “kmeans” for k-means clustering.
bandwidth (float or None, default None) – The bandwidth used in the kernel density estimation. A higher bandwidth results in fewer clusters. If None, it is automatically inferred with the bandwidth_multiplier factor. Only used when clustering_method == "kde".
bandwidth_multiplier (float or None, default 0.2) – Multiplier applied to the kernel density estimation’s default bandwidth, which is calculated with the rule-of-thumb bandwidth estimator (https://en.wikipedia.org/wiki/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator). This multiplier has been found useful in adjusting the default bandwidth parameter in many cases. Only used when bandwidth is not specified and clustering_method == "kde".
n_clusters (int or None, default 5) – Number of clusters in the k-means algorithm. Only used when clustering_method == "kmeans".
include_diagnostics (bool, default False) – Whether to include kmeans_diagnostics and kmeans_plot in the output result_dict.
- Return type
Saves the results in the
result_dict
attribute.
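One way to picture 1-D clustering with kernel density estimation is to evaluate a Gaussian KDE on a grid and cut at its local minima. The sketch below is a simplified, standard-library illustration of that idea, not the library's implementation: `kde_cutoffs` and `group_by_cutoffs` are hypothetical helpers, the bandwidth is passed in directly rather than inferred from the rule of thumb and bandwidth_multiplier, and the numbering of groups from the lowest scores upward is an assumption.

```python
import math
from bisect import bisect_right


def kde_cutoffs(scores, bandwidth, grid_size=200):
    """Return the local minima of a Gaussian KDE evaluated on an even grid."""
    low = min(scores) - 3 * bandwidth
    high = max(scores) + 3 * bandwidth
    step = (high - low) / (grid_size - 1)
    grid = [low + i * step for i in range(grid_size)]
    # Unnormalized density: the normalizing constant does not move the minima.
    density = [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores)
        for x in grid
    ]
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if density[i - 1] > density[i] <= density[i + 1]
    ]


def group_by_cutoffs(avg_scores, cutoffs):
    """Group events by how many cutoffs lie below their average score."""
    groups = {}
    for name, score in avg_scores.items():
        key = f"holiday_group_{bisect_right(cutoffs, score)}"
        groups.setdefault(key, []).append(name)
    return groups


# Made-up average scores forming two clearly separated clusters.
avg_scores = {
    "event_a": -0.30, "event_b": -0.28, "event_c": -0.25,
    "event_d": 0.10, "event_e": 0.12, "event_f": 0.15,
}
cutoffs = kde_cutoffs(list(avg_scores.values()), bandwidth=0.05)
groups = group_by_cutoffs(avg_scores, cutoffs)
print(groups)
```

A larger bandwidth smooths the density and merges nearby modes, which is why a higher bandwidth yields fewer clusters.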
- get_holiday_scores(baseline_offsets: Tuple[int, int] = (-7, 7), use_relative_score: bool = True, min_n_days: int = 1, min_same_sign_ratio: float = 0.66, min_abs_avg_score: float = 0.05) Dict[str, Any] [source]
Computes the score of all holiday events and their neighboring days in self.expanded_holiday_df, by comparing their observed values with a baseline value that is an average of the values on the days specified in baseline_offsets. If a baseline date falls on another holiday, the algorithm looks for the next value with the same step size as the given offset, up to 3 extra iterations. Please see more details in _get_scores_for_holidays. An additional pruning step is done to remove holidays with inconsistent / negligible scores. Both the results before and after the pruning are returned.
- Parameters
baseline_offsets (Tuple[int], default (-7, 7)) – The offsets in days to calculate baselines for a given holiday. By default, the same days of the week before and after are used.
use_relative_score (bool, default True) – Whether to use relative or absolute score when estimating the holiday impact.
min_n_days (int, default 1) – Minimal number of occurrences for a holiday event to be kept before grouping.
min_same_sign_ratio (float, default 0.66) – Threshold of the ratio of the same-sign scores for an event’s occurrences. For example, if an event has two occurrences, they both need to have positive or negative scores for the ratio to achieve 0.66. Similarly, if an event has 3 occurrences, at least 2 of them must have the same directional impact. This parameter is intended to rule out holidays that have indefinite effects.
min_abs_avg_score (float, default 0.05) – The minimal average score of an event (across all its occurrences) to be kept before grouping. When use_relative_score = True, 0.05 means the effect must be greater than 5%.
- Returns
result_dict – A dictionary containing the scoring results. In particular, the following keys are set: “holiday_inferrer”, “score_result_original”, “score_result_avg_original”, “score_result”, and “score_result_avg”. Please refer to the docstring of the self.result_dict attribute of HolidayGrouper.
- Return type
Dict [str, Any]
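The pruning criteria can be sketched as a single predicate over an event's per-occurrence scores. `keep_event` is a hypothetical helper. Using `>=` for the same-sign ratio and a strict `>` for the average score follows the wording above ("achieve 0.66", "greater than 5%") but is an assumption about the exact comparisons.

```python
def keep_event(scores, min_n_days=1, min_same_sign_ratio=0.66, min_abs_avg_score=0.05):
    """Return True if an event's scores are frequent, consistent and large enough."""
    if not scores or len(scores) < min_n_days:
        return False
    positives = sum(1 for score in scores if score > 0)
    negatives = sum(1 for score in scores if score < 0)
    same_sign_ratio = max(positives, negatives) / len(scores)
    average = sum(scores) / len(scores)
    return same_sign_ratio >= min_same_sign_ratio and abs(average) > min_abs_avg_score


print(keep_event([0.2, -0.1]))         # False: only half the scores share a sign
print(keep_event([0.2, 0.1, -0.05]))   # True: 2 of 3 share a sign, average above 0.05
print(keep_event([0.01, 0.02]))        # False: consistent sign but negligible average
```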
- check_scores(holiday_name_pattern: str, show_pruned: bool = True) None [source]
Spot checks the score of certain holidays containing pattern holiday_name_pattern. Prints out the dates, individual day scores of all occurrences, and the average scores of all matching holiday events. Note that it only checks the keys in self.expanded_holiday_df, and it assumes get_holiday_scores has already been run.
- Parameters
holiday_name_pattern (str) – Any substring of the holiday event names (self.expanded_holiday_df[self.holiday_name_col]).
show_pruned (bool, default True) – Whether to show pruned holidays along with the remaining holidays.
- Returns
Prints out the dates, individual day scores of all occurrences,
and the average scores of all matching holiday events.
- check_holiday_group(holiday_name_pattern: str = '', holiday_groups: Optional[Union[int, List[int]]] = None) None [source]
Prints out the holiday groups that contain holidays matching holiday_name_pattern and their scores. The search is limited to the given holiday_groups. Note that it assumes group_holidays has already been run.
- Parameters
holiday_name_pattern (str) – Any substring of the holiday event names (self.expanded_holiday_df[self.holiday_name_col]).
holiday_groups (List[int] or int, default None) – The indices of the holiday groups to which the search is limited. If None, all groups are searched.
- Return type
Prints out all qualifying holiday groups and their scores.
- static expand_holiday_df_with_suffix(holiday_df: DataFrame, holiday_date_col: str, holiday_name_col: str, holiday_impact_dict: Optional[Dict[str, Tuple[int, int]]] = None, get_suffix_func: Optional[Union[str, Callable]] = 'wd_we') DataFrame [source]
Expands an input holiday dataframe holiday_df to include the neighboring days specified in holiday_impact_dict. Also adds suffixes generated by get_suffix_func to better model the effects of events falling on different days of week.
- Parameters
holiday_df (pandas.DataFrame) – Input holiday dataframe that contains the dates and names of the holidays.
holiday_date_col (str) – Name of the holiday date column in holiday_df.
holiday_name_col (str) – Name of the holiday name column in holiday_df.
holiday_impact_dict (Dict [str, Any] or None, default None) –
A dictionary containing the neighboring impacting days of a certain holiday. The key is the name of the holiday matching those in the provided holiday_df. The value is a tuple of two values indicating the number of neighboring days before and after the holiday. For example, a valid dictionary may look like:
holiday_impact_dict = {
    "Christmas Day": [3, 3],
    "Memorial Day": [0, 0]
}
get_suffix_func (Callable or str or None, default “wd_we”) –
A function that generates a suffix (usually a time feature, e.g. “_WD” for weekday, “_WE” for weekend) given an input date. This can be used to estimate the interaction between floating holidays and the day of week on which they are observed. We currently support two defaults:
“wd_we” to generate suffixes based on whether the day falls on a weekday or a weekend.
“dow_grouped” to generate three categories: [“_WD”, “_Sat”, “_Sun”].
If None, no suffix is added.
- Returns
expanded_holiday_df – An expansion of holiday_df after adding the neighboring dates provided in holiday_impact_dict and the suffix generated by get_suffix_func. For example, if "Christmas Day": [3, 3] and “wd_we” are used, events such as “Christmas Day_WD_plus_1_WE” or “Christmas Day_WD_minus_3_WD” will be generated for a Christmas that falls on Friday.
- Return type
pandas.DataFrame
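The naming scheme in the example above can be reproduced with a small standard-library sketch. `wd_we_suffix` and `expand_holiday` are hypothetical helpers that operate on a single holiday date rather than a dataframe. The label layout is inferred from the two example names in the text, so treat it as an assumption about the general pattern.

```python
from datetime import date, timedelta


def wd_we_suffix(day):
    """Return "_WE" for Saturday/Sunday and "_WD" otherwise."""
    return "_WE" if day.weekday() >= 5 else "_WD"


def expand_holiday(name, day, days_before, days_after, get_suffix=wd_we_suffix):
    """List (date, label) pairs for a holiday and its neighboring days."""
    base = f"{name}{get_suffix(day)}"
    events = []
    for shift in range(-days_before, days_after + 1):
        neighbor = day + timedelta(days=shift)
        if shift == 0:
            label = base
        else:
            direction = "minus" if shift < 0 else "plus"
            label = f"{base}_{direction}_{abs(shift)}{get_suffix(neighbor)}"
        events.append((neighbor, label))
    return events


# Christmas 2020 fell on a Friday.
events = expand_holiday("Christmas Day", date(2020, 12, 25), days_before=3, days_after=3)
for day, label in events:
    print(day, label)
```

The first and fifth entries are "Christmas Day_WD_minus_3_WD" (Tuesday) and "Christmas Day_WD_plus_1_WE" (Saturday), matching the names quoted in the description.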
Changepoint Detection
- class greykite.algo.changepoint.adalasso.changepoint_detector.ChangepointDetector[source]
A class to implement change point detection.
Currently supports long-term change point detection only. Input is a dataframe with time_col indicating the column of time info (the format should be able to be parsed by pd.to_datetime), and value_col indicating the column of observed time series values.
- original_df
The original data df, used to retrieve original observations, if aggregation is used in fitting change points.
- Type
- time_col
The column name for time column.
- Type
str
- value_col
The column name for value column.
- Type
str
- trend_potential_changepoint_n
The number of change points that are evenly distributed over the time period.
- Type
int
- yearly_seasonality_order
The yearly seasonality order used when fitting trend.
- Type
int
- y
The observations after aggregation.
- Type
- trend_df
The augmented df of the original_df, including regressors of trend change points and Fourier series for yearly seasonality.
- Type
- trend_model
The fitted trend model.
- Type
sklearn.base.RegressorMixin
- trend_coef
The estimated trend coefficients.
- Type
- trend_intercept
The estimated trend intercept.
- Type
float
- adaptive_lasso_coef
A list of length two. The first element is the estimated trend coefficients and the second element is the intercept, both estimated by adaptive lasso.
- Type
list
- trend_changepoints
The list of detected trend change points, parsable by pd.to_datetime
- Type
list
- trend_estimation
The estimated trend with detected trend change points.
- Type
pd.Series
- seasonality_df
The augmented df of original_df, including regressors of seasonality change points with different Fourier series frequencies.
- Type
- seasonality_changepoints
The dictionary of detected seasonality change points for each component. Keys are component names, and values are list of change points.
- Type
dict
- seasonality_estimation
The estimated seasonality with detected seasonality change points. The series has the same length as original_df. The index is the timestamp, and the values are the estimated seasonality at each timestamp. The seasonality estimation is the estimate of the seasonality effect after removing the trend estimated by estimate_trend_with_detected_changepoints.
- Type
- find_trend_changepoints : callable
Finds the potential trend change points for a given time series df.
- plot : callable
Plot the results after implementing find_trend_changepoints.
- find_trend_changepoints(df, time_col, value_col, yearly_seasonality_order=8, yearly_seasonality_change_freq=None, resample_freq='D', trend_estimator='ridge', adaptive_lasso_initial_estimator='ridge', regularization_strength=None, actual_changepoint_min_distance='30D', potential_changepoint_distance=None, potential_changepoint_n=100, potential_changepoint_n_max=None, no_changepoint_distance_from_begin=None, no_changepoint_proportion_from_begin=0.0, no_changepoint_distance_from_end=None, no_changepoint_proportion_from_end=0.0, fast_trend_estimation=True)[source]
Finds trend change points automatically by adaptive lasso.
The algorithm aggregates the data at a user-defined frequency, daily by default.
If potential_changepoint_distance is not given, potential_changepoint_n potential change points are evenly distributed over the time period; otherwise potential_changepoint_n is overridden by total_time_length / potential_changepoint_distance.
Users can specify either no_changepoint_proportion_from_end, the proportion of data at the end in which no change points are placed, or no_changepoint_distance_from_end (which overrides no_changepoint_proportion_from_end), the length of time at the end in which no change points are placed.
The potential change points are then selected by adaptive lasso, with the initial estimator specified by adaptive_lasso_initial_estimator. If the user specifies regularization_strength, the adaptive lasso is run with a single tuning parameter calculated from the user-provided prior; otherwise cross-validation is run to select the tuning parameter automatically.
A yearly seasonality is fitted at the same time, which prevents the trend from capturing yearly periodic changes.
A rule-based guard function is applied at the end to ensure change points are not too close together, as specified by actual_changepoint_min_distance.
- Parameters
df (pandas.DataFrame) – The data df.
time_col (str) – Time column name in df.
value_col (str) – Value column name in df.
yearly_seasonality_order (int, default 8) – Fourier series order to capture yearly seasonality.
yearly_seasonality_change_freq (DateOffset, Timedelta or str or None, default None) –
How often to change the yearly seasonality model. Set to None to disable this feature.
This is useful if you have more than 2.5 years of data and the detected trend without this feature is inaccurate because yearly seasonality changes over the training period. Modeling yearly seasonality separately over each period can prevent trend changepoints from fitting changes in yearly seasonality. For example, if you have 2.5 years of data and yearly seasonality increases in magnitude after the first year, setting this parameter to “365D” will model each year’s yearly seasonality differently and capture both shapes. Without this feature, both years will have the same yearly seasonality, roughly the average effect across the training set.
Note that if you use str as input, the maximal supported unit is day, i.e., you might use “200D” but not “12M” or “1Y”.
resample_freq (DateOffset, Timedelta, str or None, default “D”.) – The frequency to aggregate data. Coarser aggregation leads to fitting longer term trends. If None, no aggregation will be done.
trend_estimator (str in [“ridge”, “lasso” or “ols”], default “ridge”.) – The estimator to estimate trend. The estimated trend is only for plotting purposes. ‘ols’ is not recommended when yearly_seasonality_order is specified other than 0, because significant over-fitting will happen. In this case, the given value is overridden by “ridge”.
adaptive_lasso_initial_estimator (str in [“ridge”, “lasso” or “ols”], default “ridge”.) – The initial estimator to compute adaptive lasso weights.
regularization_strength (float in [0, 1] or None) – The regularization for change points. Greater value implies fewer change points. 0 indicates all change points, and 1 indicates no change point. If None, the tuning parameter will be selected by cross-validation. If a value is given, it will be used as the tuning parameter.
actual_changepoint_min_distance (DateOffset, Timedelta or str, default “30D”) – The minimal distance allowed between detected change points. If consecutive change points are within this minimal distance, the one with smaller absolute change coefficient will be dropped. Note: maximal unit is ‘D’, i.e., you may use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_distance (DateOffset, Timedelta, str or None, default None) – The distance between potential change points. If provided, will override the parameter potential_changepoint_n. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_n (int, default 100) – Number of change points to be evenly distributed, recommended 1-2 per month, based on the training data length.
potential_changepoint_n_max (int or None, default None) – The maximum number of potential changepoints. This parameter is effective when the user specifies potential_changepoint_distance and the number of potential changepoints in the training data is more than potential_changepoint_n_max; it is then equivalent to specifying potential_changepoint_n = potential_changepoint_n_max and ignoring potential_changepoint_distance.
no_changepoint_distance_from_begin (DateOffset, Timedelta, str or None, default None) – The length of time from the beginning of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_begin. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_begin (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period; however, change points that are located within the first no_changepoint_proportion_from_begin proportion of the training period will not be used for change point detection.
no_changepoint_distance_from_end (DateOffset, Timedelta, str or None, default None) – The length of time from the end of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_end. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_end (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period; however, change points that are located within the last no_changepoint_proportion_from_end proportion of the training period will not be used for change point detection.
fast_trend_estimation (bool, default True) – If True, the trend estimation is not refitted on the original data, but is a linear interpolation of the fitted trend from the resampled time series. If False, the trend estimation is refitted on the original data.
- Returns
result – result dictionary with keys:
"trend_feature_df"
pandas.DataFrame
The augmented df for change detection, in other words, the design matrix for the regression model. Columns:
’changepoint0’: regressor for change point 0, equals the continuous time of the observation minus the continuous time for time of origin.
…
’changepoint{potential_changepoint_n}’: regressor for change point {potential_changepoint_n}, equals the continuous time of the observation minus the continuous time of the {potential_changepoint_n}th change point.
’cos1_conti_year_yearly’: cosine yearly seasonality regressor of first order.
’sin1_conti_year_yearly’: sine yearly seasonality regressor of first order.
…
’cos{yearly_seasonality_order}_conti_year_yearly’ : cosine yearly seasonality regressor of {yearly_seasonality_order}th order.
’sin{yearly_seasonality_order}_conti_year_yearly’ : sine yearly seasonality regressor of {yearly_seasonality_order}th order.
"trend_changepoints"
list
The list of detected change points.
"changepoints_dict"
dict
The change point dictionary that is compatible as an input with forecast.
"trend_estimation"
pandas.Series
The estimated trend with detected trend change points.
- Return type
dict
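The placement rules described above for potential change points can be sketched in plain Python. This is a minimal illustration and not Greykite's implementation: the helper name place_potential_changepoints is hypothetical, the exact spacing of the evenly distributed candidates is an assumption, and only the proportion-based exclusion windows are modeled.

```python
from datetime import datetime, timedelta


def place_potential_changepoints(
        start,
        end,
        potential_changepoint_n=100,
        potential_changepoint_distance=None,
        potential_changepoint_n_max=None,
        no_changepoint_proportion_from_begin=0.0,
        no_changepoint_proportion_from_end=0.0):
    """Hypothetical helper: returns candidate change point timestamps."""
    total = end - start
    n = potential_changepoint_n
    if potential_changepoint_distance is not None:
        # The distance overrides the count: n = total_time_length / distance.
        n = int(total / potential_changepoint_distance)
        if potential_changepoint_n_max is not None:
            # Cap the number of candidates implied by the distance.
            n = min(n, potential_changepoint_n_max)
    # Assumption: candidates are spread evenly strictly inside (start, end).
    step = total / (n + 1)
    candidates = [start + step * (i + 1) for i in range(n)]
    # Candidates inside the excluded windows at either end are not used.
    lower = start + total * no_changepoint_proportion_from_begin
    upper = end - total * no_changepoint_proportion_from_end
    return [t for t in candidates if lower <= t <= upper]


start = datetime(2020, 1, 1)
end = start + timedelta(days=100)
candidates = place_potential_changepoints(
    start,
    end,
    potential_changepoint_distance=timedelta(days=10),
    no_changepoint_proportion_from_end=0.2)
print(len(candidates), candidates[-1])
```

In this sketch the 10-day distance implies 10 candidates over the 100 days, and the last 20% of the period is then excluded from detection.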
- find_seasonality_changepoints(df, time_col, value_col, seasonality_components_df= name period order seas_names 0 tod 24.0 3 daily 1 tow 7.0 3 weekly 2 conti_year 1.0 5 yearly, resample_freq='H', regularization_strength=0.6, actual_changepoint_min_distance='30D', potential_changepoint_distance=None, potential_changepoint_n=50, no_changepoint_distance_from_end=None, no_changepoint_proportion_from_end=0.0, trend_changepoints=None)[source]
Finds the seasonality change points (defined as the time points where seasonality magnitude changes, i.e., the time series becomes “fatter” or “thinner”.)
Subtracts the estimated trend from the original time series first, then uses regression-based regularization methods to select important seasonality change points. Regressors are built from truncated Fourier series.
If you have run find_trend_changepoints before running find_seasonality_changepoints with the same df, the estimated trend will be automatically used for removing trend in find_seasonality_changepoints. Otherwise, find_trend_changepoints will be run automatically with the same parameters as you passed to find_seasonality_changepoints. If you do not want to use the same parameters, run find_trend_changepoints with your desired parameters before calling find_seasonality_changepoints.
The algorithm aggregates the data at a user-defined frequency, hourly by default.
The regression features consist of potential_changepoint_n + 1 blocks of predictors. The first block consists of Fourier series according to seasonality_components_df, and the other blocks are copies of the first block truncated at the corresponding potential change point.
If potential_changepoint_distance is not given, potential_changepoint_n potential change points are evenly distributed over the time period; otherwise potential_changepoint_n is overridden by total_time_length / potential_changepoint_distance.
Users can specify either no_changepoint_proportion_from_end, the proportion of data at the end in which no change points are placed, or no_changepoint_distance_from_end (which overrides no_changepoint_proportion_from_end), the length of time at the end in which no change points are placed.
The potential change points are then selected by adaptive lasso, with the initial estimator specified by adaptive_lasso_initial_estimator. The regularization strength is specified by regularization_strength, which lies between 0 and 1.
A rule-based guard function is applied at the end to ensure change points are not too close together, as specified by actual_changepoint_min_distance.
- Parameters
df (pandas.DataFrame) – The data df.
time_col (str) – Time column name in df.
value_col (str) – Value column name in df.
seasonality_components_df (pandas.DataFrame) – The df to generate seasonality design matrix, which is compatible with seasonality_components_df in find_seasonality_changepoints.
resample_freq (DateOffset, Timedelta or str, default “H”.) – The frequency to aggregate data. Coarser aggregation leads to fitting longer term trends.
regularization_strength (float in [0, 1] or None, default 0.6.) – The regularization for change points. Greater value implies fewer change points. 0 indicates all change points, and 1 indicates no change point. If None, the tuning parameter will be selected by cross-validation. If a value is given, it will be used as the tuning parameter. Here “None” is not recommended, because seasonality change has different levels, and automatic selection by cross-validation may produce more change points than desired. In practice, 0.6 is a good choice for most cases, and tuning around 0.6 is recommended.
actual_changepoint_min_distance (DateOffset, Timedelta or str, default “30D”) – The minimal distance allowed between detected change points. If consecutive change points are within this minimal distance, the one with smaller absolute change coefficient will be dropped. Note: maximal unit is ‘D’, i.e., you may use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_distance (DateOffset, Timedelta, str or None, default None) – The distance between potential change points. If provided, will override the parameter potential_changepoint_n. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
potential_changepoint_n (int, default 50) – Number of change points to be evenly distributed, recommended 1 per month, based on the training data length.
no_changepoint_distance_from_end (DateOffset, Timedelta, str or None, default None) – The length of time from the end of training data, within which no change point will be placed. If provided, will override the parameter no_changepoint_proportion_from_end. Note: maximal unit is ‘D’, i.e., you may only use units no more than ‘D’ such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher has either cycles or indefinite number of days, thus is not parsable by pandas as timedelta.
no_changepoint_proportion_from_end (float in [0, 1], default 0.0.) – potential_changepoint_n change points will be placed evenly over the whole training period; however, only change points that are not located within the last no_changepoint_proportion_from_end proportion of the training period will be used for change point detection.
trend_changepoints (list or None) – A list of user-specified trend change points, used to estimate the trend to be removed from the time series before detecting seasonality change points. If provided, the algorithm will not check for existing detected trend change points or run find_trend_changepoints, but will use these change points directly for trend estimation.
- Returns
result – result dictionary with keys:
"seasonality_feature_df"
pandas.DataFrame
The augmented df for seasonality changepoint detection, in other words, the design matrix for the regression model. Columns:
”cos1_tod_daily”: cosine daily seasonality regressor of first order at change point 0.
”sin1_tod_daily”: sine daily seasonality regressor of first order at change point 0.
…
”cos1_conti_year_yearly”: cosine yearly seasonality regressor of first order at change point 0.
”sin1_conti_year_yearly”: sine yearly seasonality regressor of first order at change point 0.
…
”cos{daily_seasonality_order}_tod_daily_cp{potential_changepoint_n}” : cosine daily seasonality regressor of {daily_seasonality_order}th order at change point {potential_changepoint_n}.
”sin{daily_seasonality_order}_tod_daily_cp{potential_changepoint_n}” : sine daily seasonality regressor of {daily_seasonality_order}th order at change point {potential_changepoint_n}.
…
”cos{yearly_seasonality_order}_conti_year_yearly_cp{potential_changepoint_n}” : cosine yearly seasonality regressor of {yearly_seasonality_order}th order at change point {potential_changepoint_n}.
”sin{yearly_seasonality_order}_conti_year_yearly_cp{potential_changepoint_n}” : sine yearly seasonality regressor of {yearly_seasonality_order}th order at change point {potential_changepoint_n}.
"seasonality_changepoints"
dict`[`list`[`datetime]]The dictionary of detected seasonality change points for each component. Keys are component names, and values are list of change points.
"seasonality_estimation"
pandas.Series
The estimated seasonality with detected seasonality change points. The series has the same length as original_df. The index is the timestamp, and the values are the estimated seasonality at each timestamp. The seasonality estimation is the estimate of the seasonality effect after removing the trend estimated by estimate_trend_with_detected_changepoints.
"seasonality_components_df"
pandas.DataFrame
The processed seasonality_components_df. The daily component row is removed if the inferred frequency or aggregation frequency is at least one day.
- Return type
dict
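The rule-based guard controlled by actual_changepoint_min_distance can be illustrated with a short sketch. This is an assumption-laden illustration rather than the library's code: the helper enforce_min_distance is hypothetical, change points are assumed to be sorted in time, and "within the minimal distance" is interpreted here as strictly closer than min_distance.

```python
from datetime import datetime, timedelta


def enforce_min_distance(changepoints, coefs, min_distance):
    """Hypothetical guard: drops the weaker of two change points that are too close.

    changepoints: timestamps sorted in ascending order.
    coefs: change coefficients aligned with ``changepoints``.
    """
    kept = []  # (timestamp, coefficient) pairs that survive the guard
    for t, c in zip(changepoints, coefs):
        if kept and t - kept[-1][0] < min_distance:
            # Too close to the previously kept change point:
            # keep whichever has the larger absolute coefficient.
            if abs(c) > abs(kept[-1][1]):
                kept[-1] = (t, c)
        else:
            kept.append((t, c))
    return [t for t, _ in kept]


detected = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 10),
    datetime(2020, 3, 1),
    datetime(2020, 3, 20),
    datetime(2020, 6, 1),
]
coefficients = [0.5, -2.0, 1.0, 0.3, 0.8]
print(enforce_min_distance(detected, coefficients, timedelta(days=30)))
```

With a 30-day minimum, January 10 replaces January 1 (larger absolute coefficient) and March 20 is dropped in favor of March 1.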
- plot(observation=True, observation_original=True, trend_estimate=True, trend_change=True, yearly_seasonality_estimate=False, adaptive_lasso_estimate=False, seasonality_change=False, seasonality_change_by_component=True, seasonality_estimate=False, plot=True)[source]
Makes a plot to show the observations/estimations/change points.
In this function, component parameters specify whether each component is included in the plot. These are bool variables. For those components that are set to True, their values will be replaced by the corresponding data. Other components' values will be set to None. These variables are then fed into plot_change.
- Parameters
observation (bool) – Whether to include observation.
observation_original (bool) – Set True to plot original observations, and False to plot aggregated observations. No effect if observation is False.
trend_estimate (bool) – Set True to add trend estimation.
trend_change (bool) – Set True to add change points.
yearly_seasonality_estimate (bool) – Set True to add estimated yearly seasonality.
adaptive_lasso_estimate (bool) – Set True to add adaptive lasso estimated trend.
seasonality_change (bool) – Set True to add seasonality change points.
seasonality_change_by_component (bool) – If True, seasonality changes will be plotted separately for different components; otherwise all will use the same symbol. No effect if seasonality_change is False.
seasonality_estimate (bool) – Set True to add estimated seasonality. The seasonality is plotted around the trend, so the actual seasonality shown is trend estimation + seasonality estimation.
plot (bool, default True) – Set to True to display the plot, and set to False to return the plotly figure object.
- Returns
None (if plot == True) – The function shows a plot.
fig (plotly.graph_objects.Figure) – The plot object.
Benchmarking
- class greykite.framework.benchmark.benchmark_class.BenchmarkForecastConfig(df: ~pandas.core.frame.DataFrame, configs: ~typing.Dict[str, ~greykite.framework.templates.autogen.forecast_config.ForecastConfig], tscv: ~greykite.sklearn.cross_validation.RollingTimeSeriesSplit, forecaster: ~greykite.framework.templates.forecaster.Forecaster = <greykite.framework.templates.forecaster.Forecaster object>)[source]
Class for benchmarking multiple ForecastConfig on a rolling window basis.
- df
Timeseries data to forecast. Contains columns [time_col, value_col], and optional regressor columns. Regressor columns should include future values for prediction.
- Type
- configs
Dictionary of model configurations. A model configuration is a ForecastConfig. See ForecastConfig for details on valid ForecastConfig. Validity of the configs for benchmarking is checked via the validate method.
- Type
Dict[str, ForecastConfig]
- tscv
Cross-validation object that determines the rolling window evaluation. See RollingTimeSeriesSplit for details. The forecast_horizon and periods_between_train_test parameters of configs are matched against those of tscv. A ValueError is raised if there is a mismatch.
- forecaster
Forecaster used to create the forecasts.
- Type
- is_run
Indicator of whether the run method has been executed. After executing run, this indicator is set to True. Some class methods, such as get_forecast, require is_run to be True before they can be executed.
- Type
bool, default False
- result
Stores the benchmarking results. Has the same keys as configs.
- Type
dict
- forecasts
Merged DataFrame of forecasts, upper and lower confidence interval for all input configs. Also stores train end date and forecast step for each prediction.
- Type
pandas.DataFrame, default None
- validate()[source]
Validates the inputs to the class for the method run.
Raises a ValueError if there is a mismatch between the following parameters of configs and tscv:
forecast_horizon
periods_between_train_test
Raises a ValueError if the configs do not all have the same coverage parameter.
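The checks performed by validate can be sketched as follows. SimpleConfig and SimpleSplit are hypothetical, simplified stand-ins for ForecastConfig and RollingTimeSeriesSplit, and validate_configs is an illustrative function, not the library's method; the sketch only mirrors the three rules stated above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SimpleConfig:
    """Hypothetical stand-in for ForecastConfig."""
    forecast_horizon: int
    periods_between_train_test: int = 0
    coverage: Optional[float] = None


@dataclass
class SimpleSplit:
    """Hypothetical stand-in for RollingTimeSeriesSplit."""
    forecast_horizon: int
    periods_between_train_test: int = 0


def validate_configs(configs: Dict[str, SimpleConfig], tscv: SimpleSplit) -> None:
    """Raises ValueError if configs and tscv are not jointly valid."""
    coverages = set()
    for name, config in configs.items():
        # Each config must agree with the cross-validation object.
        for attr in ("forecast_horizon", "periods_between_train_test"):
            if getattr(config, attr) != getattr(tscv, attr):
                raise ValueError(f"{attr} of config '{name}' does not match tscv")
        coverages.add(config.coverage)
    # All configs must share one coverage value.
    if len(coverages) > 1:
        raise ValueError("all configs must have the same coverage")


configs = {
    "default": SimpleConfig(forecast_horizon=7, coverage=0.95),
    "custom": SimpleConfig(forecast_horizon=7, coverage=0.95),
}
validate_configs(configs, SimpleSplit(forecast_horizon=7))  # passes silently
```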
- run()[source]
Runs every config and stores the output of the forecast_pipeline. This function runs only if the configs and tscv are jointly valid.
- Returns
self – Returns self. Stores the pipeline output of every config in self.result.
- extract_forecasts()[source]
Extracts forecasts, upper and lower confidence interval for each individual config. This is saved as a pandas.DataFrame with the name rolling_forecast_df within the corresponding config of self.result. For example, if the config key is “silverkite”, the forecasts are stored in self.result["silverkite"]["rolling_forecast_df"].
This method also constructs a merged DataFrame of forecasts, upper and lower confidence interval for all input configs.
- plot_forecasts_by_step(forecast_step: int, config_names: Optional[List] = None, xlabel: str = 'ts', ylabel: str = 'y', title: Optional[str] = None, showlegend: bool = True)[source]
Returns a forecast_step ahead rolling forecast plot. The plot consists of one line for each valid name in config_names. If available, the corresponding actual values are also plotted.
For a more customizable plot, see plot_multivariate.
- Parameters
forecast_step (int) – Which forecast step to plot. A forecast step is an integer between 1 and the forecast horizon, inclusive, indicating the number of periods from train end date to the prediction date (# steps ahead).
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default TIME_COL) – x-axis label.
ylabel (str or None, default VALUE_COL) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on forecast_step.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly graph. Plots multiple column(s) in self.forecasts against TIME_COL.
See the plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
- plot_forecasts_by_config(config_name: str, colors: List = ['rgb(31, 119, 180)', 'rgb(255, 127, 14)', 'rgb(44, 160, 44)', 'rgb(214, 39, 40)', 'rgb(148, 103, 189)', 'rgb(140, 86, 75)', 'rgb(227, 119, 194)', 'rgb(127, 127, 127)', 'rgb(188, 189, 34)', 'rgb(23, 190, 207)'], xlabel: str = 'ts', ylabel: str = 'y', title: Optional[str] = None, showlegend: bool = True)[source]
Returns a rolling plot of the forecasts by config_name against TIME_COL. The plot consists of one line for each available split. Some lines may overlap if the test periods of the corresponding splits intersect, so every line is given a different color. If available, the corresponding actual values are also plotted.
For a more customizable plot, see plot_multivariate_grouped.
- Parameters
config_name (str) – Which config result to plot. The name must match the name of one of the input configs.
colors ([str, List[str]], default DEFAULT_PLOTLY_COLORS) – Which colors to use to build the color palette. This can be a list of RGB colors or a str from PLOTLY_SCALES. To use a single color for all lines, pass a List with a single color.
xlabel (str or None, default TIME_COL) – x-axis label.
ylabel (str or None, default VALUE_COL) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on config_name.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly graph. Plots multiple column(s) in self.forecasts against TIME_COL.
- Return type
- get_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None)[source]
Returns rolling train and test evaluation metric values.
- Parameters
metric_dict (dict [str, callable]) –
Evaluation metrics to compute.
key: evaluation metric name, used to create column name in output.
value: metric function to apply to forecast df in each split to generate the column value. Signature (y_true: str, y_pred: str) -> transformed value: float.
For example:
    metric_dict = {
        "median_residual": lambda y_true, y_pred: np.median(y_true - y_pred),
        "mean_squared_error": lambda y_true, y_pred: np.mean((y_true - y_pred)**2)
    }
Some predefined functions are available in evaluation. For example:
    metric_dict = {
        "correlation": lambda y_true, y_pred: correlation(y_true, y_pred),
        "RMSE": lambda y_true, y_pred: root_mean_squared_error(y_true, y_pred),
        "Q_95": lambda y_true, y_pred: quantile_loss(y_true, y_pred, q=0.95)
    }
As shorthand, it is sufficient to provide the corresponding EvaluationMetricEnum member. These are auto-expanded into the appropriate function. So the following is equivalent:
    metric_dict = {
        "correlation": EvaluationMetricEnum.Correlation,
        "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
        "Q_95": EvaluationMetricEnum.Quantile95
    }
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
- Returns
evaluation_metrics_df – A DataFrame containing splitwise train and test evaluation metrics for metric_dict and config_names.
For example, assume:
    metric_dict = {
        "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
        "Q_95": EvaluationMetricEnum.Quantile95
    }
    config_names = ["default_prophet", "custom_silverkite"]
These are valid config_names and there are 2 splits for each. Then evaluation_metrics_df =
    config_name        split_num  train_RMSE  test_RMSE  train_Q_95  test_Q_95
    default_prophet    0          *           *          *           *
    default_prophet    1          *           *          *           *
    custom_silverkite  0          *           *          *           *
    custom_silverkite  1          *           *          *           *
where * represents computed values.
- Return type
pd.DataFrame
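To make the shape of the output concrete, the following sketch applies a metric_dict to train and test data of each split and builds one row per split with train_ and test_ prefixed columns. It uses plain lists instead of DataFrames, and evaluate_splits together with its input layout is a hypothetical simplification, not the library's implementation.

```python
from statistics import mean, median


def evaluate_splits(splits, metric_dict):
    """Hypothetical helper: computes every metric on train and test data of each split.

    splits: list of dicts mapping "train" and "test" to (y_true, y_pred) lists.
    Returns one row (dict) per split, with columns named {train,test}_{metric}.
    """
    rows = []
    for split_num, split in enumerate(splits):
        row = {"split_num": split_num}
        for name, func in metric_dict.items():
            for which in ("train", "test"):
                y_true, y_pred = split[which]
                row[f"{which}_{name}"] = func(y_true, y_pred)
        rows.append(row)
    return rows


metric_dict = {
    "median_residual": lambda y_true, y_pred: median(
        [a - b for a, b in zip(y_true, y_pred)]),
    "mean_squared_error": lambda y_true, y_pred: mean(
        [(a - b) ** 2 for a, b in zip(y_true, y_pred)]),
}

splits = [
    {"train": ([1, 2, 3], [1, 2, 2]), "test": ([4, 5], [3, 5])},
    {"train": ([2, 3, 4], [2, 3, 4]), "test": ([5, 6], [5, 7])},
]
for row in evaluate_splits(splits, metric_dict):
    print(row)
```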
- plot_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, xlabel: Optional[str] = None, ylabel: str = 'Metric value', title: Optional[str] = None, showlegend: bool = True)[source]
Returns a barplot of the train and test values of metric_dict for config_names. The values of a metric for all config_names are plotted as a grouped bar. Train and test values of a metric are plotted side-by-side for easy comparison.
- Parameters
metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics. To get the best visualization, keep number of metrics <= 2.
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default None) – x-axis label.
ylabel (str or None, default “Metric value”) – y-axis label.
title (str or None, default None) – Plot title.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly bar plot.
- Return type
- get_grouping_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, which: str = 'train', groupby_time_feature: Optional[str] = None, groupby_sliding_window_size: Optional[int] = None, groupby_custom_column: Optional[Series] = None)[source]
Returns splitwise rolling evaluation metric values. These values are grouped by the grouping method chosen by groupby_time_feature, groupby_sliding_window_size and groupby_custom_column.
See get_grouping_evaluation for details on grouping method.
- Parameters
metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics.
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
which (str, default “train”) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, default None) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, default None) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, default None) – If provided, groups by this column value. Should be same length as the DataFrame.
- Returns
grouped_evaluation_df – A DataFrame containing splitwise train and test evaluation metrics for metric_dict and config_names. The evaluation metrics are grouped by the grouping method.
- Return type
- plot_grouping_evaluation_metrics(metric_dict: Dict, config_names: Optional[List] = None, which: str = 'train', groupby_time_feature: Optional[str] = None, groupby_sliding_window_size: Optional[int] = None, groupby_custom_column: Optional[Series] = None, xlabel=None, ylabel='Metric value', title=None, showlegend=True)[source]
Returns a line plot of the grouped evaluation values of metric_dict for config_names. These values are grouped by the grouping method chosen by groupby_time_feature, groupby_sliding_window_size and groupby_custom_column.
See get_grouping_evaluation for details on grouping method.
- Parameters
metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics. To get the best visualization, keep number of metrics <= 2.
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
which (str, default “train”) – “train” or “test”. Which dataset to evaluate.
groupby_time_feature (str or None, optional) – If provided, groups by a column generated by build_time_features_df. See that function for valid values.
groupby_sliding_window_size (int or None, optional) – If provided, sequentially partitions data into groups of size groupby_sliding_window_size.
groupby_custom_column (pandas.Series or None, optional) – If provided, groups by this column value. Should be same length as the DataFrame.
xlabel (str or None, default None) – x-axis label. If None, label is determined by the groupby column name.
ylabel (str or None, default “Metric value”) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on config_name.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly graph.
- Return type
- get_runtimes(config_names: Optional[List] = None)[source]
Returns rolling average runtime in seconds for config_names.
- Parameters
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
- Returns
runtimes_df – A DataFrame containing splitwise runtime in seconds for config_names.
For example, assume:
    config_names = ["default_prophet", "custom_silverkite"]
These are valid config_names and there are 2 splits for each. Then runtimes_df =
    config_name        split_num  runtime_sec
    default_prophet    0          *
    default_prophet    1          *
    custom_silverkite  0          *
    custom_silverkite  1          *
where * represents computed values.
- Return type
pd.DataFrame
- plot_runtimes(config_names: Optional[List] = None, xlabel: Optional[str] = None, ylabel: str = 'Mean runtime in seconds', title: str = 'Average runtime across rolling windows', showlegend: bool = True)[source]
Returns a barplot of the runtimes of config_names.
- Parameters
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
xlabel (str or None, default None) – x-axis label.
ylabel (str or None, default “Mean runtime in seconds”) – y-axis label.
title (str or None, default “Average runtime across rolling windows”) – Plot title.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly bar plot.
- Return type
- get_valid_config_names(config_names: Optional[List] = None)[source]
Validates config_names against keys of configs. Raises a ValueError in case of a mismatch.
- Parameters
config_names (list [str], default None) – Which config results to plot. A list of config names. If None, uses all the available config keys.
- Returns
config_names – List of valid config names.
- Return type
list
- static autocomplete_metric_dict(metric_dict, enum_class)[source]
Sweeps through metric_dict, converting members of enum_class to their corresponding evaluation function. For example:
    metric_dict = {
        "correlation": EvaluationMetricEnum.Correlation,
        "RMSE": EvaluationMetricEnum.RootMeanSquaredError,
        "Q_95": EvaluationMetricEnum.Quantile95,
        "custom_metric": custom_function
    }
is converted to
    metric_dict = {
        "correlation": correlation(y_true, y_pred),
        "RMSE": root_mean_squared_error(y_true, y_pred),
        "Q_95": quantile_loss_q(y_true, y_pred, q=0.95),
        "custom_metric": custom_function
    }
- Parameters
metric_dict (dict [str, callable]) – Evaluation metrics to compute. Same as get_evaluation_metrics.
enum_class (Enum) – The enum class
metric_dict
elements might be member of. It must have a methodget_metric_func
.
- Returns
updated_metric_dict – Autocompleted metric dict.
- Return type
dict
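The conversion can be illustrated with a toy enum. Everything below is hypothetical: ToyMetricEnum, its two metrics, and the autocomplete helper are stand-ins written for this sketch, and only the get_metric_func method name is taken from the description above.

```python
from enum import Enum


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def max_error(y_true, y_pred):
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


class ToyMetricEnum(Enum):
    """Hypothetical enum; values are tuples because a bare function
    assigned in an Enum body becomes a method, not a member."""
    MeanAbsoluteError = (mean_absolute_error, "MAE")
    MaxError = (max_error, "MaxErr")

    def get_metric_func(self):
        return self.value[0]


def autocomplete(metric_dict, enum_class):
    """Replace enum members with their functions; keep other callables as-is."""
    return {
        name: metric.get_metric_func() if isinstance(metric, enum_class) else metric
        for name, metric in metric_dict.items()
    }


def custom_function(y_true, y_pred):
    return 0.0


updated = autocomplete(
    {"MAE": ToyMetricEnum.MeanAbsoluteError, "custom_metric": custom_function},
    ToyMetricEnum,
)
```

Keys are preserved, so a custom callable stays under the name the caller gave it.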
Cross Validation
- class greykite.sklearn.cross_validation.RollingTimeSeriesSplit(forecast_horizon, min_train_periods=None, expanding_window=False, use_most_recent_splits=False, periods_between_splits=None, periods_between_train_test=0, max_splits=3)[source]
Flexible splitter for time-series cross validation and rolling window evaluation. Suitable for use in GridSearchCV.
- min_splits
Guaranteed min number of splits. This is always set to 1. If provided configuration results in 0 splits, the cross validator will yield a default split.
- Type
- __starting_test_index
Test end index of the first CV split. Actual offset = __starting_test_index + _get_offset(X), for a particular dataset X. Cross validator ensures the last test split contains the last observation in X.
- Type
Examples
>>> import numpy as np
>>> from greykite.sklearn.cross_validation import RollingTimeSeriesSplit
>>> X = np.random.rand(20, 4)
>>> tscv = RollingTimeSeriesSplit(forecast_horizon=3, max_splits=4)
>>> tscv.get_n_splits(X=X)
4
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[2 3 4 5 6 7] [ 8  9 10]
[ 5  6  7  8  9 10] [11 12 13]
[ 8  9 10 11 12 13] [14 15 16]
[11 12 13 14 15 16] [17 18 19]
>>> X = np.random.rand(20, 4)
>>> tscv = RollingTimeSeriesSplit(forecast_horizon=2,
...                               min_train_periods=4,
...                               expanding_window=True,
...                               periods_between_splits=4,
...                               periods_between_train_test=2,
...                               max_splits=None)
>>> tscv.get_n_splits(X=X)
4
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[0 1 2 3] [6 7]
[0 1 2 3 4 5 6 7] [10 11]
[ 0  1  2  3  4  5  6  7  8  9 10 11] [14 15]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] [18 19]
>>> X = np.random.rand(5, 4)  # default split if there is not enough data
>>> for train, test in tscv.split(X=X):
...     print(train, test)
[0 1 2 3] [4]
- split(X, y=None, groups=None)[source]
Generates indices to split data into training and test CV folds according to rolling window time series cross validation.
- Parameters
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Must have shape method.
y (array-like, shape (n_samples,), optional) – The target variable for supervised learning problems. Always ignored, exists for compatibility.
groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set. Always ignored, exists for compatibility.
- Yields
train (
numpy.array
) – The training set indices for that split.test (
numpy.array
) – The testing set indices for that split.
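The rolling-window idea can be sketched in plain Python. This is not the library's implementation: rolling_splits and its train_size and step arguments are hypothetical names, and the argument values are chosen only so that the result matches the first documented example (20 rows, forecast_horizon=3, four splits).

```python
def rolling_splits(n_samples, forecast_horizon, train_size, step):
    """Return (train, test) index lists, oldest split first.

    Splits are anchored at the end of the data, so the last test set
    always contains the final observation.
    """
    splits = []
    test_end = n_samples  # exclusive upper bound of the current test window
    while True:
        test_start = test_end - forecast_horizon
        train_start = test_start - train_size
        if train_start < 0:
            break  # not enough history left for another full training window
        splits.append((list(range(train_start, test_start)),
                       list(range(test_start, test_end))))
        test_end -= step
    return splits[::-1]


splits = rolling_splits(n_samples=20, forecast_horizon=3, train_size=6, step=3)
for train, test in splits:
    print(train, test)
```

Working backward from the last row is one simple way to guarantee the final observation is always tested; the sketch omits capping, gaps between train and test, and expanding windows.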
- get_n_splits(X=None, y=None, groups=None)[source]
Returns the number of splitting iterations yielded by the cross-validator
- get_n_splits_without_capping(X=None)[source]
Returns the number of splitting iterations in the cross-validator as configured, ignoring self.max_splits and self.min_splits.
- Parameters
X (array-like, shape (n_samples, n_features)) – Input data to split
- Returns
n_splits – The number of splitting iterations in the cross-validator as configured, ignoring self.max_splits and self.min_splits
- Return type
- _get_offset(X=None)[source]
Returns an offset to add to test set indices when creating CV splits. CV splits are shifted so that the last test observation is the last point in X. This shift does not affect the total number of splits.
- Parameters
X (array-like, shape (n_samples, n_features)) – Input data to split
- Returns
offset – The number of observations to ignore at the beginning of X when creating CV splits
- Return type
- _sample_splits(num_splits, seed=48912)[source]
Samples up to max_splits items from list(range(num_splits)).
If use_most_recent_splits is True, the highest split indices, up to max_splits, are retained. Otherwise, the following sampling scheme is implemented:
- takes the last 2 splits
- samples from the rest uniformly at random
- Parameters
num_splits (int) – Number of splits before sampling.
seed (int) – Seed for random sampling.
- Returns
n_splits – Indices of splits to keep (subset of list(range(num_splits))).
- Return type
list
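The sampling scheme can be sketched as follows. This is an illustration under assumptions, not the library's code: the function name, the use of Python's random module, and the handling of max_splits below 2 are all choices made for this example, and max_splits is assumed to be at least 1.

```python
import random


def sample_splits(num_splits, max_splits, use_most_recent_splits=False, seed=48912):
    """Return sorted indices of the splits to keep."""
    indices = list(range(num_splits))
    if max_splits is None or num_splits <= max_splits:
        return indices  # nothing to drop
    if use_most_recent_splits:
        return indices[-max_splits:]
    keep_last = min(2, max_splits)  # always retain the most recent splits
    recent = indices[-keep_last:]
    rest = indices[:-keep_last]
    rng = random.Random(seed)       # fixed seed keeps the selection reproducible
    sampled = sorted(rng.sample(rest, max_splits - keep_last))
    return sampled + recent


kept = sample_splits(num_splits=10, max_splits=4)
```

Keeping the last two splits biases evaluation toward recent data, while the random draw from earlier splits still covers older periods.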
- _iter_test_indices(X=None, y=None, groups=None)[source]
The class directly implements split instead of providing this function.
- _iter_test_masks(X=None, y=None, groups=None)
Generates boolean masks corresponding to test sets.
By default, delegates to _iter_test_indices(X, y, groups)
Transformers
- class greykite.sklearn.transform.zscore_outlier_transformer.ZscoreOutlierTransformer(z_cutoff=None, use_fit_baseline=False)[source]
Replaces outliers in data with NaN. Outliers are determined by z-score cutoff. Columns are handled independently.
- Parameters
z_cutoff (float or None, default None) – z-score cutoff to define outliers. If None, this transformer is a no-op.
use_fit_baseline (bool, default False) –
If True, the z-scores are calculated using the mean and standard deviation of the dataset passed to fit.
If False, the transformer is stateless. z-scores are calculated for the dataset passed to transform, regardless of fit.
- mean
Mean of each column. NaNs are ignored.
- Type
- std
Standard deviation of each column. NaNs are ignored.
- Type
- _is_fitted
Whether the transformer is fitted.
- Type
bool
- fit(X, y=None)[source]
Computes the column mean and standard deviation, stored as
mean
andstd
attributes.- Parameters
X (
pandas.DataFrame
) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.
- Returns
self – Returns self.
- Return type
- transform(X)[source]
Replaces outliers with NaN.
- Parameters
X (
pandas.DataFrame
) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.- Returns
X_outlier – A copy of the data frame with original values and outliers replaced with NaN.
- Return type
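A minimal pandas sketch of z-score based outlier masking is shown below. The helper name mask_outliers and its baseline argument are hypothetical, and the use of pandas' default sample standard deviation is an assumption; the library may compute the statistics differently.

```python
import numpy as np
import pandas as pd


def mask_outliers(df, z_cutoff, baseline=None):
    """Return a copy of df with entries whose |z-score| exceeds z_cutoff set to NaN.

    If baseline is given, its column means and standard deviations are used
    (the "fit baseline" case); otherwise statistics come from df itself.
    """
    if z_cutoff is None:
        return df.copy()  # no-op
    ref = df if baseline is None else baseline
    mean = ref.mean()  # per column, NaNs are skipped
    std = ref.std()
    z = (df - mean) / std
    return df.mask(z.abs() > z_cutoff)


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0, 100.0],
    "b": [5.0, 5.5, 4.5, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0],
})
cleaned = mask_outliers(df, z_cutoff=2.0)
```

Each column is scored against its own mean and standard deviation, so the spike in column "a" is masked without affecting column "b".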
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- class greykite.sklearn.transform.normalize_transformer.NormalizeTransformer(normalize_algorithm=None, normalize_params=None)[source]
Normalizes time series data.
- Parameters
normalize_algorithm (str or None, default None) –
Which algorithm to use. Valid options are:
”MinMaxScaler” :
sklearn.preprocessing.MinMaxScaler
,”MaxAbsScaler” :
sklearn.preprocessing.MaxAbsScaler
,”StandardScaler” :
sklearn.preprocessing.StandardScaler
,”RobustScaler” :
sklearn.preprocessing.RobustScaler
,”Normalizer” :
sklearn.preprocessing.Normalizer
,”QuantileTransformer” :
sklearn.preprocessing.QuantileTransformer
,”PowerTransformer” :
sklearn.preprocessing.PowerTransformer
,
If None, this transformer is a no-op. No normalization is done.
normalize_params (dict or None, default None) – Params to initialize the normalization scaler/transformer.
- scaler
sklearn class used for normalization
- Type
class
- _is_fitted
Whether the transformer is fitted.
- Type
bool
- fit(X, y=None)[source]
Fits the normalization transform.
- Parameters
X (
pandas.DataFrame
) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.
- Returns
self – Returns self.
- Return type
- transform(X)[source]
Normalizes data using the specified scaling method.
- Parameters
X (
pandas.DataFrame
) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.- Returns
X_normalized – A normalized copy of the data frame.
- Return type
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- class greykite.sklearn.transform.null_transformer.NullTransformer(max_frac=0.1, impute_algorithm=None, impute_params=None, impute_all=True)[source]
Imputes nulls in time series data.
This transform is stateless in the sense that transform output does not depend on the data passed to fit. The dataset passed to transform is used to impute itself.
- Parameters
max_frac (float, default 0.10) – Issues a warning if the fraction of nulls is above this value.
impute_algorithm (str or None, default None) –
Which imputation algorithm to use. Valid options are:
”interpolate” :
pandas.DataFrame.interpolate
”ts_interpolate” :
impute_with_lags_multi
.
If None, this transformer is a no-op. No null imputation is done.
impute_params (dict or None, default None) –
Params to pass to the imputation algorithm. See pandas.DataFrame.interpolate and impute_with_lags_multi for their respective options.
For pandas “interpolate”, the “ffill”, “pad”, “bfill”, “backfill” methods are not allowed to avoid confusion with the fill axis parameter. Use “linear” with axis=0 instead, with direction controlled by limit_direction.
If None, uses the defaults provided in this class.
impute_all (bool, default True) –
Whether to impute all values. If True, NaNs are not allowed in the transformed result. Ignored if impute_algorithm is None.
The transform specified by impute_algorithm and impute_params may leave NaNs in the dataset. For example, this happens if it fills in the forward direction but the first value in a column is NaN.
A first pass is taken with the impute algorithm specified. A second pass is taken with the “interpolate” algorithm (method="linear", limit_direction="both") to fill in remaining NaNs.
- null_frac
The fraction of data points that are null.
- Type
int
- _is_fitted
Whether the transformer is fitted.
- Type
bool
- missing_info
Information about the missing data. Set by
transform
ifimpute_algorithm = "ts_interpolate"
.- Type
dict
- fit(X, y=None)[source]
Updates self.impute_params.
- Parameters
X (
pandas.DataFrame
) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.
- Returns
self – Returns self.
- Return type
- transform(X)[source]
Imputes missing values in input time series.
Checks the % of data points that are null, and provides warning if it exceeds
self.max_frac
.- Parameters
X (
pandas.DataFrame
) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.- Returns
X_imputed – A copy of the data frame with original values and missing values imputed
- Return type
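The null check and linear interpolation can be sketched with pandas alone. The helper impute_linear is hypothetical, and computing the null fraction over the whole frame (rather than per column) is an assumption made for this example.

```python
import warnings

import numpy as np
import pandas as pd


def impute_linear(df, max_frac=0.1):
    """Warn if too many values are null, then fill them by linear interpolation."""
    null_frac = df.isna().to_numpy().mean()  # fraction over all cells (assumption)
    if null_frac > max_frac:
        warnings.warn(f"{null_frac:.0%} of values are null, above max_frac={max_frac}")
    # limit_direction="both" also fills leading and trailing NaNs.
    return df.interpolate(method="linear", limit_direction="both", axis=0)


df = pd.DataFrame({"y": [np.nan, 1.0, np.nan, 3.0, np.nan]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    imputed = impute_linear(df)
```

Interior gaps are filled on the line between neighbors, while edge gaps take the nearest valid value, which is why a single pass with limit_direction="both" leaves no NaNs here.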
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- class greykite.sklearn.transform.drop_degenerate_transformer.DropDegenerateTransformer(drop_degenerate=False)[source]
Removes degenerate (constant) columns.
- Parameters
drop_degenerate (bool, default False) – Whether to drop degenerate columns.
- drop_cols
Degenerate columns to drop
- Type
list [str] or None
- keep_cols
Columns to keep
- Type
list [str] or None
- fit(X, y=None)[source]
Identifies the degenerate columns, and sets
self.keep_cols
andself.drop_cols
.- Parameters
X (
pandas.DataFrame
) – Training input data. e.g. each column is a timeseries. Columns are expected to be numeric.y (None) – There is no need of a target in a transformer, yet the pipeline API requires this parameter.
- Returns
self – Returns self.
- Return type
- transform(X)[source]
Drops the degenerate columns identified by fit.
- Parameters
X (
pandas.DataFrame
) – Data to transform. e.g. each column is a timeseries. Columns are expected to be numeric.- Returns
X_subset – Selected columns of X. Keeps columns that were not degenerate on the training data.
- Return type
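The fit/transform split for degenerate columns can be illustrated with pandas. The helper find_degenerate is hypothetical, and treating a column as degenerate when it has at most one distinct value is an assumption about the criterion.

```python
import pandas as pd


def find_degenerate(train_df):
    """Return (keep_cols, drop_cols) based on the training data only."""
    drop_cols = [c for c in train_df.columns
                 if train_df[c].nunique(dropna=False) <= 1]
    keep_cols = [c for c in train_df.columns if c not in drop_cols]
    return keep_cols, drop_cols


train = pd.DataFrame({"y": [1.0, 2.0, 3.0], "const": [7.0, 7.0, 7.0]})
test = pd.DataFrame({"y": [4.0, 5.0], "const": [7.0, 8.0]})

keep_cols, drop_cols = find_degenerate(train)
# The same columns are dropped at transform time, even if they vary in new data.
test_subset = test[keep_cols]
```

Deciding the column set on the training data keeps the feature layout stable between fitting and prediction.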
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
Quantile Regression
- class greykite.algo.common.l1_quantile_regression.QuantileRegression(quantile: float = 0.9, alpha: float = 0.001, sample_weight: Optional[np.typing.ArrayLike] = None, feature_weight: Optional[np.typing.ArrayLike] = None, max_iter: int = 100, tol: float = 0.01, fit_intercept: bool = True, optimize_mape: bool = False)[source]
Implements the quantile regression model.
Supports weighted sample, l1 regularization and weighted l1 regularization. These options can be configured to support different use cases. For example, specifying quantile to be 0.5 and sample weight to be the inverse absolute value of response minimizes the MAPE.
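The MAPE remark can be checked numerically. The sketch below evaluates a weighted quantile (pinball) loss without the regularization term; weighted_pinball_loss is a hypothetical helper, not the estimator's internal objective.

```python
import numpy as np


def weighted_pinball_loss(y_true, y_pred, quantile, sample_weight=None):
    """Mean weighted quantile loss of predictions y_pred against y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction is weighted by quantile, over-prediction by 1 - quantile.
    loss = np.where(diff >= 0, quantile * diff, (quantile - 1) * diff)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    return float(np.mean(sample_weight * loss))


y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0])

loss = weighted_pinball_loss(y_true, y_pred, quantile=0.5,
                             sample_weight=1 / np.abs(y_true))
mape = float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))
```

With quantile 0.5 the loss is half the absolute error, so weighting by 1/|y| gives exactly half the MAPE, and minimizing one minimizes the other.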
- fit(X: np.typing.ArrayLike, y: np.typing.ArrayLike) QuantileRegression [source]
Fits the quantile regression model.
- Parameters
X (
numpy.array
,pandas.DataFrame
orpandas.Series
) – The design matrix.y (
numpy.array
,pandas.DataFrame
orpandas.Series
) – The response vector.
- Return type
self
- predict(X: np.typing.ArrayLike) np.array [source]
Makes predictions for the given X.
- Parameters
X (
numpy.array
,pandas.DataFrame
orpandas.Series
) – The design matrix used for prediction.
- get_params(deep=True)
Get parameters for this estimator.
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape
(n_samples, n_samples_fitted)
, wheren_samples_fitted
is the number of samples used in the fitting for the estimator.y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – \(R^2\) of
self.predict(X)
wrt. y.- Return type
Notes
The \(R^2\) score used when calling
score
on a regressor usesmultioutput='uniform_average'
from version 0.23 to keep consistent with default value ofr2_score
. This influences thescore
method of all the multioutput regressors (except forMultiOutputRegressor
).
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
Hierarchical Forecast
- class greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts[source]
Reconciles forecasts to satisfy additive constraints.
Constraints can be encoded by the tree structure via levels. In the tree formulation, a parent’s value must be the sum of its children’s values.
Or, constraints can be encoded as a matrix via constraint_matrix, specifying additive expressions that must equal 0. The constraints need not have a tree representation.
Provides standard methods such as bottom up, ols, MinT. Also supports a custom method that minimizes user-specified types of error. The solution is derived by convex optimization. If desired, a constraint is added to require the transformation to be unbiased.
If not using method=”ols” or method=”bottom_up”, which don’t depend on the data, forecast reconciliation should be trained once per horizon (# periods between forecasted date and train_end_date), because the optimal adjustment may differ.
- forecasts
Original forecasted values, used to train the method. Also known as “base” forecasts. Long format where each column is a time series and each row is a time step. For proper variance estimates for the variance penalty, values should be at a fixed horizon (e.g. always 7-step ahead).
- Type
pandas.DataFrame
, shape (n, m)
- actuals
Actual values to train the method, corresponding to
forecasts
. Must have the same shape and column names asforecasts
.- Type
pandas.DataFrame
, shape (n, m)
- constraint_matrix
Constraints. c x m array encoding c constraints of m variables. We require constraint_matrix @ transform_matrix = 0. For example, to encode -x1 + x2 + x3 == 0 and -x2 + x4 + x5 == 0:
    constraint_matrix = np.array([
        [-1, 1, 1, 0, 0],
        [0, -1, 0, 1, 1]])
Entries are typically in [-1, 0, 1], but this is not required. Either constraint_matrix or levels must be provided.
- Type
numpy.array
, shape (c, m), or None
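The requirement constraint_matrix @ transform_matrix = 0 can be verified by hand for the example constraints. The bottom-up matrix below is written out manually for this hierarchy (leaves x3, x4, x5) and the base forecast values are made up; neither is produced by the library here.

```python
import numpy as np

constraint_matrix = np.array([
    [-1, 1, 1, 0, 0],   # -x1 + x2 + x3 == 0
    [0, -1, 0, 1, 1],   # -x2 + x4 + x5 == 0
])

# Hand-written bottom-up transform: every node becomes the sum of its leaves.
bottom_up = np.array([
    [0, 0, 1, 1, 1],    # x1 = x3 + x4 + x5
    [0, 0, 0, 1, 1],    # x2 = x4 + x5
    [0, 0, 1, 0, 0],    # x3 (leaf)
    [0, 0, 0, 1, 0],    # x4 (leaf)
    [0, 0, 0, 0, 1],    # x5 (leaf)
])

# Zero product: any vector mapped through bottom_up satisfies the constraints.
product = constraint_matrix @ bottom_up

base = np.array([100.0, 55.0, 40.0, 30.0, 20.0])  # hypothetical, inconsistent forecasts
adjusted = bottom_up @ base
```

Because the product is the zero matrix, the adjusted values satisfy both constraints regardless of how inconsistent the base forecasts were.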
- levels
A simpler way to encode tree constraints. Overrides constraint_matrix if provided. Specifies the number of children of each parent (internal) node in the tree. The number of inner lists is the height of the tree. The ith inner list provides the number of children of each node at depth i. For example:
    # root node with 3 children
    levels = [[3]]
    # root node with 3 children, who have 2, 3, 3 children respectively
    levels = [[3], [2, 3, 3]]
All leaf nodes must have the same depth. Thus, the first sublist must have one integer, the length of a sublist must equal the sum of the previous sublist, and all integers in levels must be positive.
Either constraint_matrix or levels must be provided.
- Type
list [list [int]] or None
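The validity rules for levels, and how a tree maps to additive constraints, can be sketched in plain Python. Both helpers are hypothetical and assume nodes are numbered in BFS order starting from the root at index 0; the library's internal representation may differ.

```python
def validate_levels(levels):
    """Check the rules for levels and return the total number of nodes."""
    if any(n <= 0 for level in levels for n in level):
        raise ValueError("All integers in levels must be positive.")
    if len(levels[0]) != 1:
        raise ValueError("The first sublist must have exactly one integer.")
    for prev, cur in zip(levels, levels[1:]):
        if len(cur) != sum(prev):
            raise ValueError("Sublist length must equal the sum of the previous sublist.")
    return 1 + sum(sum(level) for level in levels)  # root plus all children


def levels_to_constraints(levels):
    """Build one row per parent: -1 for the parent, +1 for each of its children."""
    n_nodes = validate_levels(levels)
    rows = []
    parent_start = 0
    for level in levels:
        child = parent_start + len(level)  # BFS index of this depth's first child
        for j, n_children in enumerate(level):
            row = [0] * n_nodes
            row[parent_start + j] = -1
            for c in range(child, child + n_children):
                row[c] = 1
            rows.append(row)
            child += n_children
        parent_start += len(level)
    return rows


rows = levels_to_constraints([[3], [2, 3, 3]])
```

Each row has the same form as a row of constraint_matrix, which shows why levels is a shorthand for the tree-shaped special case.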
- order_dict
How to order the columns before fitting. The key is the column name, the value is its position. When levels is used, map each column name to the order of its corresponding node in a BFS traversal of the tree. When constraint_matrix is used, this shuffles the order of the columns before the constraints are applied (thus, columns in constraint_matrix refer to the columns after reordering).
If None, no reordering is done.
- Type
dict [str, float] or None
- method
Which reconciliation method to use. Valid values are “bottom_up”, “ols”, “mint_sample”, “custom”:
- “bottom_up”: Sums leaf nodes. Unbiased transform that uses only the values of the leaf nodes to propagate up the tree. Each node’s value is the sum of its corresponding leaf nodes’ values (a leaf node corresponds to a node T if it is a leaf node of the subtree with T as its root, i.e. a descendant of T or T itself). See Dangerfield and Morris 1992 “Top-down or bottom-up: Aggregate versus disaggregate extrapolations” for one discussion of this method. Depends only on the structure of the hierarchy, not on the data itself.
- “ols”: OLS estimate proposed by https://robjhyndman.com/papers/Hierarchical6.pdf (Hyndman et al. 2010, “Optimal combination forecasts for hierarchical time series”). Also see https://robjhyndman.com/papers/mint.pdf section 2.4.1 (Wickramasuriya et al. 2019, “Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization”). Unbiased transform that minimizes variance of adjusted residuals, using “identity” estimate of original residual variance. Optimal if original forecast errors are uncorrelated with equal variance (unlikely). Depends only on the structure of the hierarchy, not on the data itself.
- “mint_sample”: Unbiased transform that minimizes variance of adjusted residuals, using “sample” estimate of original residual variance. Assumes base forecasts are unbiased. See Wickramasuriya et al. 2019 section 2.4.4. Depends on the structure of the hierarchy and forecast error covariances.
- “custom”: Optimization parameters can be set by the user. See the greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts.fit method for parameters and their default values. Depends on the structure of the hierarchy, base forecasts, and actuals, if all terms are included in the objective.
If “custom”, uses the parameters passed to greykite.algo.reconcile.convex.reconcile_forecasts.ReconcileAdditiveForecasts.fit to formulate the convex optimization problem.
If “bottom_up”, “ols”, or “mint_sample”, the other fit parameters are ignored.
- Type
str
- lower_bound
Lower bound on each entry of
transform_matrix
. If None, no lower bound is applied.- Type
float or None
- upper_bound
Upper bound on each entry of
transform_matrix
. If None, no upper bound is applied.- Type
float or None
- unbiased
Whether the resulting transformation must be unbiased.
- Type
bool
- lam_adj
Weight for the adjustment penalty. The adjustment penalty is the mean squared difference between adjusted forecasts and base forecasts.
- Type
float
- lam_bias
Weight for the bias penalty. The bias penalty is the mean squared difference between adjusted actuals and actuals. For an unbiased transformation (
unbiased=True
), the bias penalty is 0 so this has no effect.- Type
float
- lam_train
Weight for the training MSE penalty. The train MSE penalty measures the mean squared difference between adjusted forecasts and actuals.
- Type
float
- lam_var
Weight for the variance penalty. The variance penalty measures the variance of adjusted forecast errors for an unbiased transformation. It is reported as the average of the variances across timeseries. It is based on the variance-covariance matrix of base forecast errors, covariance. For biased transforms, this is an underestimate of the true variance.
- Type
float
- covariance
Variance-covariance matrix of base forecast errors. Used to compute the variance penalty.
If a numpy.array, row/column i corresponds to the ith column after reordering by order_dict. Should be reported on the original scale of the data.
If “sample”, the sample covariance of residuals assuming base forecasts are unbiased. Unlike numpy.cov, does not mean center the residuals, and divides by n instead of n-1.
If “identity”, the identity matrix.
- Type
numpy.array
of shape (m, m), or “sample” or “identity”
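The difference between the “sample” estimate described above and numpy.cov can be shown on a small made-up dataset. The arrays below are hypothetical; the formula follows the description (no mean centering, division by n).

```python
import numpy as np

# Hypothetical forecasts and actuals: n = 4 time steps, m = 2 series.
forecasts = np.array([[10.0, 4.0], [12.0, 5.0], [11.0, 7.0], [13.0, 6.0]])
actuals = np.array([[11.0, 5.0], [11.0, 4.0], [13.0, 8.0], [12.0, 5.0]])

residuals = forecasts - actuals  # shape (n, m)
n = residuals.shape[0]

# "sample" estimate: residuals are assumed to have zero mean, so no centering.
sample_cov = residuals.T @ residuals / n

# numpy.cov centers each column and divides by n - 1.
numpy_cov = np.cov(residuals, rowvar=False)
```

Skipping the centering step treats any systematic error as variance, which is consistent with the stated assumption that base forecasts are unbiased.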
- weight_adj
Weight for the adjustment penalty that allows a different weight per-timeseries.
If a numpy array/list, values specify the weight for each forecast after reordering by
order_dict
.If “MedAPE”, proportional to the MedAPE of the forecast.
If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast. This can be useful to penalize adjustment to base forecasts that are already accurate.
If None, the identity matrix (equal weights).
- Type
numpy.array
or list [float] of length m or “MedAPE” or “InverseMedAPE” or None
- weight_bias
Weight for the bias penalty that allows a different weight per-timeseries.
If a numpy array/list, values specify the weight for each forecast after reordering by
order_dict
.If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.
If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.
If None, the identity matrix (equal weights).
For an unbiased transformation (
unbiased=True
), the bias penalty is 0 so this has no effect.- Type
numpy.array
or list [float] of length m or “MedAPE” or “InverseMedAPE” or None
- weight_train
Weight for the train MSE penalty that allows a different weight per-timeseries.
If a numpy array/list, values specify the weight for each forecast after reordering by
order_dict
.If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.
If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.
If None, the identity matrix (equal weights).
- Type
numpy.array
or list [float] of length m or “MedAPE” or “InverseMedAPE” or None
- weight_var
Weight for the variance penalty that allows a different weight per-timeseries.
If a numpy array/list, values specify the weight for each forecast after reordering by
order_dict
.If “MedAPE”, proportional to the MedAPE of the forecast. This can be useful to focus more on improving the base forecasts with high error.
If “InverseMedAPE”, proportional to 1 / MedAPE of the forecast.
If None, the identity matrix (equal weights).
- Type
numpy.array
or list [float] of length m or “MedAPE” or “InverseMedAPE” or None
- names
Names of
forecast
columns after reordering byorder_dict
.- Type
- tree
If
levels
is provided, represents the tree structure encoded by the levels. Else None.
- transform_variable
Optimization variable to learn the transform matrix. None if a rule-based method is used, e.g.
method == bottom_up
- Type
cvxpy.Variable
, shape (m, m) or None
- transform_matrix
Transformation matrix. Same as transform_variable.value, unless the solver failed to find a solution and a backup value is used. Adjusted forecasts are computed by applying the transform from the left to the reordered and transposed forecasts. See transform in this class.
- Type
numpy.array
, shape (m, m)
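Applying the transform "from the left to the transposed forecasts" amounts to one matrix product. In this sketch the column names, forecast values, and the hand-written bottom-up matrix for total = a + b are all hypothetical, and the columns are assumed to be in their final order already.

```python
import numpy as np
import pandas as pd

# Hypothetical base forecasts: rows are time steps, columns are series.
forecasts = pd.DataFrame({
    "total": [100.0, 110.0],
    "a": [55.0, 60.0],
    "b": [40.0, 45.0],
})

# Hand-written bottom-up transform enforcing total = a + b.
transform_matrix = np.array([
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Transpose to (m, n), apply the (m, m) transform, transpose back to (n, m).
adjusted = pd.DataFrame(
    (transform_matrix @ forecasts.to_numpy().T).T,
    index=forecasts.index,
    columns=forecasts.columns,
)
```

The transposes are needed only because the data frame stores one series per column while the transform acts on one vector of series values per time step.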
- prob
Convex optimization problem.
- Type
cvxpy.Problem
- is_optimization_solution
Whether transform_matrix is a solution found by convex optimization. If False, then transform_matrix may be set to a backup value (bottom up transform). Check prob.status for more details about solver status.
- Type
bool
- objective_fn
Evaluates the objective function for a given transform matrix and dataset. Takes
transform_matrix, forecast_matrix (optional), actual_matrix (optional)
. Return value has same format asobjective_fn_val
. If forecast_matrix/actual_matrix are not provided, uses the fitting datasets.- Type
callable
- objective_fn_val
Dictionary containing the objective value, and its components, as evaluated on the training set for the identified optimal solution from convex optimization. Keys are:
- "adj": adjustment size
- "bias": bias of estimator
- "train": train set MSE
- "var": variance of unbiased estimator
- "total": sum of the above
- Type
dict [str, float]
- objective_weights
Weights used in the objective function, derived from covariance, weight_*, forecasts, actuals. Keys are:
- weight_adj
- weight_bias
- weight_train
- weight_var
- covariance
- Type
dict [str,
np.array
of shape (m, m)]
- adjusted_forecasts
Adjusted
forecasts
that satisfy the constraints.- Type
pandas.DataFrame
, shape (n, m)
- constraint_violation
The normalized constraint violations on training set. Keys are “actual”, “forecast”, and “adjusted”. Root mean squared constraint violation is divided by root mean squared actual value.
- Type
dict [str, float]
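One way to read the normalization is sketched below. The helper normalized_violation and the data are hypothetical, and averaging the squared violations over all constraints and time steps is an assumption; the library's exact aggregation may differ.

```python
import numpy as np


def normalized_violation(values, actuals, constraint_matrix):
    """RMS constraint violation of `values`, divided by the RMS of `actuals`."""
    violation = constraint_matrix @ values.T  # shape (c, n)
    rms_violation = np.sqrt(np.mean(violation ** 2))
    rms_actual = np.sqrt(np.mean(actuals ** 2))
    return float(rms_violation / rms_actual)


constraint_matrix = np.array([[-1.0, 1.0, 1.0]])  # -total + a + b == 0

# Hypothetical data: actuals are consistent, base forecasts are off by 5.
actuals = np.array([[90.0, 50.0, 40.0], [100.0, 60.0, 40.0]])
forecasts = np.array([[100.0, 55.0, 40.0], [110.0, 60.0, 45.0]])

violation_actual = normalized_violation(actuals, actuals, constraint_matrix)
violation_forecast = normalized_violation(forecasts, actuals, constraint_matrix)
```

Dividing by the scale of the actuals makes the violation unitless, so values from series of different magnitude can be compared.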
- evaluation_df
DataFrame of evaluation results on training set. Rows are timeseries, columns are metrics. See
evaluate
in this class.- Type
pandas.DataFrame
, shape (m, # metrics)
- figures
Plotly figures to visualize evaluation results on training set. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.
- Type
dict [str,
plotly.graph_objects.Figure
] or None
- forecasts_test
Forecasted values to test the method. Long format where each column is a time series and each row is a time step. Must have the same column names as
forecasts
. Can have a different number of rows (observations).- Type
pandas.DataFrame
, shape (q, m)
- actuals_test
Actual values to test the method. Must have the same shape and column names as
forecasts_test
.- Type
pandas.DataFrame
, shape (q, m)
- adjusted_forecasts_test
Adjusted
forecasts_test
that satisfy the constraints.- Type
pandas.DataFrame
, shape (q, m)
- constraint_violation_test
The normalized constraint violations on test set. Keys are “actual”, “forecast”, and “adjusted”. Root mean squared constraint violation is divided by root mean squared actual value on test set.
- Type
dict [str, float]
- evaluation_df_test
DataFrame of evaluation results on test set. Rows are timeseries, columns are metrics. See
evaluate()
in this class.- Type
pandas.DataFrame
, shape (m, # metrics)
- figures_test
Plotly figures to visualize evaluation results on test set. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.
- Type
dict [str,
plotly.graph_objects.Figure
] or None
- fit : callable
Fits the transform_matrix from training data.
- transform : callable
Adjusts a forecast to satisfy additive constraints using the transform_matrix.
- evaluate : callable
Evaluates the adjustment quality by its impact on MAPE, MedAPE, and RMSE.
- fit_transform : callable
Fits and transforms the training data.
- fit_transform_evaluate : callable
Fits, transforms, and evaluates on training data.
- transform_evaluate : callable
Transforms and evaluates on a new test set.
- fit(forecasts, actuals, order_dict=None, method='custom', levels=None, constraint_matrix=None, lower_bound=None, upper_bound=None, unbiased=True, lam_adj=1.0, lam_bias=1.0, lam_train=1.0, lam_var=1.0, covariance='sample', weight_adj=None, weight_bias=None, weight_train=None, weight_var=None, **solver_kwargs)[source]
Fits the transform_matrix based on input data, constraint, and objective function.
Sets the attributes between forecasts and objective_weights as noted in the class description, inclusive, including transform_matrix, transform_variable, prob, objective_fn_val.
If method != “bottom_up” and there is no solution, gives a warning and self.is_optimization_solution is set to False. Uses “bottom_up” solution as fallback approach if levels is provided.
- Parameters
forecasts (pandas.DataFrame, shape (n, m)) – See attributes of ReconcileAdditiveForecasts.
actuals (pandas.DataFrame, shape (n, m)) – See attributes of ReconcileAdditiveForecasts.
order_dict (dict [str, float] or None, default None) – See attributes of ReconcileAdditiveForecasts.
method (str, default DEFAULT_METHOD) – See attributes of ReconcileAdditiveForecasts. If provided, the parameters from lower_bound to weight_var below are ignored.
levels (list [list [int]] or None, default None) – See attributes of ReconcileAdditiveForecasts.
constraint_matrix (numpy.array, shape (c, m) or None, default None) – See attributes of ReconcileAdditiveForecasts.
lower_bound (float or None, default None) – See attributes of ReconcileAdditiveForecasts.
upper_bound (float or None, default None) – See attributes of ReconcileAdditiveForecasts.
unbiased (bool, default True) – See attributes of ReconcileAdditiveForecasts.
lam_adj (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_bias (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_train (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
lam_var (float, default 1.0) – See attributes of ReconcileAdditiveForecasts.
covariance (numpy.array of shape (m, m), or “sample” or “identity”, default “sample”) – See attributes of ReconcileAdditiveForecasts.
weight_adj (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_bias (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_train (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
weight_var (numpy.array or list [float] of length m or “MedAPE” or “InverseMedAPE” or None, default None) – See attributes of ReconcileAdditiveForecasts.
solver_kwargs (dict) – Specify the CVXPY solver and parameters. E.g. dict(verbose=True). See https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options.
- Returns
transform_matrix – Transformation matrix. Same as transform_variable.value, unless the solver failed to find a solution, in which case a backup value is used. Adjusted forecasts are computed by applying the transform from the left to reordered and transposed forecasts. See transform() in this class.
- Return type
numpy.array, shape (m, m)
- transform(forecasts_test=None)[source]
Transforms the provided forecasts using the fitted self.transform_matrix.
- Parameters
forecasts_test (pandas.DataFrame, shape (r, m) or None) – Forecasted values to transform. Must have the same columns as self.forecasts. If None, uses self.forecasts.
- Returns
adjusted_forecasts (pandas.DataFrame, shape (r, m)) – Adjusted forecasts that satisfy additive constraints. Columns are reordered according to self.order_dict.
If forecasts_test is None, results are stored to self.adjusted_forecasts.
Else, results are stored to self.adjusted_forecasts_test, and the provided forecasts_test to self.forecasts_test.
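The phrase "applying the transform from the left to transposed forecasts" can be illustrated with plain NumPy, independent of the class. This is a minimal sketch under stated assumptions: the 3-node tree (one parent, two children), the forecast values, and the use of a bottom-up matrix as `transform_matrix` are all chosen for illustration and are not the output of `fit`.

```python
import numpy as np

# Hypothetical base forecasts: n=2 time steps (rows), m=3 series (columns).
# Column 0 is the parent; columns 1 and 2 are its children.
forecasts = np.array([
    [10.0, 4.0, 5.0],
    [12.0, 6.0, 7.0],
])

# Assumed (m, m) transform: the bottom-up matrix for this tree, which
# replaces the parent by the sum of its children and keeps the children.
transform_matrix = np.array([
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Apply from the left to the transposed forecasts (series as rows),
# then transpose back to the original (n, m) layout.
adjusted_forecasts = (transform_matrix @ forecasts.T).T

# Additive constraint: -parent + child_1 + child_2 == 0 at every time step.
constraint_matrix = np.array([[-1.0, 1.0, 1.0]])
violation = constraint_matrix @ adjusted_forecasts.T
```

After the transform, the parent column equals the sum of the child columns, so `violation` is zero at every time step.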
- evaluate(is_train, actuals_test=None, ipython_display=False, plot=False, plot_num_cols=3)[source]
Evaluates the adjustment quality. Computes the following metrics for each of the m timeseries:
“Base MAPE” : MAPE of base forecasts
“Base MedAPE” : MedAPE of base forecasts
“Base RMSE” : RMSE of base forecasts
“Adjusted MAPE” : MAPE of adjusted forecasts
“Adjusted MedAPE” : MedAPE of adjusted forecasts
“Adjusted RMSE” : RMSE of adjusted forecasts
“RMSE % change” : (Adjusted RMSE) / (Base RMSE) - 1
“MAPE pp change” : (Adjusted MAPE) - (Base MAPE)
“MedAPE pp change” : (Adjusted MedAPE) - (Base MedAPE)
“pp change” refers to percentage point change (difference in %).
Must call fit and transform before calling this method.
- Parameters
is_train (bool) – Whether to evaluate on training set or test set. If True, evaluates training adjustment quality. Else, evaluates test adjustment quality; in that case, actuals_test must be provided.
actuals_test (pandas.DataFrame) – Actual values on test set, required if is_train==False. Must have the same shape as the forecasts passed to transform(), i.e. self.forecasts_test.shape.
ipython_display (bool, default False) – Whether to display the evaluation statistics.
plot (bool, default False) – Whether to display the evaluation plots.
plot_num_cols (int, default 3) – Number of columns in the plot. This is the number of timeseries to plot in each row.
- Returns
evaluation_result (dict [str, dict or pandas.DataFrame]) –
"constraint_violation" : dict [str, float]
The normalized constraint violations. Keys are “actual”, “forecast”, and “adjusted”. The value is root mean squared constraint violation divided by root mean squared actual value. Constraint violation of actuals should be close to 0.
"evaluation_df" : pandas.DataFrame, shape (m, # metrics)
Evaluation results. DataFrame with one row for each timeseries, and a column for each metric listed above.
"figures" : dict [str, plotly.graph_objects.Figure]
Plotly figures to visualize evaluation results. Keys are: “base_adj” (base vs adjusted forecast), “adj_size” (adjustment size %), “error” (% error). Each figure contains multiple subplots, one for each timeseries.
If is_train, results are stored to self.constraint_violation, self.evaluation_df. Otherwise, they are stored to self.constraint_violation_test, self.evaluation_df_test.
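The metric definitions listed for evaluate can be sketched for a single timeseries with the standard library. This is an illustration only: the function names and sample values are hypothetical, MAPE and MedAPE are assumed to be expressed in percent, and "RMSE % change" follows the formula as written above (a ratio minus 1), which may be scaled differently in the library's output.

```python
import math
from statistics import mean, median

def ape(actual, forecast):
    """Absolute percentage errors, in percent."""
    return [abs(f - a) / abs(a) * 100.0 for a, f in zip(actual, forecast)]

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(mean([(f - a) ** 2 for a, f in zip(actual, forecast)]))

def evaluate_series(actual, base, adjusted):
    """Computes the base/adjusted metrics and their changes for one series."""
    base_ape, adj_ape = ape(actual, base), ape(actual, adjusted)
    result = {
        "Base MAPE": mean(base_ape),
        "Base MedAPE": median(base_ape),
        "Base RMSE": rmse(actual, base),
        "Adjusted MAPE": mean(adj_ape),
        "Adjusted MedAPE": median(adj_ape),
        "Adjusted RMSE": rmse(actual, adjusted),
    }
    result["RMSE % change"] = result["Adjusted RMSE"] / result["Base RMSE"] - 1
    # "pp change" is a difference between two percentages.
    result["MAPE pp change"] = result["Adjusted MAPE"] - result["Base MAPE"]
    result["MedAPE pp change"] = result["Adjusted MedAPE"] - result["Base MedAPE"]
    return result

# Hypothetical values for one timeseries.
metrics = evaluate_series(
    actual=[100.0, 200.0],
    base=[110.0, 180.0],
    adjusted=[105.0, 190.0],
)
```

Negative change values indicate that the adjustment reduced the error for that timeseries.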
- fit_transform(forecasts, actuals, **fit_kwargs)[source]
Fits and transforms training data.
- Parameters
forecasts (pandas.DataFrame) – Forecasts to fit the adjustment. See fit.
actuals (pandas.DataFrame) – Actuals to fit the adjustment. See fit.
fit_kwargs (dict, optional) – Additional parameters to pass to fit.
- Returns
adjusted_forecasts – Adjusted forecasts.
- Return type
- fit_transform_evaluate(forecasts, actuals, fit_kwargs=None, evaluate_kwargs=None)[source]
Fits, transforms, and evaluates on training data.
- Parameters
forecasts (pandas.DataFrame) – Forecasts to fit the adjustment. See fit.
actuals (pandas.DataFrame) – Actuals to fit the adjustment. See fit.
fit_kwargs (dict, optional, default None) – Additional parameters to pass to fit.
evaluate_kwargs (dict, optional, default None) – Additional parameters to pass to evaluate.
- Returns
evaluation_df – Evaluation results on provided forecasts.
- Return type
- transform_evaluate(forecasts_test, actuals_test, **evaluate_kwargs)[source]
Transforms and evaluates on test data.
Must call fit before calling this method.
- Parameters
forecasts_test (pandas.DataFrame) – Forecasts to make consistent. Should be different from the training data.
actuals_test (pandas.DataFrame) – Actuals to check quality of the adjustment.
evaluate_kwargs (dict, optional, default None) – Additional parameters to pass to evaluate.
- Returns
evaluation_df_test – Evaluation results on provided forecasts_test.
- Return type
- forecasts_test
- plot_transform_matrix(color_continuous_scale='RdBu', zmin=-1.5, zmax=1.5, **kwargs)[source]
Plots the transform matrix visually, as a grid. By default, negative values are red and positive values are blue.
- Parameters
color_continuous_scale (str or list [str], default “RdBu”) – Colormap used to map scalar data to colors. See plotly.express.imshow.
zmin (scalar or iterable, default -1.5) – The minimum value covered by the colormap. See plotly.express.imshow.
zmax (scalar or iterable, default 1.5) – The maximum value covered by the colormap. See plotly.express.imshow.
kwargs (keyword arguments) – Additional keyword arguments for plotly.express.imshow.
- Returns
fig – The transform matrix plot object.
- Return type
- class greykite.algo.reconcile.hierarchical_relationship.HierarchicalRelationship(levels)[source]
Represents hierarchical relationships between nodes (time series).
Nodes are indexed by their position in the tree, in breadth-first search (BFS) order. Matrix attributes such as bottom_up_transform are applied from the left against tree values, represented as a 2D numpy.array with the values of each node as a row.
- levels
Specifies the number of children of each parent (internal) node in the tree. The number of inner lists is the height of the tree. The ith inner list provides the number of children of each node at depth i. For example:
    # root node with 3 children
    levels = [[3]]
    # root node with 3 children, who have 2, 3, 3 children respectively
    levels = [[3], [2, 3, 3]]
    # These children are ordered from "left" to "right", so that the one with
    # 2 children is the first in the 2nd level.
    # This will be used as our running example.
    #              0                 # level 0
    #      1       2         3       # level 1
    #     4 5    6 7 8    9 10 11    # level 2
All leaf nodes must have the same depth. Thus, the first sublist must have one integer, the length of a sublist must equal the sum of the previous sublist, and all integers in levels must be positive.
- Type
list [list [int]] or None
- num_children_per_parent
Flattened version of
levels
. The number of children for each parent (internal) node. [3, 2, 3, 3] in our example.- Type
list [int]
- num_internal_nodes
The number of internal (parent) nodes (i.e. with children). 4 in our example.
- Type
int
- num_leaf_nodes
The number of leaf nodes (i.e. without children). 8 in our example.
- Type
int
- num_nodes
The total number of nodes. 12 in our example.
- Type
int
- nodes_per_level
The number of nodes at each level of the tree. [1, 3, 8] in our example.
- Type
list [int]
- starting_index_per_level
The index of the first node in each level. [0, 1, 4] in our example.
- Type
list [int]
- starting_child_index_per_parent
For each parent node, the index of its first child. [1, 4, 6, 9] in our example.
- Type
list [int]
- sum_matrix
Sum matrix used to compute values of all nodes from the leaf nodes. When applied to a matrix with the values for leaf nodes, returns values for every node by bubbling up leaf node values to the internal nodes. A node’s value is equal to the sum of its corresponding leaf nodes’ values.
Y_{all} = sum_matrix @ Y_{leaf}
In our example:
    # 4   5   6   7   8   9   10  11    (leaf nodes)
    [[1., 1., 1., 1., 1., 1., 1., 1.],  # 0
     [1., 1., 0., 0., 0., 0., 0., 0.],  # 1
     [0., 0., 1., 1., 1., 0., 0., 0.],  # 2
     [0., 0., 0., 0., 0., 1., 1., 1.],  # 3
     [1., 0., 0., 0., 0., 0., 0., 0.],  # 4
     [0., 1., 0., 0., 0., 0., 0., 0.],  # 5
     [0., 0., 1., 0., 0., 0., 0., 0.],  # 6
     [0., 0., 0., 1., 0., 0., 0., 0.],  # 7
     [0., 0., 0., 0., 1., 0., 0., 0.],  # 8
     [0., 0., 0., 0., 0., 1., 0., 0.],  # 9
     [0., 0., 0., 0., 0., 0., 1., 0.],  # 10
     [0., 0., 0., 0., 0., 0., 0., 1.]]  # 11  (all nodes)
- Type
numpy.array, shape (self.num_nodes, self.num_leaf_nodes)
- leaf_projection_matrix
Projection matrix to get leaf nodes. When applied to a matrix with the values for all nodes, the projection matrix selects only the rows corresponding to leaf nodes.
Y_{leaf} = leaf_projection_matrix @ Y_{actual}
In our example:
    # 0   1   2   3   4   5   6   7   8   9   10  11    (all nodes)
    [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],  # 4
     [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],  # 5
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],  # 6
     [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],  # 7
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],  # 8
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],  # 9
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],  # 10
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]  # 11  (leaf nodes)
- Type
numpy.array, shape (self.num_leaf_nodes, self.num_nodes)
- bottom_up_transform
Bottom-up transformation matrix. When applied to a matrix with the values for all nodes, returns values for every node by bubbling up leaf node values to the internal nodes. The original values of internal nodes are ignored.
Y_{bu} = bottom_up_transform @ Y_{actual}
Note that bottom_up_transform = sum_matrix @ leaf_projection_matrix. In our example:
    # 0   1   2   3   4   5   6   7   8   9   10  11    (all nodes)
    [[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.],  # 0
     [0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.],  # 1
     [0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.],  # 2
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],  # 3
     [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],  # 4
     [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],  # 5
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],  # 6
     [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],  # 7
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],  # 8
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],  # 9
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],  # 10
     [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]  # 11  (all nodes)
- Type
numpy.array, shape (self.num_nodes, self.num_nodes)
- constraint_matrix
Constraint matrix representing hierarchical additive constraints, where a parent’s value is equal to the sum of its leaf nodes’ values.
constraint_matrix @ Y_{all} = 0 if Y_{all} satisfies the constraints. In our example:
    #  0    1    2    3    4    5    6    7    8    9    10   11    (all nodes)
    [[-1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],  # 0
     [ 0., -1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],  # 1
     [ 0.,  0., -1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.],  # 2
     [ 0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.]]  # 3  (internal nodes)
- Type
numpy.array, shape (self.num_internal_nodes, self.num_nodes)
- get_level_of_node : callable
Returns a node’s level in the tree
- get_child_nodes : callable
Returns the indices of a node’s children in the tree
- __set_sum_matrix : callable
Constructs the summing matrix to compute values of all nodes from the leaf nodes.
- __set_leaf_projection_matrix : callable
Constructs leaf projection matrix to retain only values of leaf nodes.
- __set_constraint_matrix : callable
Constructs constraint matrix that requires each parent’s value to be the sum of its leaf nodes’ values.
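The relationships between the matrix attributes described above can be reproduced with a short NumPy sketch. The helper `build_hierarchy_matrices` is hypothetical and is not the class's implementation; it only assumes the BFS node ordering and the `levels` format documented for this class.

```python
import numpy as np

def build_hierarchy_matrices(levels):
    """Builds the four matrices for a tree given in `levels` format (hypothetical helper)."""
    num_children = [c for level in levels for c in level]  # flattened levels
    num_internal = len(num_children)
    num_nodes = 1 + sum(num_children)  # root plus every child
    num_leaf = num_nodes - num_internal
    # In BFS order, the first child of parent p follows all earlier children.
    first_child = [1 + sum(num_children[:p]) for p in range(num_internal)]

    sum_matrix = np.zeros((num_nodes, num_leaf))
    # Leaf rows: each leaf maps to itself.
    for node in range(num_internal, num_nodes):
        sum_matrix[node, node - num_internal] = 1.0
    # Internal rows: sum of the children's rows. Children have larger
    # indices than their parent, so iterate in reverse order.
    for parent in reversed(range(num_internal)):
        start = first_child[parent]
        stop = start + num_children[parent]
        sum_matrix[parent] = sum_matrix[start:stop].sum(axis=0)

    leaf_projection = np.hstack(
        [np.zeros((num_leaf, num_internal)), np.eye(num_leaf)])
    bottom_up = sum_matrix @ leaf_projection
    # Each internal node's bottom-up value minus its own value must be 0.
    constraint = (bottom_up - np.eye(num_nodes))[:num_internal]
    return {
        "sum_matrix": sum_matrix,
        "leaf_projection_matrix": leaf_projection,
        "bottom_up_transform": bottom_up,
        "constraint_matrix": constraint,
    }

# The running example from the class description.
matrices = build_hierarchy_matrices([[3], [2, 3, 3]])
leaf_values = np.arange(1.0, 9.0).reshape(8, 1)  # one time step, 8 leaves
all_values = matrices["sum_matrix"] @ leaf_values
residual = matrices["constraint_matrix"] @ all_values
```

Because `all_values` is built by summing leaves, it satisfies the additive constraints and `residual` is zero.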
Utility Functions
Functions to generate derived time features useful in forecasting, such as growth, seasonality, holidays.
- greykite.common.features.timeseries_features.convert_date_to_continuous_time(dt)[source]
Converts date to continuous time. Each year is one unit.
- Parameters
dt (datetime object) – the date to convert
- Returns
conti_date – the date represented in years
- Return type
float
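One way to express a date in years, with each year as one unit, is to add the elapsed fraction of the year to the calendar year. The function below is a sketch under that assumption, not the library's formula, which may treat leap years or time of day differently; `to_continuous_year` is a hypothetical name.

```python
from datetime import datetime

def to_continuous_year(dt):
    """Returns `dt` as a float in years, e.g. mid-2018 maps to about 2018.5."""
    year_start = datetime(dt.year, 1, 1)
    next_year_start = datetime(dt.year + 1, 1, 1)
    # Dividing two timedeltas gives the elapsed fraction of the year.
    fraction = (dt - year_start) / (next_year_start - year_start)
    return dt.year + fraction

start_of_year = to_continuous_year(datetime(2018, 1, 1))
mid_year = to_continuous_year(datetime(2018, 7, 2, 12))  # 182.5 days into 2018
```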
- greykite.common.features.timeseries_features.get_default_origin_for_time_vars(df, time_col)[source]
Sets default value for origin_for_time_vars
- Parameters
df (pandas.DataFrame) – Training data. A data frame which includes the timestamp and value columns.
time_col (str) – The column name in df representing time for the time series data.
- Returns
dt_continuous_time – The time origin used to create continuous variables for time
- Return type
float
- greykite.common.features.timeseries_features.pytz_is_dst_fcn(time_zone)[source]
For a given time zone, constructs a function which determines, for each timestamp in a list, whether it is inside the daylight saving time (DST) period.
This function should work for regions in US / Canada and Europe.
The returned function assumes that the timestamps are in the given time_zone. Note that since daylight saving is the same for all of mainland US / Canada, one can pass any US time zone, e.g. "US/Pacific", to construct a function which works for all of mainland US. Similarly, for most of Europe it is sufficient to pass any Europe time zone, e.g. "Europe/London".
Note: Since this function is slow, a faster version is available: is_dst_fcn. However, we expect the current function to be more accurate, assuming the package pytz keeps up to date with potential changes in DST.
- Parameters
time_zone (str) – A string denoting the time zone, e.g. “US/Pacific”, “Canada/Eastern”, “Europe/London”.
- Returns
is_dst – A function which takes a list of datetime-like objects and returns a list of booleans indicating whether each timestamp is in daylight saving time.
- Return type
callable
- greykite.common.features.timeseries_features.get_us_dst_start(year)[source]
For each year, it returns the second Sunday in March, which is the start of the daylight saving (DST) in US/Canada.
We assume DST starts on Second Sunday of March at 2 a.m.
- Parameters
year (int) – Year for which DST start date is desired.
- Returns
result – The timestamp of start of DST in US/Canada.
- Return type
- greykite.common.features.timeseries_features.get_us_dst_end(year)[source]
For each year, it returns the first Sunday in November, which is the end of the daylight saving (DST) in US/Canada.
We assume DST ends on the first Sunday of November at 2 a.m.
- Parameters
year (int) – Year for which DST end date is desired.
- Returns
result – The timestamp of end of DST in US/Canada.
- Return type
- greykite.common.features.timeseries_features.get_eu_dst_start(year)[source]
For each year, it returns the last Sunday in March, which is the start of the daylight saving (DST) in Europe.
We assume Europe DST starts on last Sunday of March at 1 a.m.
- Parameters
year (int) – Year for which DST start date is desired.
- Returns
result – The timestamp of start of DST in Europe.
- Return type
- greykite.common.features.timeseries_features.get_eu_dst_end(year)[source]
For each year, it returns the last Sunday in October, which is the end of the daylight saving (DST) in Europe.
We assume Europe DST ends on last Sunday of October at 2 a.m.
- Parameters
year (int) – Year for which DST end date is desired.
- Returns
result – The timestamp of end of DST in Europe.
- Return type
- greykite.common.features.timeseries_features.is_dst_fcn(time_zone)[source]
For a given time zone, constructs a function which determines, for each timestamp in a list, whether it is inside the daylight saving time (DST) period.
This function should work for regions in US / Canada and Europe.
The returned function assumes that the timestamps are in the given time_zone. Note that since daylight saving is the same for all of mainland US / Canada, one can pass any US time zone, e.g. "US/Pacific", to construct a function which works for all of mainland US. Similarly, for most of Europe it is sufficient to pass any Europe time zone, e.g. "Europe/London".
Some references on when DST started in the modern era:
Europe: https://www.timeanddate.com/time/europe/daylight-saving-history.html
US: https://en.wikipedia.org/wiki/Daylight_saving_time_in_the_United_States
Note: This function assumes the DST rules remain the same as they were in the year 2022 (when this code was written). A potentially more accurate (but much slower) version is available: pytz_is_dst_fcn. However, the current function is expected to be much faster, and it can be updated in case DST rules change.
- Parameters
time_zone (str) – A string denoting the time zone, e.g. “US/Pacific”, “Canada/Eastern”, “Europe/London”.
- Returns
is_dst – A function which takes a list of datetime-like objects and returns a list of booleans indicating whether each timestamp is in daylight saving time.
- Return type
callable
- greykite.common.features.timeseries_features.build_time_features_df(dt, conti_year_origin, add_dst_info=True)[source]
This function gets a datetime-like vector and creates new columns containing temporal features useful for time series analysis and forecasting e.g. year, week of year, etc.
- Parameters
dt (array-like (1-dimensional)) – A vector of datetime-like values
conti_year_origin (float) – The origin used for creating continuous time which is in years unit.
add_dst_info (bool, default True) – Determines if daylight saving columns for US and Europe should be added.
- Returns
time_features_df –
Dataframe with the following time features.
”datetime”: datetime.datetime object, a combination of date and a time
”date”: datetime.date object, date with the format (year, month, day)
”year”: integer, year of the date e.g. 2018
”year_length”: integer, number of days in the year e.g. 365 or 366
”quarter”: integer, quarter of the date, 1, 2, 3, 4
”quarter_start”: pandas.DatetimeIndex, date of beginning of the current quarter
”quarter_length”: integer, number of days in the quarter, 90/91 for Q1, 91 for Q2, 92 for Q3 and Q4
”month”: integer, month of the year, January=1, February=2, …, December=12
”month_length”: integer, number of days in the month, 28/ 29/ 30/ 31
”woy”: integer, ISO 8601 week of the year where a week starts from Monday, 1, 2, …, 53
”doy”: integer, ordinal day of the year, 1, 2, …, year_length
”doq”: integer, ordinal day of the quarter, 1, 2, …, quarter_length
”dom”: integer, ordinal day of the month, 1, 2, …, month_length
”dow”: integer, day of the week, Monday=1, Tuesday=2, …, Sunday=7
”str_dow”: string, day of the week as a string e.g. “1-Mon”, “2-Tue”, …, “7-Sun”
”str_doy”: string, day of the year e.g. “2020-03-20” for March 20, 2020
”hour”: integer, discrete hours of the datetime, 0, 1, …, 23
”minute”: integer, minutes of the datetime, 0, 1, …, 59
”second”: integer, seconds of the datetime, 0, 1, …, 59
”year_month”: string, (year, month) e.g. “2020-03” for March 2020
”year_woy”: string, (year, week of year) e.g. “2020_42” for 42nd week of 2020
”month_dom”: string, (month, day of month) e.g. “02/20” for February 20th
”year_woy_dow”: string, (year, week of year, day of week) e.g. “2020_03_6” for Saturday of 3rd week in 2020
”woy_dow”: string, (week of year, day of week) e.g. “03_6” for Saturday of 3rd week
”dow_hr”: string, (day of week, hour) e.g. “4_09” for 9am on Thursday
”dow_hr_min”: string, (day of week, hour, minute) e.g. “4_09_10” for 9:10am on Thursday
”tod”: float, time of day, continuous, 0.0 to 24.0
”tow”: float, time of week, continuous, 0.0 to 7.0
”tom”: float, standardized time of month, continuous, 0.0 to 1.0
”toq”: float, time of quarter, continuous, 0.0 to 1.0
”toy”: float, standardized time of year, continuous, 0.0 to 1.0
”conti_year”: float, year in continuous time, eg 2018.5 means middle of the year 2018
”is_weekend”: boolean, weekend indicator, True for weekend, else False
”dow_grouped”: string, Monday-Thursday=1234-MTuWTh, Friday=5-Fri, Saturday=6-Sat, Sunday=7-Sun
”ct1”: float, linear growth based on conti_year_origin, -infinity to infinity
”ct2”: float, signed quadratic growth, -infinity to infinity
”ct3”: float, signed cubic growth, -infinity to infinity
”ct_sqrt”: float, signed square root growth, -infinity to infinity
”ct_root3”: float, signed cubic root growth, -infinity to infinity
”us_dst”: bool, whether the time is inside the daylight saving time of US. This column is only generated if add_dst_info=True
”eu_dst”: bool, whether the time is inside the daylight saving time of Europe. This column is only generated if add_dst_info=True
- Return type
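A few of the features listed above can be derived for a single timestamp with the standard library, which may help when checking the conventions (Monday=1, ISO week numbering, zero-padded strings). This is a sketch for one timestamp only; the documented function operates on a vector and returns a data frame, and `basic_time_features` is a hypothetical helper that covers only a subset of the columns.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def basic_time_features(ts):
    """Returns a dict with a subset of the documented features for one timestamp."""
    # ISO 8601 calendar: week of year, and day of week with Monday=1.
    _, woy, dow = ts.isocalendar()
    str_dow = f"{dow}-{DAY_NAMES[dow - 1]}"
    return {
        "year": ts.year,
        "month": ts.month,
        "woy": woy,
        "doy": ts.timetuple().tm_yday,
        "dow": dow,
        "str_dow": str_dow,
        "hour": ts.hour,
        # Continuous time of day in hours, 0.0 to 24.0.
        "tod": ts.hour + ts.minute / 60.0 + ts.second / 3600.0,
        "is_weekend": dow >= 6,
        # Monday to Thursday share one label; other days keep their own.
        "dow_grouped": "1234-MTuWTh" if dow <= 4 else str_dow,
        "month_dom": f"{ts.month:02d}/{ts.day:02d}",
        "dow_hr": f"{dow}_{ts.hour:02d}",
    }

# Friday, March 20, 2020 at 9:10am.
features = basic_time_features(datetime(2020, 3, 20, 9, 10))
```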
- greykite.common.features.timeseries_features.add_time_features_df(df, time_col, conti_year_origin, add_dst_info=True)[source]
Adds a time feature data frame to a data frame by calling build_time_features_df.
- Parameters
df (pandas.DataFrame) – The input data frame
time_col (str) – The name of the time column of interest
conti_year_origin (float) – The origin of time for the continuous time variable which is in years unit.
add_dst_info (bool, default True) – Determines if daylight saving columns for US and Europe should be added.
- Returns
result – The same data frame (df) augmented with new columns generated by build_time_features_df
- Return type
pandas.DataFrame
- greykite.common.features.timeseries_features.get_holidays(countries, year_start, year_end)[source]
Extracts a holiday data frame for the period of interest [year_start to year_end] for the given countries. This is done using the holidays libraries in the PyPI package holidays-ext.
- Parameters
countries (list [str]) – countries for which we need holidays
year_start (int) – first year of interest, inclusive
year_end (int) – last year of interest, inclusive
- Returns
holiday_df_dict –
key: country name
value: data frame with holidays for that country. Each data frame has two columns: EVENT_DF_DATE_COL, EVENT_DF_LABEL_COL.
- Return type
dict [str,
pandas.DataFrame
]
- greykite.common.features.timeseries_features.get_available_holiday_lookup_countries(countries=None)[source]
Returns list of available countries for modeling holidays
- Parameters
countries – List[str] only look for available countries in this set
- Returns
List[str] list of available countries for modeling holidays
- greykite.common.features.timeseries_features.get_available_holidays_in_countries(countries, year_start, year_end)[source]
Returns a dictionary mapping each country to its holidays between the years specified.
- Parameters
countries – List[str] countries for which we need holidays
year_start – int first year of interest
year_end – int last year of interest
- Returns
Dict[str, List[str]] key: country name value: list of holidays in that country between [year_start, year_end]
- greykite.common.features.timeseries_features.get_available_holidays_across_countries(countries, year_start, year_end)[source]
Returns a list of holidays that occur in any of the countries between the years specified.
- Parameters
countries – List[str] countries for which we need holidays
year_start – int first year of interest
year_end – int last year of interest
- Returns
List[str] names of holidays in any of the countries between [year_start, year_end]
- greykite.common.features.timeseries_features.add_daily_events(df, event_df_dict, date_col='date', regular_day_label='', neighbor_impact=None, shifted_effect=None)[source]
For each key of event_df_dict, it adds a new column to a data frame (df) with a date column (date_col). Each new column will represent the events given for that key. This function also generates 3 binary event flags IS_EVENT_EXACT_COL, IS_EVENT_ADJACENT_COL and IS_EVENT_COL given the information in event_df_dict with the following logic:
(1) If the key contains “_minus_” or “_plus_”, that means the event was generated by the add_event_window function, and it is a neighboring day of some exact event day. In this case, IS_EVENT_ADJACENT_COL will be 1 for all days in this key.
(2) Otherwise the key indicates that it is on the exact event day being modeled. In this case, IS_EVENT_EXACT_COL will be 1 for all days in this key.
If a date appears in both types of keys, both above columns will be 1.
IS_EVENT_COL is 1 for all dates in the provided event_df_dict.
- Parameters
df (pandas.DataFrame) – The data frame which has a date column.
event_df_dict (dict [str, pandas.DataFrame]) – A dictionary of data frames, each representing events data for the corresponding key. Values are DataFrames with two columns:
The first column contains the date. Must be at the same frequency as df[date_col] for proper join. Must be in a format recognized by pandas.to_datetime.
The second column contains the event label for each date.
date_col (str) – Column name in df that contains the dates for joining against the events in event_df_dict.
regular_day_label (str) – The label used for regular days which are not “events”.
neighbor_impact (int, list [int], callable or None, default None) – The impact of neighboring timestamps of the events in event_df_dict. This is for daily events so the units below are all in days.
For example, if the data is weekly (“W-SUN”) and an event is daily, the event may not fall exactly on a weekly date. You can specify that New Year’s Day on 1-1 affects all dates in its week, e.g. 12-31, 1-1, …, 1-6, so that it is mapped to the weekly date. In this case you may want to map a daily event’s date to a few dates, and can specify neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)].
Another example is rolling 7-day daily data, where a holiday may affect the t, t+1, …, t+6 dates. You can specify neighbor_impact=7.
If input is int, the mapping is t, t+1, …, t+neighbor_impact-1. If input is list, the mapping is [t+x for x in neighbor_impact]. If input is a function, it maps each daily event’s date to a list of dates.
shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas to_offset. The new events’ names will be the current events’ names with suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have name “US_Christmas Day_7D_after”. This is useful when you expect an offset of the current holidays to also have an impact on the time series, or you want to interact the lagged terms with autoregression. If neighbor_impact is also specified, this will be applied after adding neighboring days.
- Returns
df_daily_events – An augmented data frame version of df with new label columns – one for each key of event_df_dict.
- Return type
- greykite.common.features.timeseries_features.add_event_window(df, time_col, label_col, time_delta='1D', pre_num=1, post_num=1, events_name='')[source]
For a data frame of events with a time_col and label_col, adds shifted events before and after the given events. For example, if the event data frame includes the row ‘2019-12-25, Christmas’, the function will produce data frames with the events ‘2019-12-24, Christmas’ and ‘2019-12-26, Christmas’ if pre_num and post_num are 1 or more.
- Parameters
df – pd.DataFrame The events data frame with two columns, time_col and label_col.
time_col – str The column with the timestamp of the events. This can be daily but does not have to be.
label_col – str The column with labels for the events.
time_delta – str The amount of the shift for each unit, specified by a string, e.g. “1D” stands for a one-day delta.
pre_num – int The number of events to be added before the given event, for each event in df.
post_num – int The number of events to be added after the given event, for each event in df.
events_name – str For each shift, a new data frame is generated, and those data frames are stored in a dictionary with appropriate keys. Each key starts with events_name, followed by “_minus_1”, “_minus_2”, “_plus_1”, “_plus_2”, …, depending on pre_num and post_num.
- Returns
dict[key: pd.DataFrame] A dictionary of data frames, one for each needed shift. For example, if pre_num=2 and post_num=3, then 2 + 3 = 5 data frames are stored in the returned dictionary.
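The shifting logic described above can be sketched without pandas. This is an illustration only, not the library's implementation: the helper name shift_events and the tuple-based event representation are assumptions, and only the key suffixes come from the description above.

```python
from datetime import datetime, timedelta


def shift_events(events, pre_num=1, post_num=1, delta=timedelta(days=1), events_name=""):
    """Hypothetical stand-in: returns one shifted event list per requested shift.

    events is a list of (timestamp, label) tuples. Keys follow the
    "<events_name>_minus_<i>" / "<events_name>_plus_<i>" pattern.
    """
    shifted = {}
    for i in range(1, pre_num + 1):
        # Move every event i steps earlier
        shifted[f"{events_name}_minus_{i}"] = [(ts - i * delta, label) for ts, label in events]
    for i in range(1, post_num + 1):
        # Move every event i steps later
        shifted[f"{events_name}_plus_{i}"] = [(ts + i * delta, label) for ts, label in events]
    return shifted


events = [(datetime(2019, 12, 25), "Christmas")]
result = shift_events(events, pre_num=2, post_num=3, events_name="holiday")
```

With pre_num=2 and post_num=3, the result holds five entries, matching the count stated in the Returns section.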
- greykite.common.features.timeseries_features.get_evenly_spaced_changepoints_values(df, continuous_time_col='ct1', n_changepoints=2)[source]
Partitions the interval into n_changepoints + 1 segments, placing a changepoint at the left endpoint of each segment. The leftmost segment doesn’t get a changepoint. Changepoints should be determined from training data.
- Parameters
df – pd.DataFrame Training dataset. Contains continuous_time_col.
continuous_time_col – str Name of the continuous time column (e.g. conti_year, ct1).
n_changepoints – int Number of changepoints requested.
- Returns
np.array values of df[continuous_time_col] at the changepoints
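As a rough sketch of the partitioning rule above, assuming the segments are measured in row counts and fractional positions are rounded down (both are assumptions, as is the helper name):

```python
import math


def evenly_spaced_changepoint_indices(n_rows, n_changepoints):
    """Hypothetical helper: splits range(n_rows) into n_changepoints + 1 segments
    and returns the left endpoint index of every segment except the first."""
    if n_changepoints <= 0:
        return []
    step = n_rows / (n_changepoints + 1)
    return [math.floor(i * step) for i in range(1, n_changepoints + 1)]


ct1 = [i / 10 for i in range(10)]  # stand-in for df["ct1"] on training data
indices = evenly_spaced_changepoint_indices(len(ct1), n_changepoints=2)
changepoint_values = [ct1[i] for i in indices]
```

Here ten rows and two changepoints give three segments, with changepoints at rows 3 and 6.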
- greykite.common.features.timeseries_features.get_evenly_spaced_changepoints_dates(df, time_col, n_changepoints)[source]
Partitions the interval into n_changepoints + 1 segments, placing a changepoint at the left endpoint of each segment. The leftmost segment doesn’t get a changepoint. Changepoints should be determined from training data.
- Parameters
df – pd.DataFrame Training dataset. Contains time_col.
time_col – str Name of the time column.
n_changepoints – int Number of changepoints requested.
- Returns
pd.Series values of df[time_col] at the changepoints
- greykite.common.features.timeseries_features.get_custom_changepoints_values(df, changepoint_dates, time_col='ts', continuous_time_col='ct1')[source]
Returns the values of continuous_time_col at the requested changepoint_dates.
- Parameters
df – pd.DataFrame Training dataset. Contains continuous_time_col and time_col.
changepoint_dates – Iterable[Union[int, float, str, datetime]] Changepoint dates, interpreted by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
time_col – str The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
continuous_time_col – str Name of the continuous time column (e.g. conti_year, ct1).
- Returns
np.array values of df[continuous_time_col] at the changepoints
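The "closest time on or after" rule can be illustrated with a binary search over sorted timestamps. This is a sketch under assumptions: the helper name is hypothetical, and dropping dates that fall after the last timestamp is a choice made here, not something the description confirms.

```python
from bisect import bisect_left
from datetime import datetime


def custom_changepoint_values(timestamps, continuous_time, changepoint_dates):
    """Hypothetical helper. timestamps must be sorted ascending and aligned
    with continuous_time."""
    values = []
    for date in changepoint_dates:
        idx = bisect_left(timestamps, date)  # first index with timestamps[idx] >= date
        if idx < len(timestamps):  # assumption: dates past the end are dropped
            values.append(continuous_time[idx])
    return values


timestamps = [datetime(2020, 1, d) for d in (1, 8, 15, 22)]  # weekly data
ct1 = [0.0, 0.25, 0.5, 0.75]
requested = [datetime(2020, 1, 8), datetime(2020, 1, 10), datetime(2020, 2, 1)]
values = custom_changepoint_values(timestamps, ct1, requested)
```

The date 2020-01-10 is not in the data, so it snaps forward to 2020-01-15.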
- greykite.common.features.timeseries_features.get_changepoint_string(changepoint_dates)[source]
Gets proper formatted strings for changepoint dates.
The default format is “_%Y_%m_%d_%H”. When necessary, it appends “_%M” or “_%M_%S”.
- Parameters
changepoint_dates (list) – List of changepoint dates, parsable by pandas.to_datetime.
- Returns
date_strings – List of string formatted changepoint dates.
- Return type
list [str]
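A minimal sketch of the formatting rule, assuming that "when necessary" means at least one date has non-zero minutes or seconds. That interpretation and the helper name are assumptions.

```python
from datetime import datetime


def changepoint_strings(dates):
    """Hypothetical helper: formats dates as "_%Y_%m_%d_%H", extended with
    minutes (and seconds) only if some date needs them."""
    fmt = "_%Y_%m_%d_%H"
    if any(d.second != 0 for d in dates):
        fmt += "_%M_%S"
    elif any(d.minute != 0 for d in dates):
        fmt += "_%M"
    return [d.strftime(fmt) for d in dates]


daily = changepoint_strings([datetime(2020, 1, 1), datetime(2020, 6, 15)])
sub_hourly = changepoint_strings([datetime(2020, 1, 1, 3, 30)])
```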
- greykite.common.features.timeseries_features.get_changepoint_features(df, changepoint_values, continuous_time_col='ct1', growth_func=None, changepoint_dates=None)[source]
Returns features for growth terms with continuous time origins at the changepoint_values (locations) specified.
Generates a time series feature for each changepoint. Let t = continuous_time value and c = changepoint value. Then the changepoint feature value at time point t is
growth_func(t - c) * I(t >= c), where I is the indicator function.
This represents growth as a function of time, where the time origin is the changepoint.
In the typical case where growth_func(0) = 0 (has origin at 0), the total effect of the changepoints is continuous in time. If growth_func is the identity function, and continuous_time represents the year in continuous time, these terms form the basis for a continuous, piecewise linear curve to the growth trend. When these terms are fit with a linear model, the coefficients represent the slope change at each changepoint.
Intended usage
- To make predictions (on test set)
Allow growth term as a function of time to change at these points.
- Parameters
df – pd.DataFrame The dataset to make predictions. Contains column continuous_time_col.
changepoint_values – array-like List of changepoint values (on the same scale as df[continuous_time_col]). Should be determined from training data.
continuous_time_col – Optional[str] Name of the continuous time column in df. growth_func is applied to this column to generate the growth term. If None, uses “ct1”, linear growth.
growth_func – Optional[callable] Growth function for defining changepoints (scalar -> scalar). If None, uses the identity function to use continuous_time_col directly as the growth term.
changepoint_dates – Optional[list] List of changepoint dates, parsable by pandas.to_datetime.
- Returns
pd.DataFrame, shape (df.shape[0], len(changepoints)) Changepoint features, 0-indexed.
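The feature definition growth_func(t - c) * I(t >= c) can be written out directly. The sketch below uses plain lists instead of a data frame, and the helper name is hypothetical.

```python
def identity(x):
    return x


def changepoint_features(times, changepoint_values, growth_func=identity):
    """Hypothetical helper: one row per time point, one column per changepoint.
    Each entry is growth_func(t - c) when t >= c, else 0."""
    return [
        [growth_func(t - c) if t >= c else 0.0 for c in changepoint_values]
        for t in times
    ]


ct1 = [0.0, 1.0, 2.0, 3.0, 4.0]
features = changepoint_features(ct1, changepoint_values=[1.0, 3.0])
```

With the identity growth function, each column is zero before its changepoint and grows linearly afterwards, so a linear model's coefficient on that column is the change in slope at that changepoint.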
- greykite.common.features.timeseries_features.get_changepoint_values_from_config(changepoints_dict, time_features_df, time_col='ts')[source]
Applies the changepoint method specified in changepoints_dict to return the changepoint values
- Parameters
changepoints_dict – Optional[Dict[str, any]] Specifies the changepoint configuration.
- “method”: str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.
Additional keys to provide parameters for each particular method are described below.
- “continuous_time_col”: Optional[str]
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.
- “growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to “continuous_time_col” with offsets. If None, uses the identity function to use continuous_time_col directly as the growth term.
- If changepoints_dict[“method”] == “uniform”, this other key is required:
- “n_changepoints”: int
Number of changepoints to evenly space across the training period.
- If changepoints_dict[“method”] == “custom”, this other key is required:
- “dates”: Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
time_features_df – pd.DataFrame Training dataset. Contains column “continuous_time_col”.
time_col – str The column name in time_features_df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex. Used only in the “custom” method.
- Returns
np.array values of df[continuous_time_col] at the changepoints
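For illustration, two changepoints_dict values that follow the keys described above. The specific numbers and dates are made up, and missing_required_key is a hypothetical checker written for this example, not a library function.

```python
from datetime import datetime


def identity(x):
    return x


# "uniform" method: requires "n_changepoints"
uniform_changepoints = {
    "method": "uniform",
    "n_changepoints": 20,
    "continuous_time_col": "ct1",
    "growth_func": identity,
}

# "custom" method: requires "dates" (anything pd.to_datetime can parse)
custom_changepoints = {
    "method": "custom",
    "dates": ["2019-03-01", datetime(2020, 1, 15)],
}

REQUIRED_KEYS = {"uniform": "n_changepoints", "custom": "dates"}


def missing_required_key(changepoints_dict):
    """Hypothetical helper: returns the required key that is absent, else None."""
    required = REQUIRED_KEYS[changepoints_dict["method"]]
    return None if required in changepoints_dict else required
```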
- greykite.common.features.timeseries_features.get_changepoint_features_and_values_from_config(df, time_col, changepoints_dict=None, origin_for_time_vars=None)[source]
Extracts changepoints from changepoint configuration and input data
- Parameters
df – pd.DataFrame Training data. A data frame which includes the timestamp and value columns
time_col – str The column name in df representing time for the time series data The time column can be anything that can be parsed by pandas DatetimeIndex
changepoints_dict – Optional[Dict[str, any]] Specifies the changepoint configuration.
- “method”: str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.
Additional keys to provide parameters for each particular method are described below.
- “continuous_time_col”: Optional[str]
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.
- “growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to “continuous_time_col” with offsets. If None, uses the identity function to use continuous_time_col directly as the growth term.
- If changepoints_dict[“method”] == “uniform”, this other key is required:
- “n_changepoints”: int
Number of changepoints to evenly space across the training period.
- If changepoints_dict[“method”] == “custom”, this other key is required:
- “dates”: Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
origin_for_time_vars – Optional[float] The time origin used to create continuous variables for time.
- Returns
Dict[str, any] Dictionary with the requested changepoints and associated information.
- changepoint_df: pd.DataFrame, shape (df.shape[0], len(changepoints))
Changepoint features for modeling the training data.
- changepoint_values: array-like
List of changepoint values (on the same scale as df[continuous_time_col]). Can be used to generate changepoints for prediction.
- continuous_time_col: Optional[str]
Name of the continuous time column in df. growth_func is applied to this column to generate the growth term. If None, uses “ct1”, linear growth. Can be used to generate changepoints for prediction.
- growth_func: Optional[callable]
Growth function for defining changepoints (scalar -> scalar). If None, uses the identity function to use continuous_time_col directly as the growth term. Can be used to generate changepoints for prediction.
- changepoint_cols: List[str]
Names of the changepoint columns for modeling.
- greykite.common.features.timeseries_features.get_changepoint_dates_from_changepoints_dict(changepoints_dict, df=None, time_col=None)[source]
Gets the changepoint dates from changepoints_dict.
- Parameters
changepoints_dict (dict or None) – The changepoints_dict which is compatible with forecast.
df (pandas.DataFrame or None, default None) – The data df to put changepoints on.
time_col (str or None, default None) – The column name of the time column in df.
- Returns
changepoint_dates – List of changepoint dates.
- Return type
list
- greykite.common.features.timeseries_features.add_event_window_multi(event_df_dict, time_col, label_col, time_delta='1D', pre_num=1, post_num=1, pre_post_num_dict=None)[source]
For a given dictionary of events data frames, each with a time_col and label_col, adds shifted events before and after the given events. For example, if the event data frame includes the row ‘2019-12-25, Christmas’, the function produces data frames with the events ‘2019-12-24, Christmas’ and ‘2019-12-26, Christmas’ if pre_num and post_num are 1 or more.
- Parameters
event_df_dict (dict [str, pandas.DataFrame]) – A dictionary of events data frames, each having two columns: time_col and label_col.
time_col (str) – The column with the timestamp of the events. This can be daily but does not have to be.
label_col (str) – The column with labels for the events.
time_delta (str, default “1D”) – The amount of the shift for each unit, specified by a string, e.g. ‘1D’ stands for a one-day delta.
pre_num (int, default 1) – The number of events to be added before the given event, for each event in df.
post_num (int, default 1) – The number of events to be added after the given event, for each event in df.
pre_post_num_dict (dict [str, (int, int)] or None, default None) – Optionally override pre_num and post_num for each key in event_df_dict. For example, if event_df_dict has keys “US” and “India”, this parameter can be set to pre_post_num_dict = {"US": [1, 3], "India": [1, 2]}, denoting that the “US” pre_num is 1 and post_num is 3, and the “India” pre_num is 1 and post_num is 2. Keys not specified by pre_post_num_dict use the default given by pre_num and post_num.
- Returns
df – A dictionary of data frames, one for each needed shift. For example, if pre_num=2 and post_num=3, then 2 + 3 = 5 data frames are stored in the returned dictionary.
- Return type
dict [str, pandas.DataFrame]
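The override rule for pre_post_num_dict can be sketched as a lookup with a fallback. The helper name resolve_pre_post_num and the extra “UK” key are hypothetical; the “US” and “India” values repeat the example above.

```python
def resolve_pre_post_num(event_keys, pre_num=1, post_num=1, pre_post_num_dict=None):
    """Hypothetical helper: returns the (pre_num, post_num) pair used for each key,
    taking the per-key override when present and the defaults otherwise."""
    overrides = pre_post_num_dict or {}
    return {key: tuple(overrides.get(key, (pre_num, post_num))) for key in event_keys}


resolved = resolve_pre_post_num(
    ["US", "India", "UK"],
    pre_num=1,
    post_num=1,
    pre_post_num_dict={"US": [1, 3], "India": [1, 2]},
)
```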
- greykite.common.features.timeseries_features.get_fourier_col_name(k, col_name, function_name='sin', seas_name=None)[source]
Returns the column name corresponding to a particular fourier term, as returned by fourier_series_fcn.
- Parameters
k – int Fourier term.
col_name – str Column in the data frame used to generate the fourier series.
function_name – str sin or cos.
seas_name – str Appended to new column names added for fourier terms.
- Returns
str Column name in the DataFrame returned by fourier_series_fcn.
- greykite.common.features.timeseries_features.fourier_series_fcn(col_name, period=1.0, order=1, seas_name=None)[source]
Generates a function which creates a fourier series matrix for a column of an input df.
- Parameters
col_name – str The column name in the data frame which is to be used for generating the fourier series. It needs to be a continuous variable.
period – float The period of the fourier series.
order – int The order of the fourier series.
seas_name – Optional[str] Appended to new column names added for fourier terms. Useful to distinguish multiple fourier series on the same col_name with different periods.
- Returns
callable A function which can be applied to any data frame df with a column name equal to col_name.
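As a sketch of what a fourier series of a given period and order evaluates to, assuming the standard terms sin(2πkx/period) and cos(2πkx/period) for k = 1..order. The helper name and the ordering of the returned terms are assumptions, not the layout of the library's output matrix.

```python
import math


def fourier_terms(x, period=1.0, order=1):
    """Hypothetical helper: returns [sin_1, cos_1, ..., sin_order, cos_order] at x."""
    omega = 2 * math.pi / period
    terms = []
    for k in range(1, order + 1):
        terms.append(math.sin(omega * k * x))
        terms.append(math.cos(omega * k * x))
    return terms


# A quarter of the way through a period-1 cycle, with two harmonics
row = fourier_terms(0.25, period=1.0, order=2)
```

Each input value yields 2 * order features, so a higher order lets the model fit sharper seasonal shapes at the cost of more columns.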
- greykite.common.features.timeseries_features.fourier_series_multi_fcn(col_names, periods=None, orders=None, seas_names=None)[source]
Generates a func which adds multiple fourier series with multiple periods.
- Parameters
col_names (list [str]) – the column names which are to be used to generate Fourier series. Each column can have its own period and order.
periods (list [float] or None) – the periods corresponding to each column given in col_names
orders (list [int] or None) – the orders for each of the Fourier series
seas_names (list [str] or None) – Appended to the Fourier series name. If not provided (None) col_names will be used directly.
- greykite.common.features.timeseries_features.signed_pow(x, y)[source]
Takes the absolute value of x and raises it to the power of y, then multiplies the result by the sign of x. For y > 0, this guarantees the function is non-decreasing in x. This is useful in many contexts, e.g. statistical modeling.
- Parameters
x – The base number, which can be any real number.
y – The power, which can be any real number.
- Returns
abs(x) to the power of y, multiplied by the sign of x.
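A minimal scalar sketch of the computation described above. It is written here with math.copysign for illustration and is not necessarily how the library implements it.

```python
import math


def signed_pow(x, y):
    """Scalar sketch: sign(x) * abs(x) ** y."""
    return math.copysign(abs(x) ** y, x)


root_of_negative = signed_pow(-4, 0.5)
root_of_positive = signed_pow(9, 0.5)
squared_negative = signed_pow(-2, 2)
```

Unlike the built-in power operator, which returns a complex number for (-4) ** 0.5, this stays real-valued for negative bases with fractional powers.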
- greykite.common.features.timeseries_features.logistic(x, growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0)[source]
Evaluates the logistic function at x with the specified growth rate, capacity, floor, and inflection point.
- Parameters
- Returns
value of the logistic function at x
- Return type
- greykite.common.features.timeseries_features.get_logistic_func(growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0)[source]
Returns a function that evaluates the logistic function at x with the specified growth rate, capacity, floor, and inflection point:
f(x) = floor + capacity / (1 + exp(-growth_rate * (x - inflection_point)))
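The formula above translates directly into a closure. The name make_logistic is a stand-in chosen for this sketch, and the parameter values in the example are arbitrary.

```python
import math


def make_logistic(growth_rate=1.0, capacity=1.0, floor=0.0, inflection_point=0.0):
    """Stand-in sketch: returns f(x) = floor + capacity / (1 + exp(-growth_rate * (x - inflection_point)))."""
    def logistic(x):
        return floor + capacity / (1.0 + math.exp(-growth_rate * (x - inflection_point)))
    return logistic


f = make_logistic(growth_rate=2.0, capacity=10.0, floor=5.0, inflection_point=3.0)
midpoint = f(3.0)  # at the inflection point, the value is floor + capacity / 2
```

The curve rises from floor toward floor + capacity, crossing the halfway value at inflection_point, with growth_rate controlling how steep the transition is.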
- greykite.algo.forecast.silverkite.forecast_simple_silverkite_helper.get_event_pred_cols(daily_event_df_dict, daily_event_shifted_effect=None)[source]
Generates the names of internal predictor columns from the event dictionary passed to forecast. These can be passed via the extra_pred_cols parameter to model event effects.
Note
The returned strings are patsy model formula terms. Each provides the full set of levels so that prediction works even if a level is not found in the training set.
If a level does not appear in the training set, its coefficient may be unbounded in the “linear” fit_algorithm. A method with regularization avoids this issue (e.g. “ridge”, “elastic_net”).
- Parameters
daily_event_df_dict (dict or None, optional, default None) – A dictionary of data frames, each representing events data for the corresponding key. See forecast.
daily_event_shifted_effect (list [str] or None, default None) – Additional neighbor events based on the given events. For example, passing [“-1D”, “7D”] adds extra daily events 1 day before and 7 days after the given events. The offset format is {d}{freq}: any integer followed by a frequency string. Must be parsable by pandas to_offset. The new events’ names are the current events’ names with the suffix “{offset}_before” or “{offset}_after”. For example, for an event named “US_Christmas Day”, a “7D” shift is named “US_Christmas Day_7D_after”. This is useful when you expect an offset of the current holidays to also affect the time series, or when you want to interact the lagged terms with autoregression. The interaction can be specified with e.g. y_lag7:events_US_Christmas Day_7D_after. If daily_event_neighbor_impact is also specified, this is applied after adding neighboring days.
- Returns
event_pred_cols – List of patsy model formula terms, one for each key of daily_event_df_dict.
- Return type
list [str]
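The naming convention for shifted events can be sketched as string formatting. The “7D” case reproduces the example in the parameter description; the “-1D” case assumes the minus sign is dropped from the suffix, which the description does not spell out. The helper name is hypothetical.

```python
def shifted_event_name(event_name, offset):
    """Hypothetical helper: negative offsets become "<n>_before",
    all other offsets become "<n>_after"."""
    if offset.startswith("-"):
        return f"{event_name}_{offset[1:]}_before"  # assumption: sign is dropped
    return f"{event_name}_{offset}_after"


names = [shifted_event_name("US_Christmas Day", offset) for offset in ["-1D", "7D"]]
# Interaction with a 7-period lag, in the form shown in the description above
interaction_term = f"y_lag7:events_{names[1]}"
```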
- greykite.framework.pipeline.utils.get_basic_pipeline(estimator=SimpleSilverkiteEstimator(), score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, agg_periods=None, agg_func=None, relative_error_tolerance=None, coverage=0.95, null_model_params=None, regressor_cols=None, lagged_regressor_cols=None)[source]
Returns a basic pipeline for univariate forecasting. Allows for outlier detection, normalization, null imputation, degenerate column removal, and forecast model fitting. By default, only null imputation is enabled. See source code for the pipeline steps.
Notes
While score_func is used to define the estimator’s score function, the scoring parameter of RandomizedSearchCV should be provided when using this pipeline in grid search. Otherwise, grid search assumes higher values are better for score_func.
- Parameters
estimator (instance of an estimator that implements greykite.sklearn.estimator.base_forecast_estimator.BaseForecastEstimator, default SimpleSilverkiteEstimator()) – Estimator to use as the final step in the pipeline.
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select the optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. The model is fit at the original frequency, and the forecast is aggregated according to agg_periods. E.g. fit the model on hourly data, and evaluate performance at the daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. Required if score_func is FRACTION_OUTSIDE_TOLERANCE.
coverage (float or None, default=0.95) – Intended coverage of the prediction bands (0.0 to 1.0). If None, the upper/lower predictions are not returned. Ignored if pipeline is provided. Uses coverage of the pipeline estimator instead.
null_model_params (dict or None, default None) –
Defines the baseline model to compute the R2_null_model_score evaluation metric. R2_null_model_score is the improvement in the loss function relative to a null model. It can be used to evaluate model quality with respect to a simple baseline. For details, see r2_null_model_score.
The null model is a DummyRegressor, which returns constant predictions.
Valid keys are “strategy”, “constant”, “quantile”. See DummyRegressor. For example:
    null_model_params = {
        "strategy": "mean",
    }
    null_model_params = {
        "strategy": "median",
    }
    null_model_params = {
        "strategy": "quantile",
        "quantile": 0.8,
    }
    null_model_params = {
        "strategy": "constant",
        "constant": 2.0,
    }
If None, R2_null_model_score is not calculated.
Note: CV model selection always optimizes score_func, not the R2_null_model_score.
regressor_cols (list [str] or None, default None) – A list of regressor columns used in the training and prediction DataFrames. It should contain only the regressors that are being used in the grid search. If None, no regressor columns are used. Regressor columns that are unavailable in df are dropped.
lagged_regressor_cols (list [str] or None, default None) – A list of additional columns needed for lagged regressors in the training and prediction DataFrames. This list can have overlap with regressor_cols. If None, no additional columns are added to the DataFrame. Lagged regressor columns that are unavailable in df are dropped.
- Returns
pipeline – sklearn Pipeline for univariate forecasting.
- Return type
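The FRACTION_OUTSIDE_TOLERANCE definition given for relative_error_tolerance can be illustrated as follows. This is a sketch of the stated definition only: the function name is hypothetical, and it assumes relative error is |forecast - actual| / |actual| with non-zero actuals.

```python
def fraction_outside_tolerance(y_true, y_pred, rtol=0.05):
    """Hypothetical helper: fraction of forecasts whose relative error is
    strictly greater than rtol. Assumes every actual value is non-zero."""
    outside = [abs(p - t) / abs(t) > rtol for t, p in zip(y_true, y_pred)]
    return sum(outside) / len(outside)


y_true = [100.0, 200.0, 50.0, 80.0]
y_pred = [104.0, 230.0, 50.0, 70.0]
# Relative errors are 4%, 15%, 0%, and 12.5%, so two of four exceed 5%
score = fraction_outside_tolerance(y_true, y_pred, rtol=0.05)
```

Because this is a loss (lower is better), it pairs with score_func_greater_is_better=False.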
- greykite.framework.utils.exploratory_data_analysis.get_exploratory_plots(df, time_col, value_col, freq=None, anomaly_info=None, output_path=None)[source]
Computes multiple exploratory data analysis (EDA) plots to visualize the metric in value_col and aid in modeling. The EDA plots are written to an html file at output_path.
For details on how to interpret these EDA plots, check the tutorials.
- Parameters
df (pandas.DataFrame) – Input timeseries. A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
freq (str or None, default None) – Timeseries frequency, DateOffset alias. If None, automatically inferred.
anomaly_info (dict or list [dict] or None, default None) – Anomaly adjustment info. Anomalies in df are corrected before any plotting is done.
output_path (str or None, default None) – Path where the html file is written. If None, it is set to “EDA_{value_col}.html”.
- Returns
eda.html – An html file containing the EDA plots is written at output_path.
- Return type
html file
- greykite.framework.utils.result_summary.summarize_grid_search_results(grid_search, only_changing_params=True, combine_splits=True, decimals=None, score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics='ALL', column_order=None)[source]
Summarizes CV results for each grid search parameter combination.
While grid_search.cv_results_ could be imported into a pandas.DataFrame without this function, the following conveniences are provided:
- returns the correct ranks based on each metric’s greater_is_better direction
- summarizes the hyperparameter space, only showing the parameters that change
- combines split scores into a tuple to save table width
- rounds the values to specified decimals
- orders columns by type (test score, train score, metric, etc.)
- Parameters
grid_search (RandomizedSearchCV) – Grid search output (fitted RandomizedSearchCV object).
only_changing_params (bool, default True) – If True, only show parameters with multiple values in the hyperparameter_grid.
combine_splits (bool, default True) –
Whether to report split scores as a tuple in a single column.
If True, adds a column for the test split scores for each requested metric. Adds a column with train split scores if those are available.
For example, “split_train_score” would contain the values (split1_train_score, split2_train_score, split3_train_score) as a tuple.
If False, this summary column is not added.
The original split columns are available either way.
decimals (int or None, default None) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point. If None, does not round.
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) –
Score function used to select the optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
Used in this function to fix the "rank_test_score" column if score_func_greater_is_better=False.
Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
score_func_greater_is_better (bool, default False) –
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
Used in this function to fix the "rank_test_score" column if score_func_greater_is_better=False.
Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
cv_report_metrics (CV_REPORT_METRICS_ALL, or list [str], or None, default CV_REPORT_METRICS_ALL) –
Additional metrics to show in the summary, besides the one specified by score_func.
If a metric is specified but not available, a warning will be given.
Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher, or a subset of the computed metrics to show.
If a list of strings, valid strings are greykite.common.evaluation.EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
column_order (list [str] or None, default None) –
How to order the columns. A list of regex to order column names, in greedy fashion. Column names matching the first item are placed first. Among the remaining columns, those matching the second item are placed next, etc. Use “.*” as the last element to select all available columns, if desired. If None, uses default ordering:
column_order = ["rank_test", "mean_test", "split_test", "mean_train", "params", "param", "split_train", "time", ".*"]
Notes
Metrics are named in grid_search.cv_results_ according to the scoring parameter passed to RandomizedSearchCV.
"score" is the default used by sklearn for single metric evaluation.
If a dictionary is provided to scoring, as is the case through templates, then the metrics are named by its keys, and the metric used for selection is defined by refit. The keys are derived from score_func and cv_report_metrics in get_scoring_and_refit.
- The key for score_func if it is a callable is CUSTOM_SCORE_FUNC_NAME.
- The key for an EvaluationMetricEnum member name is the short name from .get_metric_name().
- The key for FRACTION_OUTSIDE_TOLERANCE is FRACTION_OUTSIDE_TOLERANCE_NAME.
- Returns
cv_results – A summary of cross-validation results in tabular format. Each row corresponds to a set of parameters used in the grid search.
The columns have the following format, where name is the canonical short name for the metric.
- "rank_test_{name}": int
The params ranked by mean_test_score (1 is best).
- "mean_test_{name}": float
Average test score.
- "split_test_{name}": list [float]
Test score on each split. [split 0, split 1, …]
- "std_test_{name}": float
Standard deviation of test scores.
- "mean_train_{name}": float
Average train score.
- "split_train_{name}": list [float]
Train score on each split. [split 0, split 1, …]
- "std_train_{name}": float
Standard deviation of train scores.
- "mean_fit_time": float
Average time to fit each CV split (in seconds).
- "std_fit_time": float
Std of time to fit each CV split (in seconds).
- "mean_score_time": float
Average time to score each CV split (in seconds).
- "std_score_time": float
Std of time to score each CV split (in seconds).
- "params": dict
The parameters used. If only_changing_params==True, only shows the parameters which are not identical across all CV splits.
- "param_{pipeline__param__name}": Any
The value of pipeline parameter pipeline__param__name for each row.
- Return type
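The greedy column_order rule can be sketched with the standard re module. This is an illustration under assumptions: the helper name is hypothetical, patterns are matched anywhere in the column name with re.search, and the example column names are made up.

```python
import re


def order_columns(columns, column_order):
    """Hypothetical helper: for each pattern in turn, moves every not-yet-placed
    column that matches it to the output. Unmatched columns are left out."""
    ordered = []
    remaining = list(columns)
    for pattern in column_order:
        matched = [c for c in remaining if re.search(pattern, c)]
        ordered.extend(matched)
        remaining = [c for c in remaining if c not in matched]
    return ordered


columns = [
    "params",
    "mean_fit_time",
    "mean_train_MAPE",
    "rank_test_MAPE",
    "param_estimator__fit_algorithm",
    "mean_test_MAPE",
]
ordered = order_columns(
    columns, ["rank_test", "mean_test", "mean_train", "params", "param", "time", ".*"]
)
without_catch_all = order_columns(columns, ["rank_test", "mean_test"])
```

Placing “params” before “param” keeps the dictionary column ahead of the individual param_ columns, and omitting the final catch-all pattern restricts the output to the columns named explicitly.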
- greykite.framework.utils.result_summary.get_ranks_and_splits(grid_search, score_func='MeanAbsolutePercentError', greater_is_better=False, combine_splits=True, decimals=None, warn_metric=True)[source]
Extracts CV results from grid_search for the specified score function. Returns the correct ranks on the test set and a tuple of the scores across splits, for both test set and train set (if available).
Notes
While cv_results contains keys with the ranks, these ranks are inverted if lower values are better and the scoring function was initialized with greater_is_better=True to report metrics with their original sign.
This function always returns the correct ranks, accounting for metric direction.
- Parameters
grid_search (RandomizedSearchCV) – Grid search output (fitted RandomizedSearchCV object).
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) –
Score function to get the ranks for. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
greater_is_better (bool or None, default False) –
True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
Used in this function to rank values in the proper direction.
Should be the same as what was passed to run_forecast_config, or forecast_pipeline, or get_hyperparameter_searcher.
combine_splits (bool, default True) – Whether to report split scores as a tuple in a single column. If True, a single column is returned for all the splits of a given metric and train/test set. For example, “split_train_score” would contain the values (split1_train_score, split2_train_score, split3_train_score) as a tuple. If False, they are reported in their original columns.
decimals (int or None, default None) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point. If None, does not round.
warn_metric (bool, default True) – Whether to issue a warning if the requested metric is not found in the CV results.
- Returns
ranks_and_splits – Ranks and split scores. Dictionary with the following keys:
- "short_name": str
Canonical short name for the score_func.
- "ranks": numpy.array
Ranks of the test scores for the score_func, where 1 is the best.
- "split_train": list [list [float]]
Train split scores. Outer list corresponds to the parameter setting; inner list contains the scores for that parameter setting across all splits.
- "split_test": list [list [float]]
Test split scores. Outer list corresponds to the parameter setting; inner list contains the scores for that parameter setting across all splits.
- Return type
dict
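Direction-aware ranking, as described in the Notes above, can be sketched in a few lines. The helper name and the tie-handling rule (tied scores share the lowest rank) are assumptions made for this illustration; the scores are made up.

```python
def rank_scores(scores, greater_is_better=False):
    """Hypothetical helper: rank 1 is the best score. Ties share the lowest rank."""
    # Flip the sign for scores so that "smaller is better" holds in both cases
    keyed = [-s if greater_is_better else s for s in scores]
    return [1 + sum(other < k for other in keyed) for k in keyed]


mean_test_scores = [12.5, 8.0, 8.0, 20.1]
ranks_as_loss = rank_scores(mean_test_scores, greater_is_better=False)
ranks_as_score = rank_scores(mean_test_scores, greater_is_better=True)
```

The same values produce opposite orderings depending on direction, which is why the direction passed here should match the one used when the search was run.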
- greykite.common.viz.timeseries_plotting.plot_multivariate(df, x_col, y_col_style_dict='plotly', default_color='rgba(0, 145, 202, 1.0)', xlabel=None, ylabel='y', title=None, showlegend=True)[source]
Plots one or more lines against the same x-axis values.
- Parameters
df (pandas.DataFrame) – Data frame with x_col and columns named by the keys in y_col_style_dict.
x_col (str) – Which column to plot on the x-axis.
y_col_style_dict (dict [str, dict or None] or “plotly” or “auto” or “auto-fill”, default “plotly”) –
The column(s) to plot on the y-axis, and how to style them.
If a dictionary:
- key: str
Column name in df.
- value: dict or None
Optional styling options, passed as kwargs to go.Scatter. If None, uses the default: line labeled by the column name. See the reference page for plotly.graph_objects.Scatter for options (e.g. color, mode, width/size, opacity). https://plotly.com/python/reference/#scatter.
If a string, plots all columns in df besides x_col against x_col:
- “plotly”: plot lines with default plotly styling
- “auto”: plot lines with color default_color, sorted by value (ascending)
- “auto-fill”: plot lines with color default_color, sorted by value (ascending), and fills between lines
default_color (str, default “rgba(0, 145, 202, 1.0)” (blue)) – Default line color when y_col_style_dict is one of “auto”, “auto-fill”.
xlabel (str or None, default None) – x-axis label. If None, default is x_col.
ylabel (str or None, default VALUE_COL) – y-axis label.
title (str or None, default None) – Plot title. If None, default is based on axis labels.
showlegend (bool, default True) – Whether to show the legend.
- Returns
fig – Interactive plotly graph of one or more columns in df against x_col.
See plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
- greykite.common.viz.timeseries_plotting.plot_univariate(df, x_col, y_col, xlabel=None, ylabel=None, title=None, color='rgb(32, 149, 212)', showlegend=True)[source]
Simple plot of univariate timeseries.
- Parameters
df (pandas.DataFrame) – Data frame with x_col and y_col.
x_col (str) – x-axis column name, usually the time column.
y_col (str) – y-axis column name, the value to plot.
xlabel (str or None, default None) – x-axis label
ylabel (str or None, default None) – y-axis label
title (str or None, default None) – Plot title. If None, default is based on axis labels.
color (str, default “rgb(32, 149, 212)” (light blue)) – Line color
showlegend (bool, default True) – Whether to show the legend
- Returns
fig – Interactive plotly graph of the value against time.
See plot_forecast_vs_actual return value for how to plot the figure and add customization.
- Return type
See also
plot_multivariate
Provides more styling options. Also consider using plotly’s go.Scatter and go.Layout directly.
- greykite.common.viz.timeseries_plotting.plot_forecast_vs_actual(df, time_col='ts', actual_col='actual', predicted_col='forecast', predicted_lower_col='forecast_lower', predicted_upper_col='forecast_upper', xlabel='ts', ylabel='y', train_end_date=None, title=None, showlegend=True, actual_mode='lines+markers', actual_points_color='rgba(250, 43, 20, 0.7)', actual_points_size=2.0, actual_color_opacity=1.0, forecast_curve_color='rgba(0, 90, 181, 0.7)', forecast_curve_dash='solid', ci_band_color='rgba(0, 90, 181, 0.15)', ci_boundary_curve_color='rgba(0, 90, 181, 0.5)', ci_boundary_curve_width=0.0, vertical_line_color='rgba(100, 100, 100, 0.9)', vertical_line_width=1.0)[source]
Plots forecast with prediction intervals, against actuals. Adapted from the plotly user guide: https://plot.ly/python/v3/continuous-error-bars/#basic-continuous-error-bars
- Parameters
df (pandas.DataFrame) – Timestamp, predicted, and actual values.
time_col (str, default TIME_COL) – Column in df with timestamp (x-axis).
actual_col (str, default ACTUAL_COL) – Column in df with actual values.
predicted_col (str, default PREDICTED_COL) – Column in df with predicted values.
predicted_lower_col (str or None, default PREDICTED_LOWER_COL) – Column in df with predicted lower bound.
predicted_upper_col (str or None, default PREDICTED_UPPER_COL) – Column in df with predicted upper bound.
xlabel (str, default TIME_COL) – x-axis label.
ylabel (str, default VALUE_COL) – y-axis label.
train_end_date (datetime.datetime or None, default None) – Train end date. Must be a value in df[time_col].
title (str or None, default None) – Plot title.
showlegend (bool, default True) – Whether to show a plot legend.
actual_mode (str, default “lines+markers”) – How to show the actuals. Options: markers, lines, lines+markers.
actual_points_color (str, default “rgba(250, 43, 20, 0.7)”) – Color of actual line/marker.
actual_points_size (float, default 2.0) – Size of actual markers. Only used if “markers” is in actual_mode.
actual_color_opacity (float or None, default 1.0) – Opacity of actual values points.
forecast_curve_color (str, default “rgba(0, 90, 181, 0.7)”) – Color of forecasted values.
forecast_curve_dash (str, default “solid”) – ‘dash’ property of forecast scatter.line. One of: ['solid', 'dot', 'dash', 'longdash', 'dashdot', 'longdashdot'] or a string containing a dash length list in pixels or percentages (e.g. '5px 10px 2px 2px', '5, 10, 2, 2', '10% 20% 40%').
ci_band_color (str, default “rgba(0, 90, 181, 0.15)”) – Fill color of the prediction bands.
ci_boundary_curve_color (str, default “rgba(0, 90, 181, 0.5)”) – Color of the prediction upper/lower lines.
ci_boundary_curve_width (float, default 0.0) – Width of the prediction upper/lower lines. The default of 0.0 hides them.
vertical_line_color (str, default “rgba(100, 100, 100, 0.9)”) – Color of the vertical line indicating train end date. Default is gray with opacity of 0.9.
vertical_line_width (float, default 1.0) – Width of the vertical line indicating train end date.
- Returns
fig – Plotly figure of forecast against actuals, with prediction intervals if available.
Can show, convert to HTML, update:
    # show figure
    fig.show()
    # get HTML string, write to file
    fig.to_html(include_plotlyjs=False, full_html=True)
    fig.write_html("figure.html", include_plotlyjs=False, full_html=True)
    # customize layout (https://plot.ly/python/v3/user-guide/)
    update_layout = dict(
        yaxis=dict(title="new ylabel"),
        title_text="new title",
        title_x=0.5,
        title_font_size=30)
    fig.update_layout(update_layout)
- Return type
- greykite.common.features.timeseries_impute.impute_with_lags(df, value_col, orders, agg_func='mean', iter_num=1)[source]
Imputes the timeseries values in the value_col column of df using chosen lagged values or an aggregate of them. For example, for daily data one could use the 7th lag to impute with the value from the same day of the previous week, rather than the closest available value, which can be inferior for business-related timeseries.
The imputation can be repeated by specifying iter_num, which in some cases decreases the number of missing values. Note that, by design, this method does not guarantee that all missing values are imputed. However, the function returns the original and final number of missing values along with the imputed dataframe.
- Parameters
df (pandas.DataFrame) – Input dataframe which must include value_col as a column.
value_col (str) – The column name in df representing the values of the timeseries.
orders (list of int) – The lag orders to be used for aggregation.
agg_func ("mean" or callable, default: "mean") – pandas.Series -> float. An aggregation function to aggregate the chosen lags. If “mean”, uses pandas.DataFrame.mean.
iter_num (int, default 1) – Maximum number of iterations to impute the series. Each iteration imputes the series using the provided lag orders (orders) and returns an imputed dataframe. A single iteration may leave some values missing; additional iterations can impute more of them.
- Returns
impute_info – A dictionary with following items:
- ”df”
pandas.DataFrame
A dataframe with the imputed values.
- ”initial_missing_num”int
Initial number of missing values.
- ”final_missing_num”int
Final number of missing values after imputations.
- Return type
dict
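The lag-based imputation described for impute_with_lags can be illustrated with a minimal pure-Python sketch. It is not Greykite's implementation: the helper name impute_with_lags_sketch, the plain list in place of a dataframe, None as the missing marker, and the fixed mean aggregation are all assumptions made for illustration.

```python
import math


def impute_with_lags_sketch(values, orders, iter_num=1):
    """Fills missing entries with the mean of the available lagged values.

    Hypothetical helper: works on a list, where None or NaN marks a missing value.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    values = list(values)
    initial_missing_num = sum(is_missing(v) for v in values)
    for _ in range(iter_num):
        previous = list(values)  # lags are read from the start of this iteration
        for i, v in enumerate(previous):
            if not is_missing(v):
                continue
            lagged = [previous[i - k] for k in orders
                      if i - k >= 0 and not is_missing(previous[i - k])]
            if lagged:
                values[i] = sum(lagged) / len(lagged)
        if not any(is_missing(v) for v in values):
            break  # nothing left to impute
    return {
        "values": values,
        "initial_missing_num": initial_missing_num,
        "final_missing_num": sum(is_missing(v) for v in values),
    }


series = [1.0, 2.0, None, None, 5.0, None]
one_pass = impute_with_lags_sketch(series, orders=[2], iter_num=1)
two_passes = impute_with_lags_sketch(series, orders=[2], iter_num=2)
```

With a single pass the last value stays missing, because its lag was itself missing at the start of that pass; a second pass fills it. This mirrors why iter_num can reduce, but not always eliminate, the number of missing values.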
- greykite.common.features.timeseries_impute.impute_with_lags_multi(df, orders, agg_func=<function mean>, iter_num=1, cols=None)[source]
Imputes every column of df using impute_with_lags.
- Parameters
df (pandas.DataFrame) – Input dataframe with the columns to impute.
orders (list of int) – The lag orders to be used for aggregation.
agg_func (callable, default np.mean) – pandas.Series -> float. An aggregation function to aggregate the chosen lags.
iter_num (int, default 1) – Maximum number of iterations to impute the series. Each iteration imputes the series using the provided lag orders (orders) and returns an imputed dataframe. A single iteration may leave some values missing; additional iterations can impute more of them.
cols (list [str] or None, default None) – Which columns to impute. If None, imputes all columns.
- Returns
impute_info – A dictionary with following items:
- ”df”
pandas.DataFrame
A dataframe with the imputed values.
- ”missing_info”dict
Dictionary with information about the missing info.
Key = name of a column in df.
Value = dictionary containing:
- ”initial_missing_num”int
Initial number of missing values.
- ”final_missing_num”int
Final number of missing values after imputation.
- Return type
dict
- greykite.common.features.adjust_anomalous_data.adjust_anomalous_data(df, time_col, value_col, anomaly_df, start_time_col='start_time', end_time_col='end_time', adjustment_delta_col=None, filter_by_dict=None, filter_by_value_col=None, adjustment_method='add')[source]
This function takes:
a time series, in the form of a dataframe: df
the anomaly information, in the form of a dataframe: anomaly_df.
It then adjusts the values of the time series based on the perceived impact of the anomalies given in the column adjustment_delta_col, and assigns np.nan if the impact is not given.
Note that anomaly_df can contain the anomaly information for many different timeseries. This is enabled by allowing multiple metrics and dimensions to be listed in the same anomaly dataframe. Columns can indicate the metric name and dimension value.
This function first subsets anomaly_df to the relevant rows for the value_col as specified by filter_by_dict, then makes the specified adjustments to df.
- Parameters
df (pandas.DataFrame) – A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas.DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
anomaly_df (pandas.DataFrame) –
A dataframe which includes the anomaly information for the input series (df) but potentially for multiple series and dimensions.
This dataframe must include these two columns:
start_time_col
end_time_col
and include adjustment_delta_col if it is not None in the function call.
Moreover, if dimensions are requested by passing the filter_by_dict argument (not None), all of this dictionary's keys must also appear in anomaly_df.
Here is an example:
    anomaly_df = pd.DataFrame({
        "start_time": ["1/1/2018", "1/4/2018", "1/8/2018", "1/10/2018"],
        "end_time": ["1/2/2018", "1/6/2018", "1/9/2018", "1/10/2018"],
        "adjustment_delta": [np.nan, 3, -5, np.nan],
        # extra columns for filtering
        "metric": ["y", "y", "z", "z"],
        "dimension1": ["level_1", "level_1", "level_2", "level_2"],
        "dimension2": ["level_1", "level_2", "level_1", "level_1"],
    })
In the above example,
“start_time” is the start date of the anomaly, which is provided using the argument start_time_col.
“end_time” is the end date of the anomaly, which is provided using the argument end_time_col.
“adjustment_delta” is the column which includes the delta if it is known. The name of this column is provided using the argument adjustment_delta_col. Use numpy.nan if the adjustment size is not known, and the adjusted value will be set to numpy.nan.
“metric”, “dimension1”, and “dimension2” are example columns for filtering. They contain the metric name and dimensions for which the anomaly is applicable. filter_by_dict is used to filter on these columns to get the relevant anomalies for the timeseries represented by df[value_col].
start_time_col (str, default START_TIME_COL) – The column name in anomaly_df representing the start timestamp of the anomalous period, inclusive. The format can be anything that can be parsed by pandas DatetimeIndex.
end_time_col (str, default END_TIME_COL) – The column name in anomaly_df representing the end timestamp of the anomalous period, inclusive. The format can be anything that can be parsed by pandas DatetimeIndex.
adjustment_delta_col (str or None, default None) –
The column name in anomaly_df for the impact delta of the anomalies on the values of the series.
If the value is available, it will be used to adjust the timeseries values in the given period by adding it to, or subtracting it from, the raw series values in that period. Whether to add or subtract is specified by adjustment_method. If the value for a row is “” or np.nan, the adjusted value is set to np.nan.
If adjustment_delta_col is None, all adjusted values are set to np.nan.
filter_by_dict (dict [str, any] or None, default None) –
A dictionary whose keys are column names of anomaly_df, and values are the desired value for that column (e.g. a string or int). If the value is an iterable (list, tuple, set), then it enumerates all allowed values for that column.
This dictionary is used to filter anomaly_df to the matching anomalies. This helps when anomaly_df includes the anomalies for various metrics and dimensions, so matching is needed to get the relevant anomalies for df.
Columns in anomaly_df can contain information on metric name, metric dimension (e.g. mobile/desktop), issue severity, etc. for filtering.
filter_by_value_col (str or None, default None) –
If provided, {filter_by_value_col: value_col} is added to filter_by_dict for filtering. This filters anomaly_df to rows where anomaly_df[filter_by_value_col] == value_col.
If value_col is the metric name, this is a convenient way to find anomalies matching the metric name.
adjustment_method (str (“add” or “subtract”), default “add”) –
How the adjustment in anomaly_df should be used to adjust the value in df.
If “add”, the value in adjustment_delta_col is added to the original value.
If “subtract”, it is subtracted from the original value.
- Returns
Result – A dictionary with the following items (specified by key):
- “adjusted_df”: pandas.DataFrame
A dataframe identical to the input dataframe df, but with value_col updated to the adjusted values.
- “augmented_df”: pandas.DataFrame
A dataframe identical to the input dataframe df, with two extra columns:
ANOMALY_COL: Anomaly labels for the time series. 1 and 0 indicate anomalous and non-anomalous points, respectively.
f"adjusted_{value_col}": Adjusted values. value_col retains the original values. This is useful to inspect which values have changed.
- Return type
dict
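The adjustment rule described above (inclusive anomaly periods, a known delta added or subtracted, an unknown delta producing a missing value) can be sketched in plain Python. This is an illustrative simplification, not the library's code: the function name adjust_sketch, the dict-of-dates series, the tuple layout of each anomaly, and None standing in for np.nan are all assumptions.

```python
from datetime import date


def adjust_sketch(series, anomalies, adjustment_method="add"):
    """Applies anomaly adjustments to a {date: value} series.

    Hypothetical helper. Each anomaly is (start, end, delta); a delta of None
    means the impact is unknown, so the adjusted value becomes None.
    """
    if adjustment_method not in ("add", "subtract"):
        raise ValueError("adjustment_method must be 'add' or 'subtract'")
    sign = 1.0 if adjustment_method == "add" else -1.0
    adjusted = dict(series)
    for start, end, delta in anomalies:
        for day, value in series.items():
            if start <= day <= end:  # both endpoints are inclusive
                adjusted[day] = None if delta is None else value + sign * delta
    return adjusted


series = {date(2018, 1, d): float(9 + d) for d in range(1, 6)}  # 10.0 .. 14.0
anomalies = [
    (date(2018, 1, 1), date(2018, 1, 2), None),  # unknown impact
    (date(2018, 1, 4), date(2018, 1, 4), 3.0),   # known delta
]
added = adjust_sketch(series, anomalies, adjustment_method="add")
subtracted = adjust_sketch(series, anomalies, adjustment_method="subtract")
```

Points outside every anomaly period keep their original values, which matches the role of the “augmented_df” output: comparing original and adjusted values shows exactly which points changed.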
- greykite.common.evaluation.r2_null_model_score(y_true, y_pred, y_pred_null=None, y_train=None, loss_func=<function mean_squared_error>)[source]
Calculates improvement in the loss function compared to the predictions of a null model. Can be used to evaluate model quality with respect to a simple baseline model.
The score is defined as:
R2_null_model_score = 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)
- Parameters
y_true (list [float] or numpy.array) – Observed response (usually on a test set).
y_pred (list [float] or numpy.array) – Model predictions (usually on a test set).
y_pred_null (list [float] or numpy.array or None) – A baseline prediction model to compare against. If None, derived from y_train or y_true.
y_train (list [float] or numpy.array or None) – Response values in the training data. If y_pred_null is None, then y_pred_null is set to the mean of y_train. If y_train is also None, then y_pred_null is set to the mean of y_true.
loss_func (callable, default sklearn.metrics.mean_squared_error) – The error loss function with signature (true_values, predicted_values).
- Returns
r2_null_model – A value within (-infty, 1.0]. Higher scores are better. Can be interpreted as the improvement in the loss function compared to the predictions of the null model. For example, a score of 0.74 means the loss is 74% lower than for the null model.
- Return type
float
Notes
There is a connection between R2_null_model_score and R2. R2_null_model_score can be interpreted as the additional improvement in the coefficient of determination (i.e. R2, see sklearn.metrics.r2_score) with respect to a null model.
Under the default settings of this function, where loss_func is mean squared error and y_pred_null is the average of y_true, the scores are equivalent:
    # simplified definition of R2_score, where SSE is sum of squared error
    y_true_avg = np.repeat(np.average(y_true), y_true.shape[0])
    R2_score := 1.0 - SSE(y_true, y_pred) / SSE(y_true, y_true_avg)
    R2_score := 1.0 - MSE(y_true, y_pred) / VAR(y_true)  # equivalent definition
    r2_null_model_score(y_true, y_pred) == r2_score(y_true, y_pred)
r2_score is 0 if simply predicting the mean (y_pred = y_true_avg).
If y_pred_null is passed, and if loss_func is mean squared error and y_true has nonzero variance, this function measures how much “r2_score of the predictions (y_pred)” closes the gap between “r2_score of the null model (y_pred_null)” and the “r2_score of the best possible model (y_true)”, which is 1.0:
    R2_pred = r2_score(y_true, y_pred)       # R2 of predictions
    R2_null = r2_score(y_true, y_pred_null)  # R2 of null model
    r2_null_model_score(y_true, y_pred, y_pred_null) == (R2_pred - R2_null) / (1.0 - R2_null)
When y_pred_null=y_true_avg, R2_null is 0 and this reduces to the formula above.
Summary (for loss_func=mean_squared_error):
If R2_null>0 (good null model), then R2_null_model_score < R2_score
If R2_null=0 (uninformative null model), then R2_null_model_score = R2_score
If R2_null<0 (poor null model), then R2_null_model_score > R2_score
For other loss functions, r2_null_model_score has the same connection to pseudo R2.
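The relationship between the null-model score and R2 can be checked numerically with a small pure-Python sketch. The functions below (mse, r2_score_sketch, r2_null_model_score_sketch) are illustrative re-implementations of the formulas stated above, written for this example; they are not the library's or scikit-learn's code.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / total sum of squares."""
    avg = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - avg) ** 2 for t in y_true)
    return 1.0 - sse / sst


def r2_null_model_score_sketch(y_true, y_pred, y_pred_null=None, y_train=None, loss_func=mse):
    """1 - loss(y_true, y_pred) / loss(y_true, y_pred_null)."""
    if y_pred_null is None:
        # Fall back to the mean of y_train, or of y_true if y_train is missing.
        reference = y_train if y_train is not None else y_true
        y_pred_null = [sum(reference) / len(reference)] * len(y_true)
    return 1.0 - loss_func(y_true, y_pred) / loss_func(y_true, y_pred_null)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
y_pred_null = [2.0, 2.0, 3.0, 3.0]

score = r2_null_model_score_sketch(y_true, y_pred, y_pred_null)
r2_pred = r2_score_sketch(y_true, y_pred)
r2_null = r2_score_sketch(y_true, y_pred_null)
gap_closed = (r2_pred - r2_null) / (1.0 - r2_null)
```

In this example the null model is already informative (R2_null is positive), so the null-model score comes out lower than the plain R2 of the predictions, in line with the summary above.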
- greykite.common.evaluation.mean_interval_score(observed, lower, upper, coverage)[source]
Calculates the mean interval score. If an observed value falls within the interval, the score is simply the width of the interval. If an observed value falls outside the interval, the score is the width of the interval plus an error term proportional to the distance between the actual and its closest interval boundary. The proportionality constant is 2.0 / (1.0 - coverage). See Strictly Proper Scoring Rules, Prediction, and Estimation, Tilmann Gneiting and Adrian E. Raftery, 2007, Journal of the American Statistical Association, Volume 102, Issue 477.
- Parameters
observed (pandas.Series or numpy.array) – Numeric, observed values.
lower (pandas.Series or numpy.array) – Numeric, lower bound.
upper (pandas.Series or numpy.array) – Numeric, upper bound.
coverage (float) – Intended coverage of the prediction bands (0.0 to 1.0).
- Returns
mean_interval_score – The mean interval score.
- Return type
float
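As a worked illustration of the definition above, the following pure-Python sketch computes the interval score point by point and averages it. The name mean_interval_score_sketch and the use of plain lists are assumptions made for this example; it is a direct transcription of the stated rule rather than the library's implementation.

```python
def mean_interval_score_sketch(observed, lower, upper, coverage):
    """Average of: interval width + penalty * distance outside the interval."""
    penalty = 2.0 / (1.0 - coverage)  # proportionality constant from the text
    scores = []
    for y, lo, hi in zip(observed, lower, upper):
        score = hi - lo  # width of the interval
        if y < lo:
            score += penalty * (lo - y)  # observed value below the lower bound
        elif y > hi:
            score += penalty * (y - hi)  # observed value above the upper bound
        scores.append(score)
    return sum(scores) / len(scores)


# One value inside the [4, 8] interval, one above it, one below it.
result = mean_interval_score_sketch(
    observed=[5.0, 12.0, 0.0],
    lower=[4.0, 4.0, 4.0],
    upper=[8.0, 8.0, 8.0],
    coverage=0.5,
)
```

With coverage 0.5 the penalty constant is 4.0, so the three per-point scores are 4, 20 and 20. Higher intended coverage makes the constant larger, so misses are penalized more heavily for intervals that claim to be more reliable.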
- greykite.framework.pipeline.utils.get_score_func_with_aggregation(score_func, greater_is_better=None, agg_periods=None, agg_func=None, relative_error_tolerance=None)[source]
Returns a score function that pre-aggregates inputs according to agg_func, and filters out invalid true values before evaluation. This allows fitting the model at a granular level, yet evaluating at a coarser level.
Also returns the proper direction and short name for the score function.
- Parameters
score_func (str or callable) – If callable, a function that maps two arrays to a number: (true, predicted) -> score.
greater_is_better (bool or None, default None) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. Model is fit at original frequency, and forecast is aggregated according to agg_periods. E.g. fit model on hourly data, and evaluate performance at daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. Required if score_func is FRACTION_OUTSIDE_TOLERANCE.
- Returns
score_func (callable) – Scorer with pre-aggregation function and filter.
greater_is_better (bool) – Whether greater is better for the scorer. Uses the provided greater_is_better if the provided score_func is a callable. Otherwise, looks up the direction.
short_name (str) – Canonical short name for the score_func.
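The idea of evaluating at a coarser level than the model is fit can be sketched as a wrapper that aggregates both arrays before scoring. This is a simplified illustration under stated assumptions: the names with_aggregation and mean_absolute_error are hypothetical, the filtering of invalid true values is omitted, and dropping an incomplete leading block is a choice made for this sketch, not a documented behavior of the library.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain mean absolute error, used here as the example score function."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def with_aggregation(score_func, agg_periods=None, agg_func=sum):
    """Wraps score_func so inputs are aggregated in blocks of agg_periods first."""
    def aggregate(values):
        values = list(values)
        if agg_periods is None:
            return values  # no aggregation requested
        # Assumption of this sketch: drop an incomplete block at the start.
        start = len(values) % agg_periods
        return [agg_func(values[i:i + agg_periods])
                for i in range(start, len(values), agg_periods)]

    def aggregated_score(y_true, y_pred):
        return score_func(aggregate(y_true), aggregate(y_pred))

    return aggregated_score


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 1.0, 4.0, 3.0]
granular_error = mean_absolute_error(y_true, y_pred)
coarse_error = with_aggregation(mean_absolute_error, agg_periods=2, agg_func=sum)(y_true, y_pred)
```

Here every individual prediction is off by 1, yet the errors cancel within each block of two periods, so the aggregated error is 0. This is the sense in which a forecast can be poor at the hourly level but accurate at the daily level.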
- greykite.framework.pipeline.utils.get_hyperparameter_searcher(hyperparameter_grid, model, cv=None, hyperparameter_budget=None, n_jobs=1, verbose=1, **kwargs) RandomizedSearchCV [source]
Returns a RandomizedSearchCV object for hyperparameter tuning via cross validation.
sklearn.model_selection.RandomizedSearchCV runs a full grid search if hyperparameter_budget is sufficient to exhaust the full hyperparameter_grid, otherwise it samples uniformly at random from the space.
- Parameters
hyperparameter_grid (dict or list [dict]) –
Dictionary with parameters names (string) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). Lists of parameters are sampled uniformly.
May also be a list of such dictionaries to avoid undesired combinations of parameters. Passed as param_distributions to sklearn.model_selection.RandomizedSearchCV, see docs for more info.
model (estimator object) – An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface.
cv (int, cross-validation generator, iterable, or None, default None) – Determines the cross-validation splitting strategy. See sklearn.model_selection.RandomizedSearchCV.
hyperparameter_budget (int or None, default None) –
Maximum number of hyperparameter sets to try within the hyperparameter_grid search space. If None, uses defaults:
exhaustive grid search if all values are constant
10 if any value is a distribution to sample from
n_jobs (int or None, default 1) – Number of jobs to run in parallel (the maximum number of concurrently running workers). -1 uses all CPUs. -2 uses all CPUs but one. None is treated as 1 unless in a joblib.Parallel backend context that specifies otherwise.
verbose (int, default 1) –
Verbosity level during CV.
if > 0, prints number of fits
if > 1, prints fit parameters, total score + fit time
if > 2, prints train/test scores
kwargs (additional parameters) –
Keyword arguments to pass to get_scoring_and_refit. Accepts the following parameters:
"score_func"
"score_func_greater_is_better"
"cv_report_metrics"
"agg_periods"
"agg_func"
"relative_error_tolerance"
- Returns
grid_search – Object that can run randomized search on hyper parameters.
- Return type
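The default rule for hyperparameter_budget stated above (exhaustive search when every value is a list, 10 samples when any value is a distribution) can be sketched as follows. The function default_budget_sketch is hypothetical and only mirrors the documented rule; detecting a distribution by the presence of an rvs method follows the description of hyperparameter_grid, and how the library computes the budget internally is not shown in this document.

```python
def default_budget_sketch(hyperparameter_grid):
    """Returns the number of parameter sets to try when no budget is given.

    Hypothetical helper: counts all combinations if every value is a list,
    and returns 10 if any value is a distribution (has an ``rvs`` method).
    """
    grids = hyperparameter_grid if isinstance(hyperparameter_grid, list) else [hyperparameter_grid]
    total = 0
    for grid in grids:
        combinations = 1
        for candidates in grid.values():
            if hasattr(candidates, "rvs"):
                return 10  # sampling from a distribution: fixed default budget
            combinations *= len(candidates)
        total += combinations  # each dictionary in the list is its own grid
    return total


single_grid = {"fit_algorithm": ["ridge", "linear"], "degree": [1, 2, 3]}
split_grids = [{"degree": [1, 2]}, {"fit_algorithm": ["ridge", "linear", "lasso"]}]
```

Passing a list of dictionaries keeps the search small when only some combinations make sense: the two grids above contribute 2 + 3 candidates, rather than the 6 a single combined grid would produce. The parameter names in these grids are placeholders.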
- greykite.framework.pipeline.utils.get_scoring_and_refit(score_func='MeanAbsolutePercentError', score_func_greater_is_better=False, cv_report_metrics=None, agg_periods=None, agg_func=None, relative_error_tolerance=None)[source]
Provides scoring and refit parameters for RandomizedSearchCV.
Together, scoring and refit specify which metrics to evaluate and how to evaluate the predictions on the test set to identify the optimal model.
Notes
Sets greater_is_better=True in scoring for all metrics to report them with their original sign, and properly accounts for this in refit to extract the best index.
Pass both scoring and refit to RandomizedSearchCV.
- Parameters
score_func (str or callable, default EvaluationMetricEnum.MeanAbsolutePercentError.name) – Score function used to select optimal model in CV. If a callable, takes arrays y_true, y_pred and returns a float. If a string, must be either an EvaluationMetricEnum member name or FRACTION_OUTSIDE_TOLERANCE.
score_func_greater_is_better (bool, default False) – True if score_func is a score function, meaning higher is better, and False if it is a loss function, meaning lower is better. Must be provided if score_func is a callable (custom function). Ignored if score_func is a string, because the direction is known.
cv_report_metrics (CV_REPORT_METRICS_ALL, or list [str], or None, default None) –
Additional metrics to compute during CV, besides the one specified by score_func.
If the string constant greykite.common.constants.CV_REPORT_METRICS_ALL, computes all metrics in EvaluationMetricEnum. Also computes FRACTION_OUTSIDE_TOLERANCE if relative_error_tolerance is not None. The results are reported by the short name (.get_metric_name()) for EvaluationMetricEnum members and FRACTION_OUTSIDE_TOLERANCE_NAME for FRACTION_OUTSIDE_TOLERANCE.
If a list of strings, each of the listed metrics is computed. Valid strings are greykite.common.evaluation.EvaluationMetricEnum member names and FRACTION_OUTSIDE_TOLERANCE.
For example:
["MeanSquaredError", "MeanAbsoluteError", "MeanAbsolutePercentError", "MedianAbsolutePercentError", "FractionOutsideTolerance2"]
If None, no additional metrics are computed.
agg_periods (int or None, default None) – Number of periods to aggregate before evaluation. Model is fit at original frequency, and forecast is aggregated according to agg_periods. E.g. fit model on hourly data, and evaluate performance at daily level. If None, does not apply aggregation.
agg_func (callable or None, default None) – Takes an array and returns a number, e.g. np.max, np.sum. Used to aggregate data prior to evaluation (applied to actual and predicted). Ignored if agg_periods is None.
relative_error_tolerance (float or None, default None) – Threshold to compute the FRACTION_OUTSIDE_TOLERANCE metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.
- Returns
scoring (dict) – A dictionary of metrics to evaluate for each CV split. The key is the metric name, the value is an instance of evaluation_PredictScorerDF generated by make_scorer_df.
The value has a score method that takes actual and predicted values and returns a single number.
There is one item in the dictionary for score_func and an additional item for each additional element in cv_report_metrics.
The key for score_func if it is a callable is CUSTOM_SCORE_FUNC_NAME.
The key for EvaluationMetricEnum member name is the short name from .get_metric_name().
The key for FRACTION_OUTSIDE_TOLERANCE is FRACTION_OUTSIDE_TOLERANCE_NAME.
See RandomizedSearchCV.
refit (callable) – Callable that takes cv_results_ from grid search and returns the best index.
See RandomizedSearchCV.
- greykite.framework.pipeline.utils.get_best_index(results, metric='score', greater_is_better=False)[source]
Suitable for use as the refit parameter to RandomizedSearchCV, after wrapping with functools.partial.
Callable that takes cv_results_ from grid search and returns the best index.
- Parameters
results (dict [str, numpy.array]) – Results from CV grid search. See RandomizedSearchCV cv_results_ attribute for the format.
metric (str, default “score”) – Which metric to use to select the best parameters. In single metric evaluation, the metric name should be “score”. For multi-metric evaluation, the scoring parameter to RandomizedSearchCV is a dictionary, and metric must be a key of scoring.
greater_is_better (bool, default False) – If True, selects the parameters with highest test values for metric. Otherwise, selects those with the lowest test values for metric.
- Returns
best_index – Best index to use for refitting the model.
- Return type
int
Examples
>>> from functools import partial
>>> from sklearn.model_selection import RandomizedSearchCV
>>> refit = partial(get_best_index, metric="score", greater_is_better=False)
>>> # RandomizedSearchCV(..., refit=refit)
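To make the selection rule concrete, here is a minimal sketch of a refit-style callable operating on a hand-written results dictionary. It assumes the scikit-learn convention that cv_results_ stores mean test values under the key mean_test_{metric}; the function get_best_index_sketch and the example metric names "MAPE" and "CORR" are hypothetical, and whether the library selects on exactly this key is an assumption of the sketch.

```python
from functools import partial


def get_best_index_sketch(results, metric="score", greater_is_better=False):
    """Returns the index of the best parameter setting for the given metric."""
    values = list(results[f"mean_test_{metric}"])
    best = max(values) if greater_is_better else min(values)
    return values.index(best)  # first index attaining the best value


# Hand-written stand-in for RandomizedSearchCV.cv_results_ with two metrics.
results = {
    "mean_test_MAPE": [0.12, 0.08, 0.10],  # a loss: lower is better
    "mean_test_CORR": [0.95, 0.90, 0.80],  # a score: higher is better
}

refit_on_mape = partial(get_best_index_sketch, metric="MAPE", greater_is_better=False)
refit_on_corr = partial(get_best_index_sketch, metric="CORR", greater_is_better=True)
best_by_mape = refit_on_mape(results)
best_by_corr = refit_on_corr(results)
```

Wrapping with functools.partial fixes the metric and direction ahead of time, leaving a callable of one argument, which is the shape RandomizedSearchCV expects for a callable refit.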
- greykite.framework.pipeline.utils.get_forecast(df, trained_model: Pipeline, train_end_date=None, test_start_date=None, forecast_horizon=None, xlabel='ts', ylabel='y', relative_error_tolerance=None) UnivariateForecast [source]
Runs model predictions on df and creates a UnivariateForecast object.
- Parameters
df (pandas.DataFrame) – Has columns cst.TIME_COL, cst.VALUE_COL, to forecast.
trained_model (sklearn.pipeline) – A fitted Pipeline with estimator step and predict function.
train_end_date (datetime.datetime, default None) – Train end date. Passed to UnivariateForecast.
test_start_date (datetime.datetime, default None) – Test start date. Passed to UnivariateForecast.
forecast_horizon (int or None, default None) – Number of periods forecasted into the future. Must be > 0. Passed to UnivariateForecast.
xlabel (str) – Time column to use in representing forecast (e.g. x-axis in plots).
ylabel (str) – Value column to use in representing forecast (e.g. y-axis in plots).
relative_error_tolerance (float or None, default None) – Threshold to compute the Outside Tolerance metric, defined as the fraction of forecasted values whose relative error is strictly greater than relative_error_tolerance. For example, 0.05 allows for 5% relative error. If None, the metric is not computed.
- Returns
univariate_forecast – Forecasts represented as a UnivariateForecast object.
- Return type
- greykite.framework.templates.pickle_utils.dump_obj(obj, dir_name, obj_name='obj', dump_design_info=True, overwrite_exist_dir=False, top_level=True)[source]
Uses DFS to recursively dump an object to pickle files. Originally intended for dumping the ForecastResult instance, but could potentially be used for other objects.
For each object, if it is picklable, a file named {object_name}.pkl is generated. Otherwise, depending on its type, a {object_name}.type file is generated to store its type, and a folder named {object_name} is generated to store each of its elements/attributes.
For example, if the folder to store results is forecast_result, the items in the folder could be:
timeseries.pkl: a picklable item.
model.type: model is not picklable, this file includes the class (Pipeline)
model: this folder includes the elements in model.
forecast.type: forecast is not picklable, this file includes the class (UnivariateForecast)
forecast: this folder includes the elements in forecast.
backtest.type: backtest is not picklable, this file includes the class (UnivariateForecast)
backtest: this folder includes the elements in backtest.
grid_search.type: grid_search is not picklable, this file includes the class (GridSearchCV)
grid_search: this folder includes the elements in grid_search.
The items in each subfolder follow the same rule.
The currently supported recursion types are:
list/tuple: type name is “list” or “tuple”, each element is attempted to be pickled independently if the entire list/tuple is not picklable. The order is preserved.
OrderedDict: type name is “ordered_dict”, each key and value are attempted to be pickled independently if the entire dict is not picklable. The order is preserved.
dict: type name is “dict”, each key and value are attempted to be pickled independently if the entire dict is not picklable. The order is not preserved.
class instance: type name is the class object, used to create new instance. Each attribute is attempted to be pickled independently if the entire instance is not picklable.
- Parameters
obj (object) – The object to be pickled.
dir_name (str) – The directory to store the pickled results.
obj_name (str, default “obj”) – The name for the pickled items. Applies to the top level object only when recursion is used.
dump_design_info (bool, default True) –
Whether to dump the design info in ForecastResult. The design info is specifically for Silverkite and can be accessed from
ForecastResult.model[-1].model_dict[“x_design_info”]
ForecastResult.forecast.estimator.model_dict[“x_design_info”]
ForecastResult.backtest.estimator.model_dict[“x_design_info”]
The design info is a class from patsy and contains a significant number of instances that cannot be pickled directly. Recursively pickling them takes longer to run. If speed is important and you don’t need this information, you can turn it off.
overwrite_exist_dir (bool, default False) – If True and the directory in dir_name already exists, the existing directory will be removed. If False and the directory in dir_name already exists, an exception will be raised.
top_level (bool, default True) – Whether this is the initial call (applied to the root object you want to pickle) rather than a recursive call. When you use this function to dump an object, this parameter should always be True. Only the top-level call checks whether the directory exists, because subsequent recursive calls may write files to the same directory and skip that check. Setting this parameter to False may cause problems.
- Return type
The function writes files to local directory and does not return anything.
- greykite.framework.templates.pickle_utils.load_obj(dir_name, obj=None, load_design_info=True)[source]
Loads the files pickled by dump_obj. Originally intended for loading the ForecastResult instance, but could potentially be used for other objects.
- Parameters
dir_name (str) – The directory that stores the pickled files. Must be the top level dir when having nested pickling results.
obj (object, default None) – The object type for the next-level files. Can be one of “list”, “tuple”, “dict”, “ordered_dict” or a class.
load_design_info (bool, default True) – Whether to load the design info in ForecastResult. The design info is specific to Silverkite and can be accessed from:
ForecastResult.model[-1].model_dict["x_design_info"]
ForecastResult.forecast.estimator.model_dict["x_design_info"]
ForecastResult.backtest.estimator.model_dict["x_design_info"]
The design info is a class from patsy and contains a large number of instances that cannot be pickled directly. Recursively loading them takes longer to run. If speed is important and you don’t need this information, you can turn it off.
- Returns
result – The loaded object from the pickled files.
- Return type
object
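The fallback described for dump_obj and load_obj (pickle the whole object if possible, otherwise pickle each element independently) can be illustrated with a small in-memory sketch. This is not Greykite's implementation: the helper names dump_recursive and load_recursive are hypothetical, nothing is written to disk, and class instances and OrderedDict are left out for brevity.

```python
import pickle


def dump_recursive(obj):
    """Pickle ``obj`` whole if possible, otherwise fall back to its parts."""
    try:
        return ("pickled", pickle.dumps(obj))
    except (pickle.PicklingError, AttributeError, TypeError):
        pass  # fall through to element-wise pickling
    if isinstance(obj, (list, tuple)):
        kind = "list" if isinstance(obj, list) else "tuple"
        return (kind, [dump_recursive(item) for item in obj])
    if isinstance(obj, dict):
        pairs = [(dump_recursive(k), dump_recursive(v)) for k, v in obj.items()]
        return ("dict", pairs)
    # Assumption of this sketch: anything else is recorded as unrecoverable.
    return ("unpicklable", None)


def load_recursive(node):
    """Rebuild the object described by ``dump_recursive``."""
    kind, payload = node
    if kind == "pickled":
        return pickle.loads(payload)
    if kind == "list":
        return [load_recursive(item) for item in payload]
    if kind == "tuple":
        return tuple(load_recursive(item) for item in payload)
    if kind == "dict":
        return {load_recursive(k): load_recursive(v) for k, v in payload}
    return None  # placeholder for parts that could not be pickled


data = [1, {"a": (2, 3)}, lambda x: x]  # the lambda cannot be pickled
restored = load_recursive(dump_recursive(data))
```

The type name stored next to each payload plays the role of the obj argument of load_obj: it tells the loader which container to rebuild at each level.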
- class greykite.common.data_loader.DataLoader[source]
Returns datasets included in the library in pandas.DataFrame format.
- available_datasets
The names of the available datasets.
- Type
list [str]
- static get_data_home(data_dir=None, data_sub_dir=None)[source]
Returns the folder path data_dir/data_sub_dir. If data_dir is None, returns the internal data directory. By default the Greykite data dir is set to a folder named ‘data’ in the project source code. Alternatively, it can be set programmatically by giving an explicit folder path.
- Parameters
data_dir (str or None, default None) – The path to the input data directory.
data_sub_dir (str or None, default None) – The name of the input data sub directory. It is appended to the end of data_dir. If None, the data_dir path is unchanged.
- Returns
data_home – Path to the data folder.
- Return type
str
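The path resolution described above can be sketched as follows. This is a minimal illustration, not the library's code: resolve_data_home and DEFAULT_DATA_DIR are hypothetical names, and the default directory shown is only a placeholder for the internal data folder.

```python
import os

# Hypothetical stand-in for the library's internal data directory.
DEFAULT_DATA_DIR = os.path.join("greykite", "data")


def resolve_data_home(data_dir=None, data_sub_dir=None):
    """Return data_dir/data_sub_dir, falling back to the default directory."""
    base = DEFAULT_DATA_DIR if data_dir is None else data_dir
    if data_sub_dir is None:
        return base  # path is unchanged when no sub directory is given
    return os.path.join(base, data_sub_dir)
```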
- static get_data_names(data_path)[source]
Returns the names of the .csv and .csv.xz files in data_path.
- Parameters
data_path (str) – Path to the data folder.
- Returns
file_names – The names of the .csv and .csv.xz files in data_path.
- Return type
list [str]
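As a rough sketch of what listing the .csv and .csv.xz files in a folder involves, the standard library is enough. The function name list_csv_names is hypothetical, and stripping the extension from each name is an assumption of this sketch rather than documented behavior.

```python
import os
import tempfile


def list_csv_names(data_path):
    """Return sorted base names of .csv and .csv.xz files in data_path."""
    names = []
    for file_name in sorted(os.listdir(data_path)):
        for ext in (".csv.xz", ".csv"):
            if file_name.endswith(ext):
                # Assumption: the extension is stripped from the name.
                names.append(file_name[: -len(ext)])
                break
    return names


with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ("daily_a.csv", "hourly_b.csv.xz", "notes.txt"):
        open(os.path.join(tmp_dir, file_name), "w").close()
    found = list_csv_names(tmp_dir)
```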
- static get_aggregated_data(df, agg_freq=None, agg_func=None)[source]
Returns aggregated data.
- Parameters
df (pandas.DataFrame) – The input data must have the TIME_COL (“ts”) column and the columns in the keys of agg_func.
agg_freq (str or None, default None) – If None, data will not be aggregated and will include all columns. Possible values: “hourly”, “daily”, “weekly”, or “monthly”.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – The aggregated dataframe.
- Return type
pandas.DataFrame
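To make the roles of agg_freq and agg_func concrete, here is a dependency-free sketch of daily aggregation over rows that carry a “ts” timestamp. Greykite itself operates on pandas.DataFrame; the function aggregate_daily, the AGG_FUNCS table, and the sample rows are all hypothetical and only illustrate the idea of one aggregating function per column.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

AGG_FUNCS = {"sum": sum, "mean": mean, "median": median, "max": max, "min": min}


def aggregate_daily(rows, agg_func):
    """Aggregate rows (dicts with a "ts" datetime) to daily frequency."""
    groups = defaultdict(list)
    for row in rows:
        day = row["ts"].replace(hour=0, minute=0, second=0, microsecond=0)
        groups[day].append(row)
    result = []
    for day in sorted(groups):
        out = {"ts": day}
        # Only the columns named in agg_func are kept, one function per column.
        for col, func_name in agg_func.items():
            out[col] = AGG_FUNCS[func_name]([r[col] for r in groups[day]])
        result.append(out)
    return result


rows = [
    {"ts": datetime(2020, 1, 1, 0), "count": 4, "tmax": 10.0},
    {"ts": datetime(2020, 1, 1, 1), "count": 6, "tmax": 12.0},
    {"ts": datetime(2020, 1, 2, 0), "count": 5, "tmax": 9.0},
]
daily = aggregate_daily(rows, {"count": "sum", "tmax": "mean"})
```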
- get_data_inventory()[source]
Returns the names of the available internal datasets.
- Returns
file_names – The names of the available internal datasets.
- Return type
list [str]
- get_df(data_path, data_name)[source]
Returns a pandas.DataFrame containing the dataset from data_path/data_name. The input data must be in .csv or .csv.xz format. Raises a ValueError if the specified input file is not found.
- Parameters
data_path (str) – Path to the data folder.
data_name (str) – Name of the csv file to be loaded from. For example ‘peyton_manning’.
- Returns
df – Input dataset.
- Return type
pandas.DataFrame
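The behavior described for get_df (accept either .csv or .csv.xz, raise ValueError when neither exists) can be sketched with the standard csv and lzma modules. The function read_rows is hypothetical and returns a list of dicts rather than a pandas.DataFrame; the lookup order between the two extensions is an assumption.

```python
import csv
import lzma
import os
import tempfile


def read_rows(data_path, data_name):
    """Read data_name.csv or data_name.csv.xz from data_path as a list of dicts."""
    plain = os.path.join(data_path, data_name + ".csv")
    packed = plain + ".xz"
    if os.path.exists(plain):
        handle = open(plain, "rt", newline="")
    elif os.path.exists(packed):
        handle = lzma.open(packed, "rt", newline="")
    else:
        raise ValueError(f"Dataset {data_name!r} not found in {data_path!r}.")
    with handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny compressed file so the example is self-contained.
    with lzma.open(os.path.join(tmp_dir, "demo.csv.xz"), "wt", newline="") as f:
        f.write("ts,y\n2020-01-01,1.5\n")
    rows = read_rows(tmp_dir, "demo")
    try:
        read_rows(tmp_dir, "missing")
        missing_error = None
    except ValueError as err:
        missing_error = str(err)
```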
- load_peyton_manning()[source]
Loads the Daily Peyton Manning dataset.
This dataset contains the log of daily page views for the Wikipedia page for Peyton Manning. It is one of the primary datasets used in demonstrations of the Facebook Prophet algorithm. Source: https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
Below is the dataset attribute information:
ts : date of the page view
y : log of the number of page views
- Returns
df – Has the following columns:
“ts” : date of the page view.
“y” : log of the number of page views.
- Return type
pandas.DataFrame
object with Peyton Manning data.
- load_parking(system_code_number=None)[source]
Loads the Hourly Parking dataset.
This dataset contains occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19 from car parks in Birmingham that are operated by NCP from Birmingham City Council. Source: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham (UK Open Government Licence, OGL).
Below is the dataset attribute information:
SystemCodeNumber : car park ID
Capacity : car park capacity
Occupancy : car park occupancy rate
LastUpdated : date and time of the measure
- Parameters
system_code_number (str or None, default None) – If None, the occupancy rate is averaged across all SystemCodeNumber values. Otherwise, only the occupancy rate of the given system_code_number is returned.
- Returns
df – Has the following columns:
“LastUpdated” : time, rounded to the nearest half hour.
“Capacity” : car park capacity
“Occupancy” : car park occupancy rate
“OccupancyRatio” : Occupancy divided by Capacity.
- Return type
pandas.DataFrame
object with Parking data.
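The two derived quantities in the returned columns, the half-hour rounding of “LastUpdated” and the “OccupancyRatio”, can be illustrated with a short sketch. The helper names and the sample record are made up for illustration, and the handling of exact ties (a reading taken precisely 15 minutes after a half-hour mark) is an assumption.

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)


def round_to_half_hour(ts):
    """Round a datetime to the nearest half hour within its day."""
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Assumption: exact ties follow Python's round() (round half to even).
    steps = round((ts - day_start) / HALF_HOUR)
    return day_start + steps * HALF_HOUR


def occupancy_ratio(occupancy, capacity):
    """OccupancyRatio is Occupancy divided by Capacity."""
    return occupancy / capacity


# Illustrative record, not taken from the dataset.
record = {"LastUpdated": datetime(2016, 10, 4, 7, 59, 42), "Capacity": 500, "Occupancy": 125}
rounded = round_to_half_hour(record["LastUpdated"])
ratio = occupancy_ratio(record["Occupancy"], record["Capacity"])
```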
- load_bikesharing(agg_freq=None, agg_func=None)[source]
Loads the Hourly Bike Sharing Count dataset with possible aggregations.
This dataset contains the aggregated hourly count of rented bikes. The data also includes weather data: maximum daily temperature (tmax), minimum daily temperature (tmin), and precipitation (pn). The raw bike-sharing data is provided by Capital Bikeshare. Source: https://www.capitalbikeshare.com/system-data The raw weather data (Baltimore-Washington INTL Airport) is from https://www.ncdc.noaa.gov/data-access/land-based-station-data
Below is the dataset attribute information:
ts : hour and date
count : number of shared bikes
tmin : minimum daily temperature
tmax : maximum daily temperature
pn : precipitation
- Parameters
Refer to the input of function get_aggregated_data.
- Returns
df – If no freq was specified, the returned data has the following columns:
“date” : day of year
“ts” : hourly timestamp
“count” : number of rented bikes across Washington DC.
“tmin” : minimum daily temperature
“tmax” : maximum daily temperature
“pn” : precipitation
Otherwise, only the agg_col column is returned.
- Return type
pandas.DataFrame
with bikesharing data.
- load_solarpower(agg_freq=None, agg_func=None)[source]
Loads the Hourly Solar Power dataset.
This dataset contains the solar power production of an Australian wind farm from August 2019 to July 2020, originally recorded at a 4-second frequency. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656027#.YrpHbuzMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : solar power production in MW (megawatt)
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“y” : solar power production in MW (megawatt)
- Return type
pandas.DataFrame
object with Solar Power data.
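The preprocessing mentioned above, aggregating 4-second readings to an hourly series and removing incomplete hours, might look like the following sketch. The source does not say how the hourly value was computed or how completeness was decided, so both the mean and the "all 900 readings present" rule are assumptions, and to_hourly is a hypothetical helper run on synthetic readings.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SAMPLES_PER_HOUR = 3600 // 4  # 900 readings in a complete hour at 4-second spacing


def to_hourly(readings):
    """readings: iterable of (timestamp, value) pairs; returns {hour_start: mean}."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    # Assumption: an hour is "complete" when all 900 readings are present,
    # and the hourly value is the mean of its readings.
    return {
        hour: sum(values) / len(values)
        for hour, values in sorted(buckets.items())
        if len(values) == SAMPLES_PER_HOUR
    }


start = datetime(2019, 8, 1, 0, 0, 0)
# 1.5 hours of synthetic readings: the second hour is incomplete and is dropped.
readings = [(start + timedelta(seconds=4 * i), 2.0) for i in range(1350)]
hourly = to_hourly(readings)
```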
- load_windpower(agg_freq=None, agg_func=None)[source]
Loads the Hourly Wind Power dataset.
This dataset contains the wind power production of an Australian wind farm from August 2019 to July 2020, originally recorded at a 4-second frequency. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656032#.YrpJTezMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : wind power production in MW (megawatt)
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“y” : wind power production in MW (megawatt)
- Return type
pandas.DataFrame
object with Wind Power data.
- load_electricity(agg_freq=None, agg_func=None)[source]
Loads the Hourly Electricity dataset.
This dataset contains the hourly consumption (in Kilowatt) of 321 clients from 2012 to 2014 published by Monash. We aggregated them by taking the average across the 321 clients. Source: https://zenodo.org/record/4656140#.YrpKtezMJqs
Below is the dataset attribute information:
ts : hourly timestamp
y : average electricity consumption in Kilowatt
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“y” : average electricity consumption in Kilowatt
- Return type
pandas.DataFrame
object with Electricity data.
- load_sf_traffic(agg_freq=None, agg_func=None)[source]
Loads the Hourly San Francisco Bay Area Traffic dataset.
This dataset contains the road occupancy rates (between 0 and 1) measured by different sensors on San Francisco Bay area freeways from 2015 to 2016. Source: https://zenodo.org/record/4656132#.YrpMxuzMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : average occupancy rate
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“y” : average occupancy rate
- Return type
pandas.DataFrame
object with San Francisco Bay Area Traffic data.
- load_bitcoin_transactions(agg_freq=None, agg_func=None)[source]
Loads the Daily Bitcoin Transactions dataset.
This dataset contains the number of Bitcoin transactions from 2009 to 2021. The dataset was curated (with missing values filled) by Monash. Source: https://zenodo.org/record/5122101#.YrpNFuzMLGp
Below is the dataset attribute information:
ts : date
y : number of transactions
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : date
“y” : number of transactions
- Return type
pandas.DataFrame
object with Bitcoin Transactions data.
- load_sunspot()[source]
Loads the Sunspot dataset.
This dataset contains the number of observed sunspots from 1818 to 2020 published by Monash. The original dataset was a daily series, and we aggregate it to a monthly time series more than 200 years long. Source: https://zenodo.org/record/4654722#.YrpQ4uzMLGp
Below is the dataset attribute information:
ts : month start date
y : average number of sunspots
- Returns
df – Has the following columns:
“ts” : date
“y” : average number of sunspots
- Return type
pandas.DataFrame
object with Sunspot data.
- load_fred_housing()[source]
Loads the FRED House Supply dataset.
This dataset contains the monthly house supply in the United States from 1963 to 2021 obtained from FRED. Source: https://fred.stlouisfed.org/series/MSACSR
Below is the dataset attribute information:
ts : month start date
y : monthly supply of new houses
- Returns
df – Has the following columns:
“ts” : date
“y” : monthly supply of new houses
- Return type
pandas.DataFrame
object with FRED House Supply data.
- load_beijing_pm(agg_freq=None, agg_func=None)[source]
Loads the Beijing Particulate Matter (PM2.5) dataset. Source: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data
This hourly dataset contains the PM2.5 data of the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport are also included.
The dataset covers the period from Jan 1st, 2010 to Dec 31st, 2014. Missing data are denoted as NA.
Below is the dataset attribute information:
No : row number
year : year of data in this row
month : month of data in this row
day : day of data in this row
hour : hour of data in this row
pm2.5 : PM2.5 concentration (ug/m^3)
DEWP : dew point (celsius)
TEMP : temperature (celsius)
PRES : pressure (hPa)
cbwd : combined wind direction
Iws : cumulated wind speed (m/s)
Is : cumulated hours of snow
Ir : cumulated hours of rain
- Parameters
Refer to the input of function get_aggregated_data.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“year” : year of data in this row
“month” : month of data in this row
“day” : day of data in this row
“hour” : hour of data in this row
“pm” : PM2.5 concentration (ug/m^3)
“dewp” : dew point (celsius)
“temp” : temperature (celsius)
“pres” : pressure (hPa)
“cbwd” : combined wind direction
“iws” : cumulated wind speed (m/s)
“is” : cumulated hours of snow
“ir” : cumulated hours of rain
- Return type
pandas.DataFrame
with Beijing PM2.5 data.
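Comparing the raw attributes with the returned columns suggests two transformations: the column names are lowercased (with “pm2.5” becoming “pm” and the row number dropped), and an hourly “ts” is built from year, month, day and hour. The sketch below shows one way to do that for a single row; normalize_row and the sample values are hypothetical, and how the library performs this step is not stated in the source.

```python
from datetime import datetime

RENAME = {"pm2.5": "pm"}  # other raw names are assumed to be lowercased as-is


def normalize_row(raw):
    """Map a raw row (dict) to the returned column names and add "ts"."""
    row = {
        RENAME.get(key, key.lower()): value
        for key, value in raw.items()
        if key != "No"  # the row number is not among the returned columns
    }
    row["ts"] = datetime(row["year"], row["month"], row["day"], row["hour"])
    return row


# Illustrative values only, not copied from the dataset.
raw = {"No": 1, "year": 2010, "month": 1, "day": 2, "hour": 0,
       "pm2.5": 100.0, "DEWP": -10, "TEMP": -5.0, "PRES": 1020.0,
       "cbwd": "SE", "Iws": 2.0, "Is": 0, "Ir": 0}
row = normalize_row(raw)
```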
- load_hierarchical_actuals()[source]
Loads hierarchical actuals.
This dataset contains synthetic data that satisfy hierarchical constraints. Consider the 3-level tree with the parent-child relationships below.
         00            # level 0
       /    \
     10      11        # level 1
   / | \    /  \
 20 21 22  23  24      # level 2
There is one root node (00) with 2 children. The first child (10) has 3 children. The second child (11) has 2 children.
Let x_{ij} be the value of the j-th node in level i of the tree ({ij} is shown in the diagram above). We require the value of a parent to equal the sum of the values of its children. There are 3 constraints in this hierarchy, satisfied at all time points:
x_00 = x_10 + x_11
x_10 = x_20 + x_21 + x_22
x_11 = x_23 + x_24
Below is the dataset attribute information:
“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
- Returns
df – Has the following columns:
“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
The values satisfy the hierarchical constraints above.
- Return type
pandas.DataFrame
object with synthetic hierarchical data.
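The three parent-equals-sum-of-children constraints can be checked per time point with a few lines of code. The sketch below encodes the tree as a parent-to-children mapping and tests one row at a time; satisfies_hierarchy and the sample values are hypothetical, and the tolerance is an arbitrary choice for floating-point comparison.

```python
import math

# Parent -> children, matching the 3-level tree described above.
CHILDREN = {"00": ["10", "11"], "10": ["20", "21", "22"], "11": ["23", "24"]}


def satisfies_hierarchy(row, tol=1e-9):
    """Return True if every parent equals the sum of its children."""
    return all(
        math.isclose(row[parent], sum(row[child] for child in kids), abs_tol=tol)
        for parent, kids in CHILDREN.items()
    )


# Illustrative values for a single time point.
actual = {"20": 1.0, "21": 2.0, "22": 3.0, "23": 4.0, "24": 5.0,
          "10": 6.0, "11": 9.0, "00": 15.0}
forecast = dict(actual, **{"00": 14.0})  # root no longer matches its children
```

A row of actuals should pass the check, while a row of unreconciled forecasts generally will not.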
- load_hierarchical_forecasts()[source]
Loads hierarchical forecasts.
This dataset contains forecasts for the actuals given by load_hierarchical_actuals. The attributes are the same.
- Returns
df – Has the following columns:
“ts” : date of the forecasted value
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
The forecasts do not satisfy the hierarchical constraints. The index and columns are identical to load_hierarchical_actuals.
- Return type
pandas.DataFrame
object with forecasts for synthetic hierarchical data.
- class greykite.framework.benchmark.data_loader_ts.DataLoaderTS[source]
Returns datasets included in the library in pandas.DataFrame or UnivariateTimeSeries format.
Extends DataLoader.
- load_peyton_manning_ts()[source]
Loads the Daily Peyton Manning dataset.
This dataset contains the log of daily page views for the Wikipedia page for Peyton Manning. It is one of the primary datasets used in demonstrations of the Facebook Prophet algorithm. Source: https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
Below is the dataset attribute information:
ts : date of the page view
y : log of the number of page views
- Returns
ts – Peyton Manning page views data. Time and value column:
time_col (“ts”) : Date of the page view.
value_col (“y”) : Log of the number of page views.
- Return type
UnivariateTimeSeries
- load_parking_ts(system_code_number=None)[source]
Loads the Hourly Parking dataset.
This dataset contains occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19 from car parks in Birmingham that are operated by NCP from Birmingham City Council. Source: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham (UK Open Government Licence, OGL).
Below is the dataset attribute information:
SystemCodeNumber : car park ID
Capacity : car park capacity
Occupancy : car park occupancy rate
LastUpdated : date and time of the measure
- Parameters
system_code_number (str or None, default None) – If None, the occupancy rate is averaged across all SystemCodeNumber values. Otherwise, only the occupancy rate of the given system_code_number is returned.
- Returns
ts – Parking data. Time and value column:
time_col (“LastUpdated”) : Date and time of the occupancy rate, rounded to the nearest half hour.
value_col (“OccupancyRatio”) : Occupancy divided by Capacity.
- Return type
UnivariateTimeSeries
- load_bikesharing_ts()[source]
Loads the Hourly Bike Sharing Count dataset.
This dataset contains the aggregated hourly count of rented bikes. The data also includes weather data: maximum daily temperature (tmax), minimum daily temperature (tmin), and precipitation (pn). The raw bike-sharing data is provided by Capital Bikeshare. Source: https://www.capitalbikeshare.com/system-data The raw weather data (Baltimore-Washington INTL Airport) is from https://www.ncdc.noaa.gov/data-access/land-based-station-data
Below is the dataset attribute information:
ts : hour and date
count : number of shared bikes
tmin : minimum daily temperature
tmax : maximum daily temperature
pn : precipitation
- Returns
ts – Bike Sharing Count data. Time and value column:
time_col (“ts”) : Hour and date.
value_col (“y”) : Number of rented bikes across Washington DC.
Additional regressors:
“tmin” : minimum daily temperature
“tmax” : maximum daily temperature
“pn” : precipitation
- Return type
UnivariateTimeSeries
- load_beijing_pm_ts()[source]
Loads the Beijing Particulate Matter (PM2.5) dataset. Source: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data
This hourly dataset contains the PM2.5 data of the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport are also included.
The dataset covers the period from Jan 1st, 2010 to Dec 31st, 2014. Missing data are denoted as NA.
Below is the dataset attribute information:
No : row number
year : year of data in this row
month : month of data in this row
day : day of data in this row
hour : hour of data in this row
pm2.5 : PM2.5 concentration (ug/m^3)
DEWP : dew point (celsius)
TEMP : temperature (celsius)
PRES : pressure (hPa)
cbwd : combined wind direction
Iws : cumulated wind speed (m/s)
Is : cumulated hours of snow
Ir : cumulated hours of rain
- Returns
ts – Beijing PM2.5 data. Time and value column:
time_col (TIME_COL) : hourly timestamp
value_col (“pm”) : PM2.5 concentration (ug/m^3)
Additional regressors:
“dewp” : dew point (celsius)
“temp” : temperature (celsius)
“pres” : pressure (hPa)
“cbwd” : combined wind direction
“iws” : cumulated wind speed (m/s)
“is” : cumulated hours of snow
“ir” : cumulated hours of rain
- Return type
UnivariateTimeSeries
- load_data_ts(data_name, **kwargs)[source]
Loads dataset by name from the internal data library.
- Parameters
data_name (str) – Dataset to load from the internal data library.
- Returns
ts – Has the requested data_name.
- Return type
UnivariateTimeSeries
- static get_aggregated_data(df, agg_freq=None, agg_func=None)
Returns aggregated data.
- Parameters
df (pandas.DataFrame) – The input data must have the TIME_COL (“ts”) column and the columns in the keys of agg_func.
agg_freq (str or None, default None) – If None, data will not be aggregated and will include all columns. Possible values: “hourly”, “daily”, “weekly”, or “monthly”.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – The aggregated dataframe.
- Return type
pandas.DataFrame
- static get_data_home(data_dir=None, data_sub_dir=None)
Returns the folder path data_dir/data_sub_dir. If data_dir is None, returns the internal data directory. By default the Greykite data dir is set to a folder named ‘data’ in the project source code. Alternatively, it can be set programmatically by giving an explicit folder path.
- Parameters
data_dir (str or None, default None) – The path to the input data directory.
data_sub_dir (str or None, default None) – The name of the input data sub directory. It is appended to the end of data_dir. If None, the data_dir path is unchanged.
- Returns
data_home – Path to the data folder.
- Return type
str
- get_data_inventory()
Returns the names of the available internal datasets.
- Returns
file_names – The names of the available internal datasets.
- Return type
list [str]
- static get_data_names(data_path)
Returns the names of the .csv and .csv.xz files in data_path.
- Parameters
data_path (str) – Path to the data folder.
- Returns
file_names – The names of the .csv and .csv.xz files in data_path.
- Return type
list [str]
- get_df(data_path, data_name)
Returns a pandas.DataFrame containing the dataset from data_path/data_name. The input data must be in .csv or .csv.xz format. Raises a ValueError if the specified input file is not found.
- Parameters
data_path (str) – Path to the data folder.
data_name (str) – Name of the csv file to be loaded from. For example ‘peyton_manning’.
- Returns
df – Input dataset.
- Return type
pandas.DataFrame
- load_beijing_pm(agg_freq=None, agg_func=None)
Loads the Beijing Particulate Matter (PM2.5) dataset. Source: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data
This hourly dataset contains the PM2.5 data of the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport are also included.
The dataset covers the period from Jan 1st, 2010 to Dec 31st, 2014. Missing data are denoted as NA.
Below is the dataset attribute information:
No : row number
year : year of data in this row
month : month of data in this row
day : day of data in this row
hour : hour of data in this row
pm2.5 : PM2.5 concentration (ug/m^3)
DEWP : dew point (celsius)
TEMP : temperature (celsius)
PRES : pressure (hPa)
cbwd : combined wind direction
Iws : cumulated wind speed (m/s)
Is : cumulated hours of snow
Ir : cumulated hours of rain
- Parameters
Refer to the input of function get_aggregated_data.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“year” : year of data in this row
“month” : month of data in this row
“day” : day of data in this row
“hour” : hour of data in this row
“pm” : PM2.5 concentration (ug/m^3)
“dewp” : dew point (celsius)
“temp” : temperature (celsius)
“pres” : pressure (hPa)
“cbwd” : combined wind direction
“iws” : cumulated wind speed (m/s)
“is” : cumulated hours of snow
“ir” : cumulated hours of rain
- Return type
pandas.DataFrame
with Beijing PM2.5 data.
- load_bikesharing(agg_freq=None, agg_func=None)
Loads the Hourly Bike Sharing Count dataset with possible aggregations.
This dataset contains the aggregated hourly count of rented bikes. The data also includes weather data: maximum daily temperature (tmax), minimum daily temperature (tmin), and precipitation (pn). The raw bike-sharing data is provided by Capital Bikeshare. Source: https://www.capitalbikeshare.com/system-data The raw weather data (Baltimore-Washington INTL Airport) is from https://www.ncdc.noaa.gov/data-access/land-based-station-data
Below is the dataset attribute information:
ts : hour and date
count : number of shared bikes
tmin : minimum daily temperature
tmax : maximum daily temperature
pn : precipitation
- Parameters
Refer to the input of function get_aggregated_data.
- Returns
df – If no freq was specified, the returned data has the following columns:
“date” : day of year
“ts” : hourly timestamp
“count” : number of rented bikes across Washington DC.
“tmin” : minimum daily temperature
“tmax” : maximum daily temperature
“pn” : precipitation
Otherwise, only the agg_col column is returned.
- Return type
pandas.DataFrame
with bikesharing data.
- load_bitcoin_transactions(agg_freq=None, agg_func=None)
Loads the Daily Bitcoin Transactions dataset.
This dataset contains the number of Bitcoin transactions from 2009 to 2021. The dataset was curated (with missing values filled) by Monash. Source: https://zenodo.org/record/5122101#.YrpNFuzMLGp
Below is the dataset attribute information:
ts : date
y : number of transactions
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : date
“y” : number of transactions
- Return type
pandas.DataFrame
object with Bitcoin Transactions data.
- load_data(data_name, **kwargs)
Loads dataset by name from the internal data library.
- Parameters
data_name (str) – Dataset to load from the internal data library.
- Returns
df
- Return type
UnivariateTimeSeries object with data_name.
- load_electricity(agg_freq=None, agg_func=None)
Loads the Hourly Electricity dataset.
This dataset contains the hourly consumption (in Kilowatt) of 321 clients from 2012 to 2014 published by Monash. We aggregated them by taking the average across the 321 clients. Source: https://zenodo.org/record/4656140#.YrpKtezMJqs
Below is the dataset attribute information:
ts : hourly timestamp
y : average electricity consumption in Kilowatt
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input is {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df – Has the following columns:
“ts” : hourly timestamp
“y” : average electricity consumption in Kilowatt
- Return type
pandas.DataFrame
object with Electricity data.
- load_fred_housing()
Loads the FRED House Supply dataset.
This dataset contains the monthly house supply in the United States from 1963 to 2021 obtained from FRED. Source: https://fred.stlouisfed.org/series/MSACSR
Below is the dataset attribute information:
ts : month start date
y : monthly supply of new houses
- Returns
df – Has the following columns:
“ts” : date
“y” : monthly supply of new houses
- Return type
pandas.DataFrame
object with FRED House Supply data.
- load_hierarchical_actuals()
Loads hierarchical actuals.
This dataset contains synthetic data that satisfy hierarchical constraints. Consider the 3-level tree with the parent-child relationships below.
         00            # level 0
       /    \
     10      11        # level 1
   / | \    /  \
 20 21 22  23  24      # level 2
There is one root node (00) with 2 children. The first child (10) has 3 children. The second child (11) has 2 children.
Let x_{ij} be the value of the j-th node in level i of the tree ({ij} is shown in the diagram above). We require the value of a parent to equal the sum of the values of its children. There are 3 constraints in this hierarchy, satisfied at all time points:
x_00 = x_10 + x_11
x_10 = x_20 + x_21 + x_22
x_11 = x_23 + x_24
Below is the dataset attribute information:
“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
- Returns
df – Has the following columns:
“ts” : date of the (synthetic) observation
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
The values satisfy the hierarchical constraints above.
- Return type
pandas.DataFrame
object with synthetic hierarchical data.
- load_hierarchical_forecasts()
Loads hierarchical forecasts.
This dataset contains forecasts for the actuals given by load_hierarchical_actuals. The attributes are the same.
- Returns
df – Has the following columns:
“ts” : date of the forecasted value
“00” : value for node 00
“10” : value for node 10
“11” : value for node 11
“20” : value for node 20
“21” : value for node 21
“22” : value for node 22
“23” : value for node 23
“24” : value for node 24
The forecasts do not satisfy the hierarchical constraints. The index and columns are identical to load_hierarchical_actuals.
- Return type
pandas.DataFrame
object with forecasts for synthetic hierarchical data.
- load_parking(system_code_number=None)
Loads the Hourly Parking dataset.
This dataset contains occupancy rates (8:00 to 16:30) from 2016/10/04 to 2016/12/19 from car parks in Birmingham that are operated by NCP from Birmingham City Council. Source: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham (UK Open Government Licence, OGL).
Below is the dataset attribute information:
SystemCodeNumber : car park ID
Capacity : car park capacity
Occupancy : car park occupancy rate
LastUpdated : date and time of the measure
- Parameters
system_code_number (str or None, default None) – If None, the occupancy rate is averaged across all SystemCodeNumber values. Otherwise, only the occupancy rate of the given system_code_number is returned.
- Returns
df – Has the following columns:
“LastUpdated” : time, rounded to the nearest half hour.
“Capacity” : car park capacity
“Occupancy” : car park occupancy rate
“OccupancyRatio” : Occupancy divided by Capacity.
- Return type
pandas.DataFrame
object with Parking data.
- load_peyton_manning()
Loads the Daily Peyton Manning dataset.
This dataset contains the log of daily page views for the Wikipedia page for Peyton Manning. It is one of the primary datasets used in demonstrations of the Facebook Prophet algorithm. Source: https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
Below is the dataset attribute information:
ts : date of the page view
y : log of the number of page views
- Returns
df –
Has the following columns:
”ts” : date of the page view. “y” : log of the number of page views.
- Return type
pandas.DataFrame
object with Peyton Manning data.
- load_sf_traffic(agg_freq=None, agg_func=None)
Loads the Hourly San Francisco Bay Area Traffic dataset.
This dataset contains the road occupancy rates (between 0 and 1) measured by different sensors on San Francisco Bay area freeways from 2015 to 2016. Source: https://zenodo.org/record/4656132#.YrpMxuzMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : average occupancy rate
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df –
Has the following columns:
“ts” : hourly timestamp
“y” : average occupancy rate
- Return type
pandas.DataFrame object with San Francisco Bay Area Traffic data.
- load_solarpower(agg_freq=None, agg_func=None)
Loads the Hourly Solar Power dataset.
This dataset contains the solar power production of an Australian wind farm from August 2019 to July 2020, with original frequency 4-second. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656027#.YrpHbuzMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : solar power production in MW (megawatt)
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df –
Has the following columns:
“ts” : hourly timestamp
“y” : solar power production in MW (megawatt)
- Return type
pandas.DataFrame object with Solar Power data.
- load_sunspot()
Loads the Sunspot dataset.
This dataset contains the number of observed sunspots from 1818 to 2020 published by Monash. The original dataset was a daily series, and we aggregate it to a monthly time series more than 200 years long. Source: https://zenodo.org/record/4654722#.YrpQ4uzMLGp
Below is the dataset attribute information:
ts : month start date
y : average number of sunspots
- Returns
df –
Has the following columns:
“ts” : date
“y” : average number of sunspots
- Return type
pandas.DataFrame object with Sunspot data.
- load_windpower(agg_freq=None, agg_func=None)
Loads the Hourly Wind Power dataset.
This dataset contains the wind power production of an Australian wind farm from August 2019 to July 2020, with original frequency 4-second. We aggregated it to an hourly series and removed any incomplete hours. Source: https://zenodo.org/record/4656032#.YrpJTezMLGp
Below is the dataset attribute information:
ts : hourly timestamp
y : wind power production in MW (megawatt)
- Parameters
agg_freq (str or None, default None) – Possible values: “daily”, “weekly”, or “monthly”. If None, data will not be aggregated and will include all columns.
agg_func (Dict [str, str], default None) – A dictionary of the columns to be aggregated and the corresponding aggregating functions. Possible aggregating functions include “sum”, “mean”, “median”, “max”, “min”, etc. An example input can be {"col1": "mean", "col2": "sum"}. If None, data will not be aggregated and will include all columns.
- Returns
df –
Has the following columns:
“ts” : hourly timestamp
“y” : wind power production in MW (megawatt)
- Return type
pandas.DataFrame object with Wind Power data.
- class greykite.algo.reconcile.convex.reconcile_forecasts.TraceInfo(df: DataFrame, color: Optional[str] = None, name: Optional[str] = None, legendgroup: Optional[str] = None)[source]
Contains y-values for related lines to plot, such as forecasts or actuals.
The lines share the same color, name, and legend group.
Internal Functions
- class greykite.algo.forecast.silverkite.constants.silverkite_seasonality.SilverkiteSeasonalityEnum(value)[source]
Defines default seasonalities for Silverkite estimator. Names should match those in SeasonalityEnum. The default order for various seasonalities is stored in this enum.
- DAILY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tod', period=24.0, order=12, seas_names='daily', default_min_days=2)
tod is 0-24 time of day (tod granularity based on input data, up to second level). Requires at least two full cycles to add the seasonal term (default_min_days=2).
- WEEKLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tow', period=7.0, order=4, seas_names='weekly', default_min_days=14)
tow is 0-7 time of week (tow granularity based on input data, up to second level). order=4 for full flexibility to model daily input.
- MONTHLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='tom', period=1.0, order=2, seas_names='monthly', default_min_days=60)
tom is 0-1 time of month (tom granularity based on input data, up to daily level).
- QUARTERLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='toq', period=1.0, order=5, seas_names='quarterly', default_min_days=180)
toq (continuous time of quarter) with natural period. Each day is mapped to a value in [0.0, 1.0) based on its position in the calendar quarter: (Jan1-Mar31, Apr1-Jun30, Jul1-Sep30, Oct1-Dec31). The start of each quarter is 0.0.
- YEARLY_SEASONALITY: SilverkiteSeasonality = SilverkiteSeasonality(name='ct1', period=1.0, order=15, seas_names='yearly', default_min_days=548)
ct1 (continuous year) with natural period.
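To make the role of period and order in the enum values concrete, here is a minimal sketch of standard Fourier series terms built from a time feature such as tow. The function fourier_terms is a hypothetical helper written for illustration; it assumes the usual sin/cos construction and is not taken from the library source.

```python
import math


def fourier_terms(x: float, period: float, order: int) -> list:
    """Returns [sin_1, cos_1, ..., sin_order, cos_order] evaluated at x.

    Term k uses the angle 2 * pi * k * x / period, so one full cycle
    of the first term spans exactly one period of the time feature.
    """
    terms = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * x / period
        terms.append(math.sin(angle))
        terms.append(math.cos(angle))
    return terms


# Weekly seasonality from the enum above: name="tow", period=7.0, order=4.
weekly = fourier_terms(x=3.5, period=7.0, order=4)
print(len(weekly))  # one sin and one cos column per order
```

A higher order adds more columns and lets the seasonal shape vary more sharply within one period, at the cost of more parameters to fit.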
- greykite.algo.common.ml_models.fit_ml_model(df, model_formula_str=None, fit_algorithm='linear', fit_algorithm_params=None, y_col=None, pred_cols=None, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method='zero_to_one', regression_weight_col=None, remove_intercept=False)[source]
Fits predictive ML (machine learning) models to a continuous response vector (given in y_col) and returns the fitted model.
- Parameters
df (pd.DataFrame) – A data frame with the response vector (y) and the feature columns (x_mat).
model_formula_str (str) – The prediction model formula string, e.g. “y~x1+x2+x3*x4”. This is similar to R formulas. See https://patsy.readthedocs.io/en/latest/formulas.html#how-formulas-work.
fit_algorithm (str, optional, default “linear”) –
The type of predictive model used in fitting.
See fit_model_via_design_matrix for available options and their parameters.
fit_algorithm_params (dict or None, optional, default None) – Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
y_col (str) – The column name which has the value of interest to be forecasted. If model_formula_str is not passed, y_col, e.g. [“y”], is used as the response vector column.
pred_cols (List[str]) – The names of the feature columns. If model_formula_str is not passed, pred_cols, e.g. [“x1”, “x2”, “x3”], is used as the design matrix columns.
min_admissible_value (Optional[Union[int, float, double]]) – The minimum admissible value for the predict function to return.
max_admissible_value (Optional[Union[int, float, double]]) – The maximum admissible value for the predict function to return.
uncertainty_dict (dict or None) –
If passed as a dictionary, an uncertainty model will be fit. The items in the dictionary are:
"uncertainty_method" : str
The title of the method. As of now, only “simple_conditional_residuals” is implemented, which calculates CIs by using residuals.
"params" : dict
A dictionary of parameters needed for the uncertainty_method requested.
normalize_method (str or None, default “zero_to_one”) – If a string is provided, it will be used as the normalization method in normalize_df, passed via the argument method. Available options are: “zero_to_one”, “statistical”, “minus_half_to_half”, “zero_at_origin”. If None, no normalization will be performed. See that function for more details.
regression_weight_col (str or None, default None) – The column name for the weights to be used in the weighted regression version of applicable machine-learning models.
remove_intercept (bool, default False) – Whether to remove explicit and implicit intercepts. By default, patsy will make the design matrix always full rank. It will always include an intercept term unless we specify “-1” or “+0”. However, if there are categorical variables, even if we specify “-1” or “+0”, it will include an implicit intercept by adding all levels of a categorical variable into the design matrix. Sometimes we don’t want this to happen. Setting this parameter to True will remove both explicit and implicit intercepts.
- Returns
trained_model –
Trained model dictionary with keys:
“y” : response values
“x_design_info” : design matrix information
“ml_model” : a trained model with predict method
“uncertainty_model” : dict, the returned uncertainty_model dict from conf_interval
“ml_model_summary” : model summary
“y_col” : response columns
“x_mat” : design matrix
“min_admissible_value” : minimum acceptable value
“max_admissible_value” : maximum acceptable value
“normalize_df_func” : normalization function
“regression_weight_col” : regression weight column
- Return type
dict
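The effect of min_admissible_value and max_admissible_value on predictions can be pictured as a clipping step. The sketch below assumes that predictions outside the admissible range are clipped to its bounds; clip_to_admissible is a hypothetical helper for illustration, not a function from the library.

```python
def clip_to_admissible(values, min_admissible_value=None, max_admissible_value=None):
    """Clips each predicted value to [min_admissible_value, max_admissible_value].

    A bound of None means that side is unconstrained.
    """
    clipped = []
    for v in values:
        if min_admissible_value is not None:
            v = max(v, min_admissible_value)
        if max_admissible_value is not None:
            v = min(v, max_admissible_value)
        clipped.append(v)
    return clipped


# Example: a ratio that must stay within [0, 5].
raw_predictions = [-2.0, 0.5, 7.0]
print(clip_to_admissible(raw_predictions, min_admissible_value=0.0, max_admissible_value=5.0))
```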
- greykite.algo.common.ml_models.fit_ml_model_with_evaluation(df, model_formula_str=None, y_col=None, pred_cols=None, fit_algorithm='linear', fit_algorithm_params=None, ind_train=None, ind_test=None, training_fraction=0.9, randomize_training=False, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method='zero_to_one', regression_weight_col=None, remove_intercept=False)[source]
Fits prediction models to a continuous response vector (y) and reports results.
- Parameters
df (pandas.DataFrame) – A data frame with the response vector (y) and the feature columns (x_mat).
model_formula_str (str) – The prediction model formula, e.g. “y~x1+x2+x3*x4”. This is similar to R language (https://www.r-project.org/) formulas. See https://patsy.readthedocs.io/en/latest/formulas.html#how-formulas-work.
y_col (str) – The column name which has the value of interest to be forecasted. If model_formula_str is not passed, y_col, e.g. [“y”], is used as the response vector column.
pred_cols (list [str]) – The names of the feature columns. If model_formula_str is not passed, pred_cols, e.g. [“x1”, “x2”, “x3”], is used as the design matrix columns.
fit_algorithm (str, optional, default “linear”) –
The type of predictive model used in fitting.
See fit_model_via_design_matrix for available options and their parameters.
fit_algorithm_params (dict or None, optional, default None) – Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
ind_train (list [int]) – The index (row number) of the training set.
ind_test (list [int]) – The index (row number) of the test set.
training_fraction (float, between 0.0 and 1.0) – The fraction of data used for training. This is invoked if ind_train and ind_test are not passed. If this is also None or 1.0, then we skip testing and train on the entire dataset.
randomize_training (bool) – If True, then the training and the test sets will be randomized rather than in chronological order.
min_admissible_value (Optional[Union[int, float, double]]) – The minimum admissible value for the predict function to return.
max_admissible_value (Optional[Union[int, float, double]]) – The maximum admissible value for the predict function to return.
uncertainty_dict (dict or None) –
If passed as a dictionary, an uncertainty model will be fit. The items in the dictionary are:
"uncertainty_method" : str
The title of the method. As of now, only “simple_conditional_residuals” is implemented, which calculates CIs by using residuals.
"params" : dict
A dictionary of parameters needed for the uncertainty_method requested.
normalize_method (str or None, default “zero_to_one”) – If a string is provided, it will be used as the normalization method in normalize_df, passed via the argument method. Available options are: “zero_to_one”, “statistical”, “minus_half_to_half”, “zero_at_origin”. If None, no normalization will be performed. See that function for more details.
regression_weight_col (str or None, default None) – The column name for the weights to be used in the weighted regression version of applicable machine-learning models.
remove_intercept (bool, default False) – Whether to remove explicit and implicit intercepts. By default, patsy will make the design matrix always full rank. It will always include an intercept term unless we specify “-1” or “+0”. However, if there are categorical variables, even if we specify “-1” or “+0”, it will include an implicit intercept by adding all levels of a categorical variable into the design matrix. Sometimes we don’t want this to happen. Setting this parameter to True will remove both explicit and implicit intercepts.
- Returns
trained_model –
Trained model dictionary with the following keys.
“ml_model” : A trained model object
“summary” : Summary of the final model trained on all data
“x_mat” : Feature vectors matrix used for training of full data (rows of df with NA are dropped)
“y” : Response vector for training and testing (rows of df with NA are dropped). The index corresponds to selected rows in the input df.
“y_train” : Response vector used for training
“y_train_pred” : Predicted values of y_train
“training_evaluation” : score function value of y_train and y_train_pred
“y_test” : Response vector used for testing
“y_test_pred” : Predicted values of y_test
“test_evaluation” : score function value of y_test and y_test_pred
“uncertainty_model” : dict, the returned uncertainty_model dict from conf_interval
“plt_compare_test” : plot function to compare y_test and y_test_pred
“plt_pred” : plot function to compare y_train, y_train_pred, y_test and y_test_pred
- Return type
dict
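The interaction between training_fraction and randomize_training can be sketched as follows. This is an illustrative helper under stated assumptions (rounding of the training size and the seed argument are choices made here, not documented behavior), intended only to show the chronological versus randomized split.

```python
import random


def split_train_test(n_rows, training_fraction=0.9, randomize_training=False, seed=0):
    """Returns (ind_train, ind_test) row indices for a dataset with n_rows rows.

    If training_fraction is None or 1.0, testing is skipped and all rows are used for training.
    """
    indices = list(range(n_rows))
    if training_fraction is None or training_fraction >= 1.0:
        return indices, []
    n_train = int(round(n_rows * training_fraction))  # rounding rule is an assumption
    if randomize_training:
        rng = random.Random(seed)  # hypothetical seed argument, for reproducibility
        ind_train = sorted(rng.sample(indices, n_train))
        train_set = set(ind_train)
        ind_test = [i for i in indices if i not in train_set]
    else:
        # Chronological split: the earliest rows train, the latest rows test.
        ind_train = indices[:n_train]
        ind_test = indices[n_train:]
    return ind_train, ind_test


ind_train, ind_test = split_train_test(n_rows=10, training_fraction=0.9)
print(ind_train, ind_test)
```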
- greykite.algo.forecast.silverkite.forecast_silverkite_helper.get_silverkite_uncertainty_dict(uncertainty, simple_freq='DAY', coverage=None)[source]
Returns an uncertainty_dict for the forecast input parameter uncertainty_dict.
The logic is as follows:
- If uncertainty is passed as a dict:
  If quantiles are not passed through uncertainty, they are filled in using coverage.
  If coverage is also missing, or the quantiles calculated in two ways (via uncertainty["params"]["quantiles"] and coverage) do not match, an exception is raised.
- If uncertainty=="auto":
  Defaults are provided based on the time frequency of the data.
  uncertainty["params"]["quantiles"] is specified based on coverage if provided; otherwise the default coverage is 0.95.
- Parameters
uncertainty (str or dict or None) –
It specifies what method should be used for uncertainty. If a dict is passed then it is directly returned to be passed to forecast as uncertainty_dict.
If “auto”, it builds a generic dict depending on frequency.
For frequencies less than or equal to one day it sets conditional_cols to be [“dow_hr”].
Otherwise it sets the conditional_cols to be None.
If None and coverage is None, the upper/lower predictions are not returned.
simple_freq (str, optional) – SimpleTimeFrequencyEnum member that best matches the input data frequency according to get_simple_time_frequency_from_period.
coverage (float or None, optional) – Intended coverage of the prediction bands (0.0 to 1.0). If None and uncertainty is None, the upper/lower predictions are not returned.
- Returns
uncertainty – An uncertainty dict to be used as input to forecast. See that function’s docstring for more details.
- Return type
dict or None
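The quantile logic described above can be sketched in a few lines. The helpers below are hypothetical and only cover the quantile handling; they assume symmetric two-sided bands, so a coverage of 0.95 maps to quantiles 0.025 and 0.975, and they treat uncertainty=None with a given coverage like “auto”, which is an assumption.

```python
import math


def quantiles_from_coverage(coverage):
    """Symmetric lower/upper quantiles for a two-sided band with the given coverage."""
    lower = (1.0 - coverage) / 2.0
    return [lower, 1.0 - lower]


def resolve_quantiles(uncertainty, coverage=None, default_coverage=0.95):
    """Returns the quantiles to use, or None if no prediction bands are requested."""
    if uncertainty is None and coverage is None:
        return None  # upper/lower predictions are not returned
    if uncertainty is None or uncertainty == "auto":
        # Assumption: None with a coverage behaves like "auto".
        return quantiles_from_coverage(coverage if coverage is not None else default_coverage)
    given = uncertainty.get("params", {}).get("quantiles")
    if given is None:
        if coverage is None:
            raise ValueError("Provide either quantiles or coverage.")
        return quantiles_from_coverage(coverage)
    if coverage is not None:
        expected = quantiles_from_coverage(coverage)
        same = len(given) == len(expected) and all(
            math.isclose(a, b) for a, b in zip(given, expected))
        if not same:
            raise ValueError("Quantiles and coverage do not match.")
    return given


print(resolve_quantiles("auto"))
print(resolve_quantiles({"params": {}}, coverage=0.9))
```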
- class greykite.algo.forecast.silverkite.forecast_simple_silverkite.SimpleSilverkiteForecast(constants: ~greykite.algo.forecast.silverkite.constants.silverkite_constant.SilverkiteConstant = <greykite.algo.forecast.silverkite.constants.silverkite_constant.SilverkiteConstant object>)[source]
A derived class of SilverkiteForecast. Provides an alternative interface with simplified configuration parameters. Produces the same trained model output and uses the same predict functions.
- convert_params(df: DataFrame, time_col: str, value_col: str, time_properties: Optional[Dict] = None, freq: Optional[str] = None, forecast_horizon: Optional[int] = None, origin_for_time_vars: Optional[float] = None, train_test_thresh: Optional[datetime] = None, training_fraction: Optional[float] = 0.9, fit_algorithm: str = 'ridge', fit_algorithm_params: Optional[Dict] = None, auto_holiday: bool = False, holidays_to_model_separately: Optional[Union[str, List[str]]] = 'auto', holiday_lookup_countries: Optional[Union[str, List[str]]] = 'auto', holiday_pre_num_days: int = 2, holiday_post_num_days: int = 2, holiday_pre_post_num_dict: Optional[Dict] = None, daily_event_df_dict: Optional[Dict] = None, daily_event_neighbor_impact: Optional[Union[int, List[int], callable]] = None, daily_event_shifted_effect: Optional[List[str]] = None, auto_growth: bool = False, changepoints_dict: Optional[Dict] = None, auto_seasonality: bool = False, yearly_seasonality: Union[bool, str, int] = 'auto', quarterly_seasonality: Union[bool, str, int] = 'auto', monthly_seasonality: Union[bool, str, int] = 'auto', weekly_seasonality: Union[bool, str, int] = 'auto', daily_seasonality: Union[bool, str, int] = 'auto', max_daily_seas_interaction_order: Optional[int] = None, max_weekly_seas_interaction_order: Optional[int] = None, autoreg_dict: Optional[Dict] = None, past_df: Optional[DataFrame] = None, lagged_regressor_dict: Optional[Dict] = None, seasonality_changepoints_dict: Optional[Dict] = None, min_admissible_value: Optional[float] = None, max_admissible_value: Optional[float] = None, uncertainty_dict: Optional[Dict] = None, normalize_method: Optional[str] = None, growth_term: Optional[str] = 'linear', regressor_cols: Optional[List[str]] = None, feature_sets_enabled: Optional[Union[bool, str, Dict[str, Optional[Union[bool, str]]]]] = 'auto', extra_pred_cols: Optional[List[str]] = None, drop_pred_cols: Optional[List[str]] = None, explicit_pred_cols: Optional[List[str]] = None, 
regression_weight_col: Optional[str] = None, simulation_based: Optional[bool] = False, simulation_num: int = 10, fast_simulation: bool = False, remove_intercept: bool = False)[source]
Converts parameters of forecast_simple_silverkite into those of SilverkiteForecast::forecast.
Makes it easier to set parameters to SilverkiteForecast::forecast suitable for most forecasting problems. Provides data-aware defaults for seasonality and interaction terms. Provides a simple configuration of holidays from an internal holiday database, and user-friendly configuration for growth and regressors.
These parameters can be set from a plain-text config (e.g. no pandas dataframes). The parameter list is intentionally flat to facilitate hyperparameter grid search. Every parameter is either a parameter of SilverkiteForecast::forecast or a tuning parameter.
Notes
The basic parameters are identical to SilverkiteForecast::forecast. The more complex parameters are specified via config parameters:
daily_event_df_dict (via holiday*)
fs_components_df (via *_seasonality)
extra_pred_cols (via holiday*, *seas*, growth_term, regressor_cols, feature_sets_enabled, extra_pred_cols)
- Parameters
df (pandas.DataFrame) – A data frame which includes the timestamp column as well as the value column. This is the df for training the model, not for future prediction.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
time_properties (dict [str, any] or None, optional) –
Time properties dictionary (likely produced by get_forecast_time_properties) with keys:
"ts" : UnivariateTimeSeries or None
df converted to a UnivariateTimeSeries.
"period" : int
Period of each observation (i.e. minimum time between observations, in seconds).
"simple_freq" : SimpleTimeFrequencyEnum
SimpleTimeFrequencyEnum member corresponding to data frequency.
"num_training_points" : int
Number of observations for training.
"num_training_days" : int
Number of days for training.
"start_year" : int
Start year of the training period.
"end_year" : int
End year of the forecast period.
"origin_for_time_vars" : float
Continuous time representation of the first date in df.
In this function:
start_year and end_year are used to define daily_event_df_dict.
simple_freq and num_training_days are used to define fs_components_df.
simple_freq and num_training_days are used to set default feature_sets_enabled.
origin_for_time_vars is used to set default origin_for_time_vars.
The other parameters are ignored.
It is okay if num_training_points, num_training_days, start_year, end_year are computed for a superset of df. This allows CV splits and backtest, which train on partial data, to use the same data-aware model parameters as the forecast on all training data.
If None, the values are computed for df. This corresponds to using the same modeling approach on the CV splits and backtest from forecast_pipeline, without requiring the same parameters. In this case, make sure forecast_horizon is at least as large as the test period for the split, to ensure all holidays are captured.
freq (str or None, optional, default None) – Frequency of input data. Used to compute time_properties only if time_properties is None. Frequency strings can have multiples, e.g. ‘5H’. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a list of frequency aliases. If None, inferred by pandas.infer_freq. Provide this parameter if df has missing timepoints.
forecast_horizon (int or None, optional, default None) – Number of periods to forecast into the future. Must be > 0. Used to compute time_properties only if time_properties is None. If None, default is determined by input data frequency. Used to determine forecast end date, to pull the appropriate holiday data. Should be at least as large as the prediction period (if this function is called from forecast_pipeline, the prediction period for different splits is set via cv_horizon, test_horizon, forecast_horizon).
origin_for_time_vars (float or None, optional, default None) – The time origin used to create continuous variables for time. If None, uses the value from time_properties.
train_test_thresh (datetime.datetime or None, optional, default None) – The threshold for the training and testing split, e.g. datetime.datetime(2019, 6, 30). Note that the final returned model is trained using all data. If None, the training split is based on training_fraction.
training_fraction (float or None, optional, default 0.9) – The fraction of data used for training (0.0 to 1.0). Used only if train_test_thresh is None. If this is also None or 1.0, then we skip testing and train on the entire dataset.
fit_algorithm (str, optional, default “ridge”) –
The type of predictive model used in fitting.
See fit_model_via_design_matrix for available options and their parameters.
fit_algorithm_params (dict or None, optional, default None) – Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
auto_holiday (bool, default False) –
Whether to automatically infer holiday configuration based on the input timeseries. The candidate lookup countries are specified by holiday_lookup_countries. If True, the following parameters will be ignored:
“holidays_to_model_separately”
“holiday_pre_num_days”
“holiday_post_num_days”
“holiday_pre_post_num_dict”
For details, see HolidayInferrer. Extra events specified in daily_event_df_dict will be added to the inferred holidays.
holiday_lookup_countries (list [str] or “auto” or None, optional, default “auto”) –
The countries that contain the holidays you intend to model (holidays_to_model_separately).
If “auto”, uses a default list of countries that contain the default holidays_to_model_separately. See HOLIDAY_LOOKUP_COUNTRIES_AUTO.
If a list, must be a list of country names.
If None or an empty list, no holidays are modeled.
holidays_to_model_separately (list [str] or “auto” or ALL_HOLIDAYS_IN_COUNTRIES or None, optional, default “auto”) –
Which holidays to include in the model. The model creates a separate key, value for each item in holidays_to_model_separately. The other holidays in the countries are grouped together as a single effect.
If “auto”, uses a default list of important holidays. See HOLIDAYS_TO_MODEL_SEPARATELY_AUTO.
If ALL_HOLIDAYS_IN_COUNTRIES, uses all available holidays in holiday_lookup_countries. This can often create a model that has too many parameters, and should typically be avoided.
If a list, must be a list of holiday names.
If None or an empty list, all holidays in holiday_lookup_countries are grouped together as a single effect.
Use holiday_lookup_countries to provide a list of countries where these holidays occur.
holiday_pre_num_days (int, default 2) – Model holiday effects for holiday_pre_num_days days before the holiday.
holiday_post_num_days (int, default 2) – Model holiday effects for holiday_post_num_days days after the holiday.
holiday_pre_post_num_dict (dict [str, (int, int)] or None, default None) – Overrides pre_num and post_num for each holiday in holidays_to_model_separately. For example, if holidays_to_model_separately contains “Thanksgiving” and “Labor Day”, this parameter can be set to {"Thanksgiving": [1, 3], "Labor Day": [1, 2]}, denoting that the “Thanksgiving” pre_num is 1 and post_num is 3, and “Labor Day” pre_num is 1 and post_num is 2. Holidays not specified use the default given by pre_num and post_num.
daily_event_df_dict (dict [str, pandas.DataFrame] or None, default None) –
A dictionary of data frames, each representing events data for the corresponding key. Specifies additional events to include besides the holidays specified above. The format is the same as in forecast. The DataFrame has two columns:
The first column contains event dates. Must be in a format recognized by pandas.to_datetime. Must be at daily frequency for proper join. It is joined against the time in df, converted to a day: pd.to_datetime(pd.DatetimeIndex(df[time_col]).date).
The second column contains the event label for each date.
The column order is important; column names are ignored. The event dates must span their occurrences in both the training and future prediction period.
During modeling, each key in the dictionary is mapped to a categorical variable named f"{EVENT_PREFIX}_{key}", whose value at each timestamp is specified by the corresponding DataFrame.
For example, to manually specify a yearly event on September 1 during a training/forecast period that spans 2020-2022:

    import pandas as pd

    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "label": ["is_event", "is_event", "is_event"]
        })
    }

It’s possible to specify multiple events in the same df. Two events, "sep" and "oct", are specified below for 2020-2021:

    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2020-10-01", "2021-09-01", "2021-10-01"],
            "event_name": ["sep", "oct", "sep", "oct"]
        })
    }

Use multiple keys if two events may fall on the same date. These events must be in separate DataFrames:

    daily_event_df_dict = {
        "fixed_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "event_name": "fixed_event"
        }),
        "moving_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-08-28", "2022-09-03"],
            "event_name": "moving_event"
        }),
    }

The multiple event specification can be used even if events never overlap. An equivalent specification to the second example:

    daily_event_df_dict = {
        "sep": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01"],
            "event_name": "is_event"
        }),
        "oct": pd.DataFrame({
            "date": ["2020-10-01", "2021-10-01"],
            "event_name": "is_event"
        }),
    }

Note: All these events are automatically added to the model. There is no need to specify them in extra_pred_cols as you would for forecast.
Note: Do not use EVENT_DEFAULT in the second column. This is reserved to indicate dates that do not correspond to an event.
daily_event_neighbor_impact (int, list [int], callable or None, default None) –
The impact of neighboring timestamps of the events in event_df_dict. This is for daily events, so the units below are all in days.
For example, if the data is weekly (“W-SUN”) and an event is daily, the event may not fall exactly on the weekly date. You can specify that New Year’s Day on 1/1 affects all dates in the week, e.g. 12/31, 1/1, …, 1/6, so that it is mapped to the weekly date. In this case you may want to map a daily event’s date to a few dates, and can specify neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)].
Another example is rolling 7-day daily data, where a holiday may affect the dates t, t+1, …, t+6. You can specify neighbor_impact=7.
If input is int, the mapping is t, t+1, …, t+neighbor_impact-1. If input is list, the mapping is [t+x for x in neighbor_impact]. If input is a function, it maps each daily event’s date to a list of dates.
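The three accepted input types can be illustrated with a small sketch. The function expand_event_date is a hypothetical helper that follows the mapping rules stated above; it is not the library implementation.

```python
from datetime import date, timedelta


def expand_event_date(event_date, neighbor_impact=None):
    """Maps one daily event date to the list of dates it affects."""
    if neighbor_impact is None:
        return [event_date]
    if callable(neighbor_impact):
        # A function maps the event date to a list of dates.
        return list(neighbor_impact(event_date))
    if isinstance(neighbor_impact, int):
        # int: t, t+1, ..., t+neighbor_impact-1
        return [event_date + timedelta(days=i) for i in range(neighbor_impact)]
    # list: [t+x for x in neighbor_impact]
    return [event_date + timedelta(days=x) for x in neighbor_impact]


new_year = date(2021, 1, 1)  # a Friday

# Map the event to every day of its ISO week (Monday through Sunday).
whole_week = lambda x: [
    x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i) for i in range(7)]

print(expand_event_date(new_year, whole_week))
print(expand_event_date(new_year, 7))
print(expand_event_date(new_year, [-1, 0, 1]))
```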
daily_event_shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas to_offset. The new events’ names will be the current events’ names with suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have name “US_Christmas Day_7D_after”. This is useful when you expect that an offset of the current holidays also has an impact on the time series, or you want to interact the lagged terms with autoregression. If daily_event_neighbor_impact is also specified, this will be applied after adding neighboring days.
auto_growth (bool, default False) –
Whether to automatically infer growth configuration. If True, the growth term and automatic changepoint detection configuration will be inferred from the input timeseries, and the following parameters will be ignored:
“growth_term”
“changepoints_dict” (except parameters that control how custom changepoints are combined with automatically detected changepoints. These parameters include “dates”, “combine_changepoint_min_distance” and “keep_detected”.)
For details, see generate_trend_changepoint_detection_params.
changepoints_dict (dict or None, optional, default None) –
Specifies the changepoint configuration.
"method" : str
The method to locate changepoints. Valid options:
“uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
“custom”. Places changepoints at the specified dates.
“auto”. Automatically detects change points. For configuration, see find_trend_changepoints.
Additional keys to provide parameters for each particular method are described below.
"continuous_time_col" : str, optional
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.
"growth_func" : callable or None, optional
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to continuous_time_col with offsets. If None, uses the identity function to use continuous_time_col directly as the growth term.
If changepoints_dict[“method”] == “uniform”, this other key is required:
"n_changepoints" : int
Number of changepoints to evenly space across the training period.
If changepoints_dict[“method”] == “custom”, this other key is required:
"dates" : Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
If changepoints_dict[“method”] == “auto”, the keys that match the parameters in
find_trend_changepoints
, exceptdf
,time_col
andvalue_col
, are optional. Extra keys also include “dates”, “combine_changepoint_min_distance” and “keep_detected” to specify additional custom trend changepoints. These three parameters correspond to the three parameters “custom_changepoint_dates”, “min_distance” and “keep_detected” incombine_detected_and_custom_trend_changepoints
.
auto_seasonality (bool, default False) –
Whether to automatically infer seasonality orders. If True, the seasonality orders will be automatically inferred from input timeseries and the following parameters will be ignored unless the value is
False
:”yearly_seasonality”
”quarterly_seasonality”
”monthly_seasonality”
”weekly_seasonality”
”daily_seasonality”
If the value of any of the above parameters is
False
, the corresponding seasonality order will be forced to be zero, regardless of the inferring result. For detail, seeSeasonalityInferrer
.yearly_seasonality (str or bool or int) – Determines the yearly seasonality. ‘auto’, True, False, or a number for the Fourier order
quarterly_seasonality (str or bool or int) – Determines the quarterly seasonality. ‘auto’, True, False, or a number for the Fourier order
monthly_seasonality (str or bool or int) – Determines the monthly seasonality. ‘auto’, True, False, or a number for the Fourier order
weekly_seasonality (str or bool or int) – Determines the weekly seasonality. ‘auto’, True, False, or a number for the Fourier order
daily_seasonality (str or bool or int) – Determines the daily seasonality. ‘auto’, True, False, or a number for the Fourier order
max_daily_seas_interaction_order (int or None, optional, default None) – Max fourier order for interaction terms with daily seasonality. If None, uses all available terms.
max_weekly_seas_interaction_order (int or None, optional, default None) – Max fourier order for interaction terms with weekly seasonality. If None, uses all available terms.
autoreg_dict (dict or str or None, optional, default None) –
If a dict: A dictionary with arguments for
build_autoreg_df
. That function’s parametervalue_col
is inferred from the input of current functionself.forecast
. Other keys are:"lag_dict"
: dict or None"agg_lag_dict"
: dict or None"series_na_fill_func"
: callable
If a str: The string represents a method, and a dictionary will be constructed using that str. Currently the only implemented method is “auto”, which uses __get_default_autoreg_dict to create a dictionary. See more details for the above parameters in
build_autoreg_df
.past_df (
pandas.DataFrame
or None, default None) – The past df used for building autoregression features. This is not necessarily needed since imputation is available, however, if such data is available but not used in training for speed purposes, they can be passed here to build more accurate autoregression features.lagged_regressor_dict (dict or None, default None) –
A dictionary with arguments for
greykite.common.features.timeseries_lags.build_autoreg_df_multi
. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included indf
. The value of each key is either a dict or str. If dict, it has the following keys:"lag_dict"
: dict or None"agg_lag_dict"
: dict or None"series_na_fill_func"
: callableIf str, it represents a method and a dictionary will be constructed using that str. Currently the only implemented method is “auto” which uses __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
lagged_regressor_dict = {
    "regressor1": {
        "lag_dict": {"orders": [1, 2, 3]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()},
    "regressor2": "auto"}
seasonality_changepoints_dict (dict or None, optional, default None) – The parameter dictionary for seasonality change point detection. Parameters are in
find_seasonality_changepoints
. Notedf
,time_col
,value_col
andtrend_changepoints
are auto populated, and do not need to be provided.min_admissible_value (float or None, optional, default None) – The minimum admissible value to return during prediction. If None, no limit is applied.
max_admissible_value (float or None, optional, default None) – The maximum admissible value to return during prediction. If None, no limit is applied.
uncertainty_dict (dict or None, optional, default None) –
- How to fit the uncertainty model. A dictionary with keys:
"uncertainty_method"
strThe title of the method. Only “simple_conditional_residuals” is implemented in
fit_prediction_model
which calculates CIs using residuals"params"
: dictA dictionary of parameters needed for the requested
uncertainty_method
. For example, foruncertainty_method="simple_conditional_residuals"
, see parameters ofconf_interval
, listed briefly here:"conditional_cols"
"quantiles"
"quantile_estimation_method"
"sample_size_thresh"
"small_sample_size_method"
"small_sample_size_quantile"
If None, no uncertainty intervals are calculated.
normalize_method (str or None, default None) – If a string is provided, it will be used as the normalization method in
normalize_df
, passed via the argumentmethod
. Available options are: “zero_to_one”, “statistical”, “minus_half_to_half”, “zero_at_origin”. If None, no normalization will be performed. See that function for more details.growth_term (str or None, optional, default “ct1”) – How to model the growth. Valid options are {“linear”, “quadratic”, “sqrt”, “cuberoot”}. See
GrowthColEnum
.regressor_cols (list [str] or None, optional, default None) – The columns in
df
to use as regressors. These must be provided during prediction as well.feature_sets_enabled (dict [str, bool or “auto” or None] or bool or “auto” or None, default “auto”) –
Whether to include interaction terms and categorical variables to increase model flexibility.
If a dict, boolean values indicate whether include various sets of features in the model. The following keys are recognized (from
SilverkiteColumn
):"COLS_HOUR_OF_WEEK"
strConstant hour of week effect
"COLS_WEEKEND_SEAS"
strDaily seasonality interaction with is_weekend
"COLS_DAY_OF_WEEK_SEAS"
strDaily seasonality interaction with day of week
"COLS_TREND_DAILY_SEAS"
strAllow daily seasonality to change over time by is_weekend
"COLS_EVENT_SEAS"
strAllow sub-daily event effects
"COLS_EVENT_WEEKEND_SEAS"
strAllow sub-daily event effect to interact with is_weekend
"COLS_DAY_OF_WEEK"
strConstant day of week effect
"COLS_TREND_WEEKEND"
strAllow trend (growth, changepoints) to interact with is_weekend
"COLS_TREND_DAY_OF_WEEK"
strAllow trend to interact with day of week
"COLS_TREND_WEEKLY_SEAS"
strAllow weekly seasonality to change over time
The following dictionary values are recognized:
True: include the feature set in the model
False: do not include the feature set in the model
None: do not include the feature set in the model
”auto” or not provided: use the default setting based on data frequency and size
If not a dict:
if a boolean, equivalent to a dictionary with all values set to the boolean.
if None, equivalent to a dictionary with all values set to False.
if “auto”, equivalent to a dictionary with all values set to “auto”.
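The rules above for the non-dict forms can be sketched as a small expansion function. This is only an illustration of the documented behaviour, not the library's implementation: the helper name expand_feature_sets_enabled is hypothetical, and the key list is copied from the documentation above.

```python
# Feature-set keys as listed in the documentation above.
FEATURE_SET_KEYS = [
    "COLS_HOUR_OF_WEEK", "COLS_WEEKEND_SEAS", "COLS_DAY_OF_WEEK_SEAS",
    "COLS_TREND_DAILY_SEAS", "COLS_EVENT_SEAS", "COLS_EVENT_WEEKEND_SEAS",
    "COLS_DAY_OF_WEEK", "COLS_TREND_WEEKEND", "COLS_TREND_DAY_OF_WEEK",
    "COLS_TREND_WEEKLY_SEAS",
]


def expand_feature_sets_enabled(value="auto"):
    """Hypothetical helper: expand any accepted input form into a full dict."""
    if isinstance(value, dict):
        # Keys that are not provided fall back to "auto".
        return {key: value.get(key, "auto") for key in FEATURE_SET_KEYS}
    if value is None:
        # None is equivalent to all values set to False.
        return {key: False for key in FEATURE_SET_KEYS}
    if isinstance(value, bool) or value == "auto":
        # A boolean or "auto" is applied to every feature set.
        return {key: value for key in FEATURE_SET_KEYS}
    raise ValueError(f"Unsupported feature_sets_enabled value: {value!r}")


all_off = expand_feature_sets_enabled(None)
all_on = expand_feature_sets_enabled(True)
partial = expand_feature_sets_enabled({"COLS_HOUR_OF_WEEK": True})
```

In the partial case, only COLS_HOUR_OF_WEEK is forced on; every other feature set stays at “auto”.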
extra_pred_cols (list [str] or None, optional, default None) – Columns to include in
extra_pred_cols
forSilverkiteForecast::forecast
. Other columns are added toextra_pred_cols
by the other parameters of this function (i.e.holidays_*
,growth_term
,regressors
,feature_sets_enabled
). If None, it is treated the same as [].
drop_pred_cols (list [str] or None, default None) – Names of predictor columns to be dropped from the final model. Ignored if None.
explicit_pred_cols (list [str] or None, default None) – Names of the explicit predictor columns which will be the only variables in the final model. Note that this overwrites the generated predictors in the model and may include new terms not appearing in the predictors (e.g. interaction terms). Ignored if None.
regression_weight_col (str or None, default None) – The column name for the weights to be used in weighted regression version of applicable machine-learning models.
simulation_based (bool, default False) – Boolean to specify if the future predictions are to be using simulations or not. Note that this is only used in deciding what parameters should be used for certain components e.g. autoregression, if automatic methods are requested. However, the auto-settings and the prediction settings regarding using simulations should match.
simulation_num (int, default 10) – The number of simulations for when simulations are used for generating forecasts and prediction intervals.
fast_simulation (bool, default False) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals.
remove_intercept (bool, default False) – Whether to remove explicit and implicit intercepts. By default,
patsy
will make the design matrix always full rank. It will always include an intercept term unless we specify “-1” or “+0”. However, if there are categorical variables, even if we specify “-1” or “+0”, it will include an implicit intercept by adding all levels of a categorical variable into the design matrix. Sometimes we don’t want this to happen. Setting this parameter to True will remove both explicit and implicit intercepts.
- Returns
parameters – Parameters to call
forecast
.- Return type
dict
- forecast_simple(*args, **kwargs)[source]
A wrapper around
SilverkiteForecast::forecast
that simplifies some of the input parameters.- Parameters
args (positional args) – Positional args to pass to
convert_simple_silverkite_params
. See that function for details.kwargs (keyword args) – Keyword args to pass to
convert_simple_silverkite_params
. See that function for details.
- Returns
trained_model – The return value of
forecast
A dictionary that includes the fitted model from the functionfit_ml_model_with_evaluation
.- Return type
dict
- forecast(df, time_col, value_col, freq=None, origin_for_time_vars=None, extra_pred_cols=None, drop_pred_cols=None, explicit_pred_cols=None, train_test_thresh=None, training_fraction=0.9, fit_algorithm='linear', fit_algorithm_params=None, daily_event_df_dict=None, daily_event_neighbor_impact=None, daily_event_shifted_effect=None, fs_components_df=<DataFrame with columns name, period, order, seas_names and rows (tod, 24.0, 3, daily), (tow, 7.0, 3, weekly), (toy, 1.0, 5, yearly)>, autoreg_dict=None, past_df=None, lagged_regressor_dict=None, changepoints_dict=None, seasonality_changepoints_dict=None, changepoint_detector=None, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method=None, adjust_anomalous_dict=None, impute_dict=None, regression_weight_col=None, forecast_horizon=None, simulation_based=False, simulation_num=10, fast_simulation=False, remove_intercept=False)
A function for forecasting. It captures growth, seasonality, holidays and other patterns. See “Capturing the time-dependence in the precipitation process for weather risk assessment” as a reference: https://link.springer.com/article/10.1007/s00477-016-1285-8
- Parameters
df (
pandas.DataFrame
) – A data frame which includes the timestamp column as well as the value column.time_col (str) – The column name in
df
representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.value_col (str) – The column name which has the value of interest to be forecasted.
freq (str, optional, default None) – The intended timeseries frequency, as a DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. If None, it is inferred automatically. This frequency is passed through this function as part of the trained model and used at predict time if needed. If the data include missing timestamps and the frequency is monthly/annual, the user should pass this parameter, as it cannot be inferred.
origin_for_time_vars (float, optional, default None) – The time origin used to create continuous variables for time. If None, uses the first record in
df
.extra_pred_cols (list of str, default None) –
Names of the extra predictor columns.
If None, uses [“ct1”], a simple linear growth term.
It can leverage regressors included in
df
and those generated by the other parameters. The following effects will not be modeled unless specified inextra_pred_cols
:included in
df
: e.g. macro-economic factors, related timeseriesfrom
build_time_features_df
: e.g. ct1, ct_sqrt, dow, …from
daily_event_df_dict
: e.g. “events_India”, …
The columns corresponding to the following parameters are included in the model without specification in
extra_pred_cols
.extra_pred_cols
can be used to add interactions with these terms.changepoints_dict: e.g. changepoint0, changepoint1, … fs_components_df: e.g. sin2_dow, cos4_dow_weekly autoreg_dict: e.g. x_lag1, x_avglag_2_3_4, y_avglag_1_to_5
If a regressor is passed in
df
, it needs to be provided to the associated predict function:predict_silverkite
: viafut_df
ornew_external_regressor_df
silverkite.predict_n(_no_sim
: vianew_external_regressor_df
drop_pred_cols (list [str] or None, default None) – Names of predictor columns to be dropped from the final model. Ignored if None
explicit_pred_cols (list [str] or None, default None) – Names of the explicit predictor columns which will be the only variables in the final model. Note that this overwrites the generated predictors in the model and may include new terms not appearing in the predictors (e.g. interaction terms). Ignored if None
train_test_thresh (
datetime.datetime
, optional) – The threshold for the training and testing split, e.g. datetime.datetime(2019, 6, 30). Note that the final returned model is trained using all data. If None, the training split is based on training_fraction.
training_fraction (float, optional) – The fraction of data used for training (0.0 to 1.0). Used only if
train_test_thresh
is None. If this is also None or 1.0, then we skip testing and train on the entire dataset.fit_algorithm (str, optional, default “linear”) –
The type of predictive model used in fitting.
See
fit_model_via_design_matrix
for available options and their parameters.fit_algorithm_params (dict or None, optional, default None) – Parameters passed to the requested fit_algorithm. If None, uses the defaults in
fit_model_via_design_matrix
.daily_event_df_dict (dict or None, optional, default None) –
A dictionary of data frames, each representing events data for the corresponding key. The DataFrame has two columns:
The first column contains event dates. Must be in a format recognized by
pandas.to_datetime
. Must be at daily frequency for proper join. It is joined against the time indf
, converted to a day:pd.to_datetime(pd.DatetimeIndex(df[time_col]).date)
.the second column contains the event label for each date
The column order is important; column names are ignored. The event dates must span their occurrences in both the training and future prediction period.
During modeling, each key in the dictionary is mapped to a categorical variable named
f"{EVENT_PREFIX}_{key}"
, whose value at each timestamp is specified by the corresponding DataFrame.For example, to manually specify a yearly event on September 1 during a training/forecast period that spans 2020-2022:
daily_event_df_dict = {
    "custom_event": pd.DataFrame({
        "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
        "label": ["is_event", "is_event", "is_event"]
    })
}
It’s possible to specify multiple events in the same df. Two events,
"sep"
and"oct"
are specified below for 2020-2021:
daily_event_df_dict = {
    "custom_event": pd.DataFrame({
        "date": ["2020-09-01", "2020-10-01", "2021-09-01", "2021-10-01"],
        "event_name": ["sep", "oct", "sep", "oct"]
    })
}
Use multiple keys if two events may fall on the same date. These events must be in separate DataFrames:
daily_event_df_dict = {
    "fixed_event": pd.DataFrame({
        "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
        "event_name": "fixed_event"
    }),
    "moving_event": pd.DataFrame({
        "date": ["2020-09-01", "2021-08-28", "2022-09-03"],
        "event_name": "moving_event"
    }),
}
The multiple event specification can be used even if events never overlap. An equivalent specification to the second example:
daily_event_df_dict = {
    "sep": pd.DataFrame({
        "date": ["2020-09-01", "2021-09-01"],
        "event_name": "is_event"
    }),
    "oct": pd.DataFrame({
        "date": ["2020-10-01", "2021-10-01"],
        "event_name": "is_event"
    }),
}
Note
The events you want to use must be specified in
extra_pred_cols
. These take the form:f"{EVENT_PREFIX}_{key}"
, whereEVENT_PREFIX
is the constant.Do not use
EVENT_DEFAULT
in the second column. This is reserved to indicate dates that do not correspond to an event.daily_event_neighbor_impact (int, list [int], callable or None, default None) –
The impact of neighboring timestamps of the events in
event_df_dict
. This is for daily events, so the units below are all in days.
For example, if the data is weekly (“W-SUN”) and an event is daily, the event may not fall exactly on a weekly date. You can specify that New Year’s Day on 1/1 affects all dates in its week, e.g. 12/31, 1/1, …, 1/6, so that it is mapped to the weekly date. In this case you may want to map a daily event’s date to several dates, and can specify
neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)]
.Another example is that the data is rolling 7 day daily data, thus a holiday may affect the t, t+1, …, t+6 dates. You can specify
neighbor_impact=7
.If input is int, the mapping is t, t+1, …, t+neighbor_impact-1. If input is list, the mapping is [t+x for x in neighbor_impact]. If input is a function, it maps each daily event’s date to a list of dates.
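The three accepted input types can be illustrated with a short sketch. The helper expand_event_date is hypothetical and only mirrors the mapping rules stated above; it is not the library's internal code. The callable is the one quoted in the weekly-data example.

```python
from datetime import date, timedelta


def expand_event_date(event_date, neighbor_impact=None):
    """Hypothetical helper: map one daily event date to the dates it affects."""
    if neighbor_impact is None:
        return [event_date]
    if callable(neighbor_impact):
        # A function maps the event date to a list of dates.
        return list(neighbor_impact(event_date))
    if isinstance(neighbor_impact, int):
        # An int maps to t, t+1, ..., t+neighbor_impact-1.
        return [event_date + timedelta(days=i) for i in range(neighbor_impact)]
    # A list maps to [t + x for x in neighbor_impact].
    return [event_date + timedelta(days=x) for x in neighbor_impact]


def whole_iso_week(x):
    # Same mapping as the lambda quoted above: Monday through Sunday of x's week.
    return [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i)
            for i in range(7)]


new_year = date(2021, 1, 1)  # a Friday
rolling = expand_event_date(new_year, 7)          # 2021-01-01 .. 2021-01-07
around = expand_event_date(new_year, [-1, 0, 1])  # day before, same day, day after
week = expand_event_date(new_year, whole_iso_week)  # 2020-12-28 .. 2021-01-03
```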
daily_event_shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas
to_offset
. The new events’ names will be the current events’ names with the suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have the name “US_Christmas Day_7D_after”. This is useful when you expect an offset of the current holidays to also have an impact on the time series, or when you want to interact the lagged terms with autoregression. If daily_event_neighbor_impact
is also specified, this will be applied after adding neighboring days.
fs_components_df (
pandas.DataFrame
or None, optional) –
A dataframe with information about Fourier series generation. Must contain columns with the following names:
“name”: name of the timeseries feature, e.g. “tod”, “tow”, etc.
“period”: period of the Fourier series, optional, default 1.0
“order”: order of the Fourier series, optional, default 1.0
“seas_names”: season names corresponding to the name (e.g. “daily”, “weekly”, etc.), optional.
The default includes daily, weekly, and yearly seasonality.
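As a sketch, the default shown in the function signature can be written out as a DataFrame with the four columns described above. The customised variant is purely illustrative, and the yearly order of 10 is an arbitrary choice, not a recommendation.

```python
import pandas as pd

# The default from the signature: daily, weekly and yearly Fourier terms.
fs_components_df = pd.DataFrame({
    "name": ["tod", "tow", "toy"],
    "period": [24.0, 7.0, 1.0],
    "order": [3, 3, 5],
    "seas_names": ["daily", "weekly", "yearly"],
})

# Illustrative variation: drop daily seasonality and raise the yearly order.
custom_fs_components_df = fs_components_df[fs_components_df["name"] != "tod"].copy()
custom_fs_components_df.loc[custom_fs_components_df["name"] == "toy", "order"] = 10
```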
autoreg_dict (dict or str or None, optional, default None) –
If a dict: A dictionary with arguments for
build_autoreg_df
. That function’s parametervalue_col
is inferred from the input of current functionself.forecast
. Other keys are:"lag_dict"
: dict or None"agg_lag_dict"
: dict or None"series_na_fill_func"
: callable
If a str: The string represents a method, and a dictionary will be constructed using that str. Currently the only implemented method is “auto”, which uses __get_default_autoreg_dict to create a dictionary. See more details for the above parameters in
build_autoreg_df
.past_df (
pandas.DataFrame
or None, default None) –The past df used for building autoregression features. This is not necessarily needed since imputation is possible. However, it is recommended to provide
past_df
for more accurate autoregression features and faster training (by skipping imputation). The columns are:
- time_col: pandas.Timestamp or str
The timestamps.
- value_col: float
The past values.
- addition_regressor_cols: float
Any additional regressors.
Note that this
past_df
is assumed to immediately precede
df
without gaps; otherwise an error will be raised.
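Before passing past_df, it can be worth checking that it ends exactly one period before df begins. The following is a minimal sketch under the assumption that the frequency is known. The helper immediately_precedes is hypothetical, and it only checks the boundary between the two frames, not gaps inside either one.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def immediately_precedes(past_df, df, time_col, freq):
    """Hypothetical check: does past_df end exactly one period before df starts?"""
    last_past = pd.to_datetime(past_df[time_col]).max()
    first_current = pd.to_datetime(df[time_col]).min()
    return bool(last_past + to_offset(freq) == first_current)


past_df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=5, freq="D"),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df = pd.DataFrame({
    "ts": pd.date_range("2020-01-06", periods=5, freq="D"),
    "y": [6.0, 7.0, 8.0, 9.0, 10.0],
})

contiguous = immediately_precedes(past_df, df, "ts", "D")        # boundary is contiguous
gapped = immediately_precedes(past_df, df.iloc[1:], "ts", "D")   # one day is missing
```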
lagged_regressor_dict (dict or None, default None) –
A dictionary with arguments for
build_autoreg_df_multi
. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included indf
. The value of each key is either a dict or str. If dict, it has the following keys:"lag_dict"
: dict or None"agg_lag_dict"
: dict or None"series_na_fill_func"
: callableIf str, it represents a method and a dictionary will be constructed using that str. Currently the only implemented method is “auto” which uses __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
lagged_regressor_dict = {
    "regressor1": {
        "lag_dict": {"orders": [1, 2, 3]},
        "agg_lag_dict": {
            "orders_list": [[7, 7 * 2, 7 * 3]],
            "interval_list": [(8, 7 * 2)]},
        "series_na_fill_func": lambda s: s.bfill().ffill()},
    "regressor2": "auto"}
Check the docstring of
build_autoreg_df_multi
for more details for each argument.changepoints_dict (dict or None, optional, default None) –
Specifies the changepoint configuration.
- ”method”: str
The method to locate changepoints. Valid options:
”uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
”custom”. Places changepoints at the specified dates.
”auto”. Automatically detects change points. For configuration, see
find_trend_changepoints
Additional keys to provide parameters for each particular method are described below.
- ”continuous_time_col”: str, optional
Column to apply
growth_func
to, to generate changepoint features Typically, this should match the growth term in the model- ”growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying
growth_func
tocontinuous_time_col
with offsets. If None, uses identity function to usecontinuous_time_col
directly as growth term If changepoints_dict[“method”] == “uniform”, this other key is required:"n_changepoints"
: intnumber of changepoints to evenly space across training period
If changepoints_dict[“method”] == “custom”, this other key is required:
"dates"
: Iterable[Union[int, float, str, datetime]]Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
If changepoints_dict[“method”] == “auto”, the keys that match the parameters in
find_trend_changepoints
, exceptdf
,time_col
andvalue_col
, are optional. Extra keys also include “dates”, “combine_changepoint_min_distance” and “keep_detected” to specify additional custom trend changepoints. These three parameters correspond to the three parameters “custom_changepoint_dates”, “min_distance” and “keep_detected” incombine_detected_and_custom_trend_changepoints
.
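The three methods described above could be configured as follows. This is a sketch: the dates and counts are illustrative, the value formats for “combine_changepoint_min_distance” and “keep_detected” are assumptions, and missing_required_keys is a hypothetical validator that only encodes the required keys stated in this section.

```python
# Evenly spaced changepoints: "n_changepoints" is required.
uniform_changepoints = {"method": "uniform", "n_changepoints": 20}

# Changepoints at specified dates: "dates" is required.
custom_changepoints = {"method": "custom", "dates": ["2020-03-01", "2021-01-15"]}

# Automatic detection, combined with one custom date.
auto_changepoints = {
    "method": "auto",
    "dates": ["2020-03-01"],
    "combine_changepoint_min_distance": "7D",  # assumed format (offset string)
    "keep_detected": False,                    # assumed boolean flag
}

# Required extra keys per method, as stated in the documentation above.
REQUIRED_KEYS = {"uniform": ["n_changepoints"], "custom": ["dates"], "auto": []}


def missing_required_keys(changepoints_dict):
    """Hypothetical validator: list required keys absent for the chosen method."""
    method = changepoints_dict["method"]
    return [key for key in REQUIRED_KEYS[method] if key not in changepoints_dict]


incomplete = missing_required_keys({"method": "custom"})
```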
seasonality_changepoints_dict (dict or None, default None) – The parameter dictionary for seasonality change point detection. Parameters are in
find_seasonality_changepoints
. Notedf
,time_col
,value_col
andtrend_changepoints
are auto populated, and do not need to be provided.changepoint_detector (ChangepointDetector or None, default None) – The ChangepointDetector class
ChangepointDetector
. This is specifically for forecast_simple_silverkite to pass the ChangepointDetector class for plotting purposes, in case users call
forecast_simple_silverkite
with
changepoints_dict["method"] == "auto"
. The trend changepoint detection has to be run there to include possible interaction terms, so the detection result must be passed from there to be included in the output.
min_admissible_value (float or None, optional, default None) – The minimum admissible value to return during prediction. If None, no limit is applied.
max_admissible_value (float or None, optional, default None) – The maximum admissible value to return during prediction. If None, no limit is applied.
uncertainty_dict (dict or None, optional, default None) –
- How to fit the uncertainty model. A dictionary with keys:
"uncertainty_method"
strThe title of the method. Only “simple_conditional_residuals” is implemented in
fit_ml_model
which calculates CIs using residuals"params"
dictA dictionary of parameters needed for the requested
uncertainty_method
. For example, foruncertainty_method="simple_conditional_residuals"
, see parameters ofconf_interval
:"conditional_cols"
"quantiles"
"quantile_estimation_method"
"sample_size_thresh"
"small_sample_size_method"
"small_sample_size_quantile"
If None, no uncertainty intervals are calculated.
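A dictionary with the keys listed above might look like the following sketch. All values are illustrative assumptions. In particular, the option strings for “quantile_estimation_method” and “small_sample_size_method” should be checked against conf_interval before use.

```python
uncertainty_dict = {
    "uncertainty_method": "simple_conditional_residuals",
    "params": {
        # Residual quantiles are estimated separately for each value of these columns.
        "conditional_cols": ["dow"],
        # Lower and upper quantiles, giving a 95% interval.
        "quantiles": [0.025, 0.975],
        # The remaining values are assumptions; see conf_interval for valid options.
        "quantile_estimation_method": "normal_fit",
        "sample_size_thresh": 5,
        "small_sample_size_method": "std_quantiles",
        "small_sample_size_quantile": 0.98,
    },
}

# Nominal coverage implied by the chosen quantiles.
lower, upper = uncertainty_dict["params"]["quantiles"]
coverage = upper - lower
```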
normalize_method (str or None, default None) – If a string is provided, it will be used as the normalization method in
normalize_df
, passed via the argumentmethod
. Available options are: “zero_to_one”, “statistical”, “minus_half_to_half”, “zero_at_origin”. If None, no normalization will be performed. See that function for more details.adjust_anomalous_dict (dict or None, default None) –
If not None, a dictionary with following items:
- ”func”callable
A function to perform adjustment of anomalous data with following signature:
adjust_anomalous_dict["func"]( df=df, time_col=time_col, value_col=value_col, **params) -> {"adjusted_df": adjusted_df, ...}
- ”params”dict
The extra parameters to be passed to the function above.
impute_dict (dict or None, default None) –
If not None, a dictionary with following items:
- ”func”callable
A function to perform imputations with following signature:
impute_dict["func"]( df=df, value_col=value_col, **impute_dict["params"]) -> {"df": imputed_df, ...}
- ”params”dict
The extra parameters to be passed to the function above.
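The two callables only need to follow the signatures given above. Here is a minimal sketch with hypothetical functions: mask_values_above and impute_by_interpolation are illustrative names, the threshold is arbitrary, and applying adjustment before imputation is a choice made for this example only.

```python
import numpy as np
import pandas as pd


def mask_values_above(df, time_col, value_col, threshold):
    """Hypothetical anomaly adjustment: replace values above a threshold with NaN."""
    adjusted_df = df.copy()
    adjusted_df.loc[adjusted_df[value_col] > threshold, value_col] = np.nan
    return {"adjusted_df": adjusted_df}


def impute_by_interpolation(df, value_col, **params):
    """Hypothetical imputation: interpolate, then back-fill and forward-fill."""
    imputed_df = df.copy()
    imputed_df[value_col] = imputed_df[value_col].interpolate(**params).bfill().ffill()
    return {"df": imputed_df}


adjust_anomalous_dict = {"func": mask_values_above, "params": {"threshold": 50.0}}
impute_dict = {"func": impute_by_interpolation, "params": {"method": "linear"}}

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=5, freq="D"),
    "y": [1.0, np.nan, 3.0, 100.0, 5.0],
})

# Call each function the way the signatures above describe.
adjusted = adjust_anomalous_dict["func"](
    df=df, time_col="ts", value_col="y", **adjust_anomalous_dict["params"])["adjusted_df"]
imputed = impute_dict["func"](
    df=adjusted, value_col="y", **impute_dict["params"])["df"]
```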
regression_weight_col (str or None, default None) – The column name for the weights to be used in weighted regression version of applicable machine-learning models.
forecast_horizon (int or None, default None) – The number of periods for which a forecast is needed. Note that this is only used in deciding what parameters should be used for certain components, e.g. autoregression, if automatic methods are requested. While the forecast horizon at prediction time could differ from this value, ideally they should be the same.
simulation_based (bool, default False) – Boolean to specify if the future predictions are to be using simulations or not. Note that this is only used in deciding what parameters should be used for certain components e.g. autoregression, if automatic methods are requested. However, the auto-settings and the prediction settings regarding using simulations should match.
simulation_num (int, default 10) – The number of simulations for when simulations are used for generating forecasts and prediction intervals.
fast_simulation (bool, default False) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals.
remove_intercept (bool, default False) – Whether to remove explicit and implicit intercepts. By default,
patsy
will make the design matrix always full rank. It will always include an intercept term unless we specify “-1” or “+0”. However, if there are categorical variables, even if we specify “-1” or “+0”, it will include an implicit intercept by adding all levels of a categorical variable into the design matrix. Sometimes we don’t want this to happen. Setting this parameter to True will remove both explicit and implicit intercepts.
- Returns
trained_model – A dictionary that includes the fitted model from the function
fit_ml_model_with_evaluation
. The keys are:- df_dropna:
pandas.DataFrame
The
df
with NAs dropped.- df:
pandas.DataFrame
The original
df
.- num_training_points: int
The number of training points.
- features_df:
pandas.DataFrame
The
df
with augmented time features.- min_timestamp:
pandas.Timestamp
The minimum timestamp in data.
- max_timestamp:
pandas.Timestamp
The maximum timestamp in data.
- freq: str
The data frequency.
- inferred_freq: str
The data frequency inferred from data.
- inferred_freq_in_secs: float
The data frequency inferred from data in seconds.
- inferred_freq_in_days: float
The data frequency inferred from data in days.
- time_col: str
The time column name.
- value_col: str
The value column name.
- origin_for_time_vars: float
The first time stamp converted to a float number.
- fs_components_df:
pandas.DataFrame
The dataframe that specifies the seasonality Fourier configuration.
- autoreg_dict: dict
The dictionary that specifies the autoregression configuration.
- lagged_regressor_dict: dict
The dictionary that specifies the lagged regressors configuration.
- lagged_regressor_cols: list [str]
List of regressor column names used for lagged regressor
- normalize_method: str
The normalization method. See the function input parameter
normalize_method
.- daily_event_df_dict: dict
The dictionary that specifies daily events configuration.
- changepoints_dict: dict
The dictionary that specifies changepoints configuration.
- changepoint_values: list [float]
The list of changepoints in continuous time values.
- normalized_changepoint_values: list [float]
The list of changepoints in continuous time values normalized to 0 to 1.
- continuous_time_col: str
The continuous time column name in
features_df
.- growth_func: func
The growth function used in changepoints. None means the identity (linear) function.
- fs_func: func
The function used to generate Fourier series for seasonality.
- has_autoreg_structure: bool
Whether the model has autoregression structure.
- autoreg_func: func
The function to generate autoregression columns.
- min_lag_order: int
Minimal lag order in autoregression.
- max_lag_order: int
Maximal lag order in autoregression.
- has_lagged_regressor_structure: bool
Whether the model has lagged regressor structure.
- lagged_regressor_func: func
The function to generate lagged regressor columns.
- min_lagged_regressor_order: int
Minimal lag order in lagged regressors.
- max_lagged_regressor_order: int
Maximal lag order in lagged regressors.
- uncertainty_dict: dict
The dictionary that specifies uncertainty model configuration.
- pred_cols: list [str]
List of predictor names.
- last_date_for_fit: str or
pandas.Timestamp
The last timestamp used for fitting.
- trend_changepoint_dates: list [
pandas.Timestamp
] List of trend changepoints.
- changepoint_detector: class
The ChangepointDetector class used to detect trend changepoints.
- seasonality_changepoint_dates: list [
pandas.Timestamp
] List of seasonality changepoints.
- seasonality_changepoint_result: dict
The seasonality changepoint detection results.
- fit_algorithm: str
The algorithm used to fit the model.
- fit_algorithm_params: dict
The dictionary of parameters for
fit_algorithm
.- adjust_anomalous_info: dict
A dictionary that has anomaly adjustment results.
- impute_info: dict
A dictionary that has the imputation results.
- forecast_horizon: int
The forecast horizon in steps.
- forecast_horizon_in_days: float
The forecast horizon in days.
- forecast_horizon_in_timedelta: datetime.timedelta
The forecast horizon in timedelta.
- simulation_based: bool
Whether to use simulation in prediction with autoregression terms.
- simulation_num: int, default 10
The number of simulations for when simulations are used for generating forecasts and prediction intervals.
- train_df
pandas.DataFrame
The past dataframe used to generate AR terms. It includes the concatenation of
past_df
anddf
ifpast_df
is provided, otherwise it is thedf
itself.- drop_intercept_colstr or None
The intercept column, explicit or implicit, to be dropped.
- df_dropna:
- Return type
dict
- partition_fut_df(fut_df, trained_model, freq, na_fill_func=<function SilverkiteForecast.<lambda>>)
Takes a dataframe fut_df, which includes the timestamps to forecast, and a trained_model returned by forecast, and partitions fut_df into dataframes according to whether the timestamps fall before, during, or after the training period. It also determines whether the timestamps after the training period start immediately after the last training timestamp or whether there is a gap. If there is a gap, this function creates an expanded dataframe which includes the missing timestamps as well. If fut_df also includes extra columns (they could be regressor columns), this function interpolates those columns.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors. Note that the timestamp column in fut_df must be the same as trained_model["time_col"]. We assume fut_df[time_col] is pandas.datetime64 type.
trained_model (dict) – A fitted silverkite model which is the output of forecast.
freq (str) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed frequencies.
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps. In that case, the regressors need to be interpolated/filled to fill in the gap. The default first interpolates the continuous variables, then applies back-filling and forward-filling for categorical variables.
- Returns
result – A dictionary with following items:
"fut_freq_in_secs": float
The inferred frequency in fut_df.
"training_freq_in_secs": float
The inferred frequency in training data.
"index_before_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is before the training start.
"index_within_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is during the training period.
"index_after_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is after the training end date.
"fut_df_before_training": pandas.DataFrame
A partition of fut_df with timestamps before the training start date.
"fut_df_within_training": pandas.DataFrame
A partition of fut_df with timestamps during the training period.
"fut_df_after_training": pandas.DataFrame
A partition of fut_df with timestamps after the training end date.
"fut_df_gap": pandas.DataFrame or None
If there is a gap between training end date and the first timestamp after the training end date in fut_df, this dataframe can fill the gap between the two. In case fut_df includes extra columns as well, the values for those columns will be filled using na_fill_func.
"fut_df_after_training_expanded": pandas.DataFrame
If there is a gap between training end date and the first timestamp after the training end date in fut_df, this dataframe will include the data for the gaps (fut_df_gap) as well as fut_df_after_training.
"index_after_training_original": list [bool]
A boolean list to determine which rows of fut_df_after_training_expanded correspond to raw data passed by user which are after training end date, appearing in fut_df. Note that this partition corresponds to fut_df_after_training, which is the subset of data in fut_df provided by user and also returned by this function.
"missing_periods_num": int
Number of missing timestamps between the last date of training and first date in fut_df appearing after the training end date.
"inferred_forecast_horizon": int
This is the inferred forecast horizon from fut_df. This is defined to be the distance between the last training end date and last date appearing in fut_df. Note that this value can be smaller or larger than the number of rows of fut_df. This is calculated by adding the number of potentially missing timestamps and the number of time periods appearing after the training end point. Also note if there are no timestamps after the training end point in fut_df, this value will be zero.
"forecast_partition_summary": dict
A dictionary which includes the size of various partitions of fut_df as well as the missing timestamps if needed. The dictionary keys are as follows:
"len_before_training": the number of time periods before training start
"len_within_training": the number of time periods within training
"len_after_training": the number of time periods after training
"len_gap": the number of missing time periods between training data and future time stamps in fut_df
- Return type
dict
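The partitioning described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name partition_by_training_period and its arguments (time_col, train_start, train_end) are hypothetical, and it assumes all timestamps lie on a regular grid of frequency freq.

```python
import pandas as pd


def partition_by_training_period(fut_df, time_col, train_start, train_end, freq):
    """Hypothetical helper: split fut_df around the window [train_start, train_end]."""
    ts = fut_df[time_col]
    before = ts < train_start
    within = (ts >= train_start) & (ts <= train_end)
    after = ts > train_end

    fut_df_after = fut_df[after]
    missing_periods_num = 0
    if len(fut_df_after) > 0:
        # Grid from the training end to the first future timestamp (both included).
        expected = pd.date_range(
            start=train_end, end=fut_df_after[time_col].min(), freq=freq)
        # Drop the two endpoints to count only the timestamps in the gap.
        missing_periods_num = max(len(expected) - 2, 0)

    return {
        "index_before_training": before.tolist(),
        "index_within_training": within.tolist(),
        "index_after_training": after.tolist(),
        "missing_periods_num": missing_periods_num,
        # Gap periods plus the periods present after the training end.
        "inferred_forecast_horizon": missing_periods_num + int(after.sum()),
    }


fut_df = pd.DataFrame({"ts": pd.to_datetime(
    ["2020-01-01", "2020-01-05", "2020-01-13", "2020-01-14"])})
result = partition_by_training_period(
    fut_df, "ts", pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-10"), "D")
```

Here training ends on 2020-01-10 and the first future timestamp is 2020-01-13, so two daily periods are missing and the inferred horizon is four, which is larger than the two rows that fall after training.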
- predict(fut_df, trained_model, freq=None, past_df=None, new_external_regressor_df=None, include_err=None, force_no_sim=False, simulation_num=None, fast_simulation=None, na_fill_func=<function SilverkiteForecast.<lambda>>)
Performs predictions using a silverkite model. It determines whether the prediction should be simulation-based and then predicts using that setting. The logic for determining whether simulations are needed is:
If the model is not autoregressive, no simulations are needed.
If the model is autoregressive but the minimum lag appearing in the model is at least as large as the forecast horizon (min_lag_order >= forecast_horizon), simulations are not needed. This is because the lags can be calculated fully without predicting the future.
The user can override the above behavior and force no simulations using the force_no_sim argument, in which case some lags will be imputed. Most users should not use this option. Scenarios where an advanced user might want it are: (a) min_lag_order >= forecast_horizon does not hold strictly but is close to holding; (b) the user wants fast predictions and the autoregression lags are normalized. In that case the predictions returned could correspond to an approximation of a model without autoregression.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str, optional, default None) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed strings. If None, it is extracted from the trained_model input.
past_df (pandas.DataFrame or None, default None) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py. Note that this past_df can cover any time before the training end timestamp, but cannot exceed it.
new_external_regressor_df (pandas.DataFrame or None, default None) – Contains the regressors not already included in fut_df.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
force_no_sim (bool, default False) – If True, prediction with no simulations is forced. This can be useful when speed is of concern or for validation purposes. In this case, the potential non-available lags will be imputed. Most users should not set this to True as the consequences could be hard to quantify.
simulation_num (int or None, default None) – The number of simulations to run when simulations are used for generating forecasts and prediction intervals. If None, it will be inferred from the model (trained_model).
fast_simulation (bool or None, default None) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. If None, it will be inferred from the model (trained_model).
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps in fut_df, for example when the user wants to predict a period which does not immediately follow training. In that case, the regressors need to be interpolated/filled to fill in the gap. The default first interpolates the continuous variables, then applies back-filling and forward-filling for categorical variables.
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
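The decision rule described for predict can be written as a small predicate. This is a sketch of the documented logic only, not the library's code; the function name needs_simulation is hypothetical, and its arguments mirror the has_autoreg_structure, min_lag_order and forecast horizon values mentioned in this reference.

```python
def needs_simulation(has_autoreg_structure, min_lag_order, forecast_horizon,
                     force_no_sim=False):
    """Hypothetical predicate mirroring the documented rule for simulations."""
    if force_no_sim:
        # The user explicitly opted out; unavailable lags would be imputed.
        return False
    if not has_autoreg_structure:
        # No autoregressive terms, so nothing depends on future predictions.
        return False
    # All lags are observable only when min_lag_order >= forecast_horizon.
    return min_lag_order < forecast_horizon


decisions = {
    "no_autoreg": needs_simulation(False, None, 7),
    "lag_covers_horizon": needs_simulation(True, 7, 7),
    "short_lag": needs_simulation(True, 1, 7),
    "forced_off": needs_simulation(True, 1, 7, force_no_sim=True),
}
```

Only the short_lag case requires simulation: a lag of 1 with a horizon of 7 needs predicted values to build the lags for steps 2 through 7.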
- predict_n(fut_time_num, trained_model, freq=None, past_df=None, new_external_regressor_df=None, include_err=None, force_no_sim=False, simulation_num=None, fast_simulation=None, na_fill_func=<function SilverkiteForecast.<lambda>>)
This is the forecast function which can be used to forecast a number of periods into the future. It determines whether the prediction should be simulation-based and then predicts using that setting. Currently, if the silverkite model uses autoregression, simulation-based prediction/CIs are used.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str, optional, default None) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed frequencies. If None, it is extracted from the trained_model input.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
simulation_num (int, optional, default 10) – The number of simulated series to be used in prediction.
fast_simulation (bool or None, default None) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. If None, it will be inferred from the model (trained_model).
include_err (bool or None, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
force_no_sim (bool, default False) – If True, prediction with no simulations is forced. This can be useful when speed is of concern or for validation purposes.
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps. In that case, the regressors need to be interpolated/filled to fill in the gap. The default first interpolates the continuous variables, then applies back-filling and forward-filling for categorical variables.
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
- predict_n_no_sim(fut_time_num, trained_model, freq, new_external_regressor_df=None, time_features_ready=False, regressors_ready=False)
This is the forecast function which can be used to forecast. It accepts extra regressors (extra_pred_cols) originally in df via new_external_regressor_df.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str) – Frequency of future predictions. Accepts any valid frequency for pd.date_range.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
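Since the predict_n family takes a number of future periods and a frequency accepted by pd.date_range, the future timestamps can be pictured as follows. This is an illustrative sketch only: the helper make_future_timestamps and the column name "ts" are hypothetical, and it assumes the last training timestamp lies on the frequency grid (true for "D", not necessarily for anchored offsets such as weekly ones).

```python
import pandas as pd


def make_future_timestamps(last_date_for_fit, fut_time_num, freq, time_col="ts"):
    """Hypothetical helper: the fut_time_num timestamps following the training end."""
    # Generate one extra period starting at the training end, then drop it,
    # so the result starts one step after the last training timestamp.
    dates = pd.date_range(
        start=last_date_for_fit, periods=fut_time_num + 1, freq=freq)[1:]
    return pd.DataFrame({time_col: dates})


fut_df = make_future_timestamps(pd.Timestamp("2020-01-10"), 3, "D")
```

The resulting frame has three rows, starting the day after the assumed training end date.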
- predict_n_via_sim(fut_time_num, trained_model, freq, new_external_regressor_df=None, simulation_num=10, fast_simulation=False, include_err=None)
This is the forecast function which can be used to forecast. This function’s predictions are constructed using simulations from the fitted series. This supports both predict_silverkite_via_sim and predict_silverkite_via_sim_fast, depending on the value of the passed argument fast_simulation. The past_df is set to be the training data which is available in trained_model. It accepts extra regressors (extra_pred_cols) originally in df via new_external_regressor_df.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str) – Frequency of future predictions. Accepts any valid frequency for pd.date_range.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
simulation_num (int, optional, default 10) – The number of simulated series to be used in prediction.
fast_simulation (bool, default False) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
- predict_no_sim(fut_df, trained_model, past_df=None, new_external_regressor_df=None, time_features_ready=False, regressors_ready=False)
Performs predictions for the dates in fut_df. If extra_pred_cols refers to a column in df, either fut_df or new_external_regressor_df must contain the regressors and the columns needed for lagged regressors.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
- predict_via_sim(fut_df, trained_model, past_df=None, new_external_regressor_df=None, simulation_num=10, include_err=None)
Performs predictions and calculates uncertainty using multiple simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
simulation_num (int, optional, default 10) – The number of simulated series to be used in prediction.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
- predict_via_sim_fast(fut_df, trained_model, past_df=None, new_external_regressor_df=None)
Performs predictions and calculates uncertainty using one simulation of the future, calculating the error separately (not relying on multiple simulations). Because of this, the prediction intervals well into the future will be narrower than those of predict_via_sim and therefore less accurate. However, there will be a major speed gain, which might be important in some use cases.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame or None, default None) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame or None, default None) – Contains the regressors not already included in fut_df.
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
- simulate(fut_df, trained_model, past_df=None, new_external_regressor_df=None, include_err=True, time_features_ready=False, regressors_ready=False)
A function to simulate future series. If the fitted model supports uncertainty, e.g. via uncertainty_dict, errors are incorporated into the simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
include_err (bool) – Boolean to determine if errors are to be incorporated in the simulations.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with following items
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
A time column with the column name being trained_model["time_col"].
The predicted response in value_col column.
Quantile summary response in QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
Error std in ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
- simulate_multi(fut_df, trained_model, simulation_num=10, past_df=None, new_external_regressor_df=None, include_err=None)
A function to simulate future series. If the fitted model supports uncertainty, e.g. via uncertainty_dict, errors are incorporated into the simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
simulation_num (int) – The number of simulated series (each of which has the same number of rows as fut_df) to be stacked up row-wise. This number must be larger than zero.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with following items
- "fut_df_sim": pandas.DataFrame
Row-wise concatenation of dataframes, each being the same as the input dataframe (fut_df) with an added column for the response and a new column, "sim_label", to differentiate the simulations. The number of rows of the returned dataframe is simulation_num times the number of rows of fut_df. If value_col already appears in fut_df, it will be over-written.
- "x_mat": pandas.DataFrame
simulation_num copies of the design matrix of the predictive machine-learning model, concatenated. An extra index column ("original_row_index") is also added for aggregation when needed. Note that all copies will be the same except when auto-regression is utilized.
- Return type
dict
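The shape of "fut_df_sim" can be illustrated with a small pandas sketch. This is not the library's code: stack_simulations and simulate_once are hypothetical names, the random draws stand in for a real simulated forecast, and the label format "sim0", "sim1", ... is an assumption made for the example.

```python
import numpy as np
import pandas as pd


def stack_simulations(fut_df, value_col, simulate_once, simulation_num=10):
    """Hypothetical helper: stack simulation_num simulated copies of fut_df row-wise."""
    if simulation_num <= 0:
        raise ValueError("simulation_num must be larger than zero")
    frames = []
    for i in range(simulation_num):
        sim_df = fut_df.copy()
        # Overwrites value_col if it already exists in fut_df.
        sim_df[value_col] = simulate_once(len(fut_df))
        sim_df["sim_label"] = f"sim{i}"  # label format is an assumption
        frames.append(sim_df)
    return pd.concat(frames, axis=0, ignore_index=True)


rng = np.random.default_rng(0)
fut_df = pd.DataFrame({"ts": pd.date_range("2020-01-01", periods=4, freq="D")})
fut_df_sim = stack_simulations(
    fut_df, "y", lambda n: rng.normal(size=n), simulation_num=3)

# Aggregating across simulations per timestamp gives one summary row per date.
mean_by_ts = fut_df_sim.groupby("ts")["y"].mean()
```

With four future timestamps and three simulations, the stacked frame has twelve rows, and grouping by the time column recovers one aggregated value per timestamp.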
- greykite.algo.uncertainty.conditional.conf_interval.conf_interval(df, distribution_col, offset_col=None, sigma_scaler=None, h_mat=None, x_mean=None, conditional_cols=None, quantiles=(0.005, 0.025, 0.975, 0.995), quantile_estimation_method='normal_fit', sample_size_thresh=5, small_sample_size_method='std_quantiles', small_sample_size_quantile=0.95, min_admissible_value=None, max_admissible_value=None)[source]
A function to calculate confidence intervals (CI) for values given in distribution_col. We allow for calculating as many quantiles as needed (specified by quantiles) as opposed to only two quantiles representing a typical CI.
Two methods are available for quantiles calculation for each slice of data (given in conditional_cols).
“normal_fit” : CI is calculated using quantiles of a normal distribution fit.
“ecdf” : CI is calculated using quantiles of empirical cumulative distribution function.
offset_col is used in the prediction phase to shift the calculated quantiles appropriately.
- Parameters
df (pandas.DataFrame) – The dataframe with the following columns:
distribution_col,
conditional_cols (optional),
offset_col (optional column)
distribution_col (str) – The column containing the values for the variable for which confidence interval is needed.
offset_col (str or None, default None) – The column containing the values by which the computed quantiles for distribution_col are shifted. Only used during prediction phase. If None, quantiles are not shifted.
sigma_scaler (float or None, default None) – Scaling factor that is applied to the estimated standard deviation sigma in regression setting. Used to take into account the degrees of freedom in the fitted model, otherwise sigma is under-estimated by just using the distribution of the residuals. The formula is sigma_scaler = np.sqrt((n_train - 1) / (n_train - p_effective)). Only useful in linear and ridge regression models. If None, no scaling will be done.
h_mat (np.ndarray or None, default None) – The H matrix np.linalg.pinv(X.T @ X + alpha * np.eye(p)) @ X.T in regression setting. Dimension is p (number of parameters) by n_train, and alpha is the regularization term extracted from ml_model. See fit_ml_model for details.
x_mean (np.ndarray or None, default None) – Column mean of x_mat as a row vector. This is stored and used in ridge regression to compute the prediction intervals. In other methods, it is set to None.
conditional_cols (list [str] or None, default None) – These columns are used to slice the data first then calculate quantiles for each slice.
quantiles (list [float], default (0.005, 0.025, 0.975, 0.995)) – The quantiles calculated for each slice. These quantiles can be then used to construct the desired CIs. The default values [0.005, 0.025, 0.975, 0.995] can be used to construct 99 and 95 percent CIs.
quantile_estimation_method (str, default “normal_fit”) – There are two options implemented for the quantile estimation method (conditional on slice):
“normal_fit”: Uses the standard deviation of the values in each slice to compute normal distribution quantiles.
“ecdf”: Uses the empirical cumulative distribution function to calculate sample quantiles.
sample_size_thresh (int, default 5) – The minimum sample size for each slice where we allow for using the conditional distribution (conditioned on the “conditional_cols” argument). If sample size for that slice is smaller than this, we use the fallback method.
small_sample_size_method (str, default “std_quantiles”) – The method to use for slices with small sample size. The “std_quantiles” method is implemented: it looks at the response std for each slice with sample size >= “sample_size_thresh” and takes the row whose std is closest to the “small_sample_size_quantile” quantile. It assigns that row to act as fall-back for calculating conf intervals.
small_sample_size_quantile (float, default 0.95) – Quantile to calculate for small sample size.
min_admissible_value (Union[float, double, int], default None) – This is the lowest admissible value for the obtained CI limits; any value below this will be mapped back to this value.
max_admissible_value (Union[float, double, int], default None) – This is the highest admissible value for the obtained CI limits; any higher value will be mapped back to this value.
- Returns
uncertainty_model – Dictionary with following items (main component is the predict function).
- "ecdf_df": pandas.DataFrame
ecdf_df generated by “estimate_empirical_distribution”.
- "ecdf_df_overall": pandas.DataFrame
ecdf_df_overall generated by “estimate_empirical_distribution”.
- "ecdf_df_fallback": pandas.DataFrame
ecdf_df_fallback, fallback data used to get the CI quantiles when the sample size for a slice is small or that slice is unobserved. If small_sample_size_method = “std_quantiles”, we use std quantiles to pick a slice which has a std close to that quantile and fall back to that slice. Otherwise we fall back to “ecdf_overall”.
- "distribution_col": str
Input distribution_col.
- "offset_col": str
Input offset_col.
- "quantiles": list [float]
Input quantiles.
- "min_admissible_value": float
Input min_admissible_value.
- "max_admissible_value": float
Input max_admissible_value.
- "conditional_cols": list [str]
Input conditional_cols.
- "std_col": str
The column name with standard deviations.
- "quantile_summary_col": str
The column name with computed quantiles.
- "fall_back_for_all": bool
Indicates if fallback method should be used for the whole dataset.
- Return type
dict
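The “normal_fit” idea, per-slice standard deviations turned into normal quantiles, can be sketched with pandas and the standard library. This is not the library's implementation: normal_fit_quantiles is a hypothetical helper, the output column names ("std", "q_0.975", ...) are invented for the example, and it assumes zero-centered quantiles that would later be shifted by offset_col at prediction time.

```python
from statistics import NormalDist

import pandas as pd


def normal_fit_quantiles(df, distribution_col, conditional_cols,
                         quantiles=(0.005, 0.025, 0.975, 0.995)):
    """Hypothetical helper: zero-centered normal quantiles for each slice."""
    # Sample standard deviation of the values within each slice.
    out = (df.groupby(conditional_cols)[distribution_col]
             .std()
             .reset_index(name="std"))
    for q in quantiles:
        # Quantile of N(0, std^2) is std times the standard normal quantile.
        out[f"q_{q}"] = out["std"] * NormalDist().inv_cdf(q)
    return out


residuals = pd.DataFrame({
    "slice": ["a", "a", "a", "b", "b", "b"],
    "residual": [-1.0, 0.0, 1.0, -2.0, 0.0, 2.0],
})
summary = normal_fit_quantiles(residuals, "residual", ["slice"])
row_a = summary[summary["slice"] == "a"].iloc[0]
```

Slice "a" has a sample standard deviation of 1, so its 0.975 quantile is the standard normal value of about 1.96, and the 0.025 and 0.975 quantiles are symmetric around zero.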
- greykite.algo.changepoint.adalasso.changepoints_utils.combine_detected_and_custom_trend_changepoints(detected_changepoint_dates, custom_changepoint_dates, min_distance=None, keep_detected=False)[source]
Adds custom trend changepoints to detected trend changepoints.
Compares the distance between custom changepoints and detected changepoints, and drops either the detected changepoint or the custom changepoint, depending on keep_detected, if their distance is less than min_distance.
- Parameters
detected_changepoint_dates (list) – A list of detected trend changepoints, parsable by pandas.to_datetime.
custom_changepoint_dates (list) – A list of additional custom trend changepoints, parsable by pandas.to_datetime.
min_distance (DateOffset, Timedelta, str or None, default None) – The minimum distance between detected changepoints and custom changepoints. If a detected changepoint and a custom changepoint are less than min_distance apart, either the detected changepoint or the custom changepoint is dropped according to keep_detected. Distances within the detected changepoints or within the custom changepoints are not compared. Note: the maximal unit is ‘D’, i.e., you may only use units no larger than ‘D’, such as ‘10D’, ‘5H’, ‘100T’, ‘200S’. The reason is that ‘W’, ‘M’ or higher units have either cycles or an indefinite number of days, and thus cannot be parsed by pandas as a timedelta. For example, see pandas.tseries.frequencies.to_offset.
keep_detected (bool, default False) – When the distance between a detected changepoint and a custom changepoint is less than min_distance, whether to keep the detected changepoint (True) or the custom changepoint (False).
- Returns
combined_changepoint_dates – A list of combined changepoints in ascending order.
- Return type
list
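The dropping rule described above can be sketched in a few lines of pandas. This is a simplified illustration, not greykite's implementation: the function name `combine_changepoints_sketch` is hypothetical, and it assumes `min_distance` is a string or Timedelta that `pandas.to_timedelta` can parse (DateOffset inputs are not handled here).

```python
import pandas as pd


def combine_changepoints_sketch(detected, custom, min_distance=None, keep_detected=False):
    """Hypothetical helper: merge two changepoint lists, resolving near-duplicates."""
    detected = list(pd.to_datetime(detected))
    custom = list(pd.to_datetime(custom))
    if min_distance is not None:
        min_distance = pd.to_timedelta(min_distance)
        if keep_detected:
            # Drop custom changepoints that are too close to any detected one.
            custom = [c for c in custom
                      if all(abs(c - d) >= min_distance for d in detected)]
        else:
            # Drop detected changepoints that are too close to any custom one.
            detected = [d for d in detected
                        if all(abs(d - c) >= min_distance for c in custom)]
    # Distances within each list are never compared; only the union is sorted.
    return sorted(set(detected + custom))


combined = combine_changepoints_sketch(
    detected=["2020-01-01", "2020-02-01"],
    custom=["2020-01-03", "2020-03-01"],
    min_distance="5D",
    keep_detected=False)
print([d.strftime("%Y-%m-%d") for d in combined])
# ['2020-01-03', '2020-02-01', '2020-03-01']
```

With `keep_detected=True` the same inputs would keep "2020-01-01" and drop "2020-01-03" instead.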
- greykite.common.features.timeseries_lags.build_autoreg_df(value_col, lag_dict=None, agg_lag_dict=None, series_na_fill_func=<function <lambda>>)[source]
This function generates a function (“build_lags_func” in the returned dict) which, when called, builds a lag data frame and an aggregated lag data frame using the “build_lag_df” and “build_agg_lag_df” functions. Note: when training, validating and testing (e.g. cross-validation) for forecasting, this function needs to be applied after the data split is done. This is especially important if “series_na_fill_func” uses future values in interpolation, which is the case for the default, lambda s: s.bfill().ffill().
- Parameters
value_col – str the column name for the values of interest
lag_dict –
Optional[dict] A dictionary which encapsulates the parameters to be passed to the function “build_lag_df”. Expected items are:
- ”max_order”: Optional[int]
the max_order for creating lags
- ”orders”: Optional[List[int]]
the orders for which lag is needed
agg_lag_dict –
Optional[dict] A dictionary encapsulating the parameters to be passed to the function “build_agg_lag_df”. Expected items are:
- ”orders_list”: List[List[int]]
A list of lists of integers. Each inner list gives the lag orders to be aggregated. See build_lag_df arguments for more details.
- ”interval_list”: List[tuple]
A list of tuples, each of length 2. Each tuple is used to construct an aggregated lag using all orders within that range. See build_agg_lag_df arguments for more details.
- ”agg_func”: “mean” or func (pd.DataFrame -> pd.DataFrame)
The function used for aggregation in “build_agg_lag_df”. If this key is not passed, the default of “build_agg_lag_df” is used. If “mean”, uses pandas.DataFrame.mean.
series_na_fill_func –
(pd.Series -> pd.Series) default: lambda s: s.bfill().ffill(). This function is used to fill in missing data. The default works by first back-filling and then forward-filling. This function should not be applied to data before the CV split is done.
- Returns
dict a dictionary with the following items:
- ”build_lags_func”: func
pd.DataFrame -> dict(lag_df=pd.DataFrame, agg_lag_df=pd.DataFrame). A function which takes a df (which must contain value_col) as input, calculates lag_df and agg_lag_df, and returns them.
- ”lag_col_names”: Optional[List[str]]
The list of generated column names for the returned lag_df when “build_lags_func” is applied
- ”agg_lag_col_names”: Optional[List[str]]
The list of generated column names for returned agg_lag_df when “build_lags_func” is applied
- ”max_order”: int
the maximum lag order needed in the calculation of “build_lags_func”
- ”min_order”: int
the minimum lag order needed in the calculation of “build_lags_func”
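The warning about applying the fill function only after the data split can be made concrete with a small pandas sketch. It does not call greykite; the toy series and the split point are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# The default fill function named in the docstring above.
fill = lambda s: s.bfill().ffill()

series = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
train = series.iloc[:3]          # assume the first 3 points are the training split

# Filling BEFORE the split: the gap is back-filled with 4.0,
# a value that belongs to the test period.
leaky = fill(series).iloc[:3]

# Filling AFTER the split: no later value exists inside the training data,
# so the gap is forward-filled with 2.0.
safe = fill(train)

print(leaky.tolist())  # [1.0, 2.0, 4.0]
print(safe.tolist())   # [1.0, 2.0, 2.0]
```

The two results differ only in the filled position, which is exactly where information from the future would leak into training.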
- greykite.common.features.timeseries_lags.build_agg_lag_df(value_col, df=None, orders_list=[], interval_list=[], agg_func='mean', agg_name='avglag', max_order=None)[source]
A function which returns a dataframe including aggregated (e.g. averaged) time series lags in the form of dataframe columns. By “aggregated lags”, we mean an aggregate of several lags using an aggregation function given in “agg_func”. The advantage of “aggregated lags” over regular lags is we can aggregate (e.g. average) many lags in the past instead of using a large number of lags. This is useful in many applications and avoids over-fitting.
For a time series mathematically denoted by Y(t), one could consider the average lag processes as follows:
- the average of last 3 values:
“avg(t) = (Y(t-1) + Y(t-2) + Y(t-3)) / 3”
- the average of 7th, 14th and 21st lags:
“avg(t) = (Y(t-7) + Y(t-14) + Y(t-21)) / 3”
- See following references:
Reza Hosseini et al. (2014) Non-linear time-varying stochastic models for agroclimate risk assessment, Environmental and Ecological Statistics https://link.springer.com/article/10.1007/s10651-014-0295-2
Alireza Hosseini et al. (2017) Capturing the time-dependence in the precipitation process for weather risk assessment, Stochastic Environmental Research and Risk Assessment https://link.springer.com/article/10.1007/s00477-016-1285-8
- Parameters
value_col – str the column name for the values of interest
df – Optional[pd.DataFrame] the data frame which includes the time series of interest
orders_list –
List[List[int]] a list of lag-order lists, one per aggregated lag. For example if agg_func = np.mean and orders_list = [[1, 2, 3], [7, 14, 21]] then we construct two averaged lags:
avg(t) = (Y(t-1) + Y(t-2) + Y(t-3)) / 3 and avg(t) = (Y(t-7) + Y(t-14) + Y(t-21)) / 3
interval_list –
List[tuple[int]] a list of (lag) intervals, where each interval is a tuple of length 2 whose first element is the lower bound and whose second element is the upper bound (both inclusive).
For example if interval_list = [(1, 3), (8, 11)] then we construct two “average lagged” variables:
avg(t) = (Y(t-1) + Y(t-2) + Y(t-3)) / 3 and avg(t) = (Y(t-8) + Y(t-9) + Y(t-10) + Y(t-11)) / 4
agg_func – “mean” or callable, default: “mean” the function used to aggregate the lag orders for each set of orders specified in either orders_list or interval_list. Typically this function is an averaging function such as np.mean or np.median, but more sophisticated functions are allowed. If “mean”, uses pandas.DataFrame.mean.
agg_name –
str, default: “avglag” the aggregate function name used in constructing the column names for the output data frame. For example if
value_col = “y”
orders = [7, 14, 21]
agg_name = “avglag”
then the column name appearing in the output data frame will be “y_avglag_7_14_21”.
max_order –
Optional[int] the maximum lag order needed to calculate the lag aggregates. This is usually inferred from the arguments orders_list and interval_list, unless max_order has already been calculated before calling this function. Hence this argument is optional and is only included for computational efficiency.
- Returns
dict dictionary with following items:
- ”col_names”: List[str]
the generated column names
- ”agg_lag_df”: Optional[pd.DataFrame]
a data frame with the average lag columns. The column names are constructed in a way that reflects what lags are averaged. For example if
value_col = “y”
agg_name = “avglag”
orders_list = [[1, 2, 3], [7, 14, 21]]
Then the column names are “y_avglag_1_2_3”, “y_avglag_7_14_21” and if
interval_list = [(1, 3), (8, 11)]
Then the column names are “y_avglag_1_to_3”, “y_avglag_8_to_11”
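The aggregated-lag construction and the column naming described above can be sketched with plain pandas. This is an illustration only, not the library's code: `agg_lag_sketch` is a hypothetical name, only the mean aggregation is shown, and leaving rows without a full lag history as NaN is a choice of this sketch that may differ from greykite.

```python
import pandas as pd


def agg_lag_sketch(df, value_col, orders_list=(), interval_list=(), agg_name="avglag"):
    """Hypothetical helper: average several lags of one column."""
    out = pd.DataFrame(index=df.index)
    for orders in orders_list:
        # One shifted copy of the series per requested lag order.
        lags = pd.concat([df[value_col].shift(k) for k in orders], axis=1)
        name = f"{value_col}_{agg_name}_" + "_".join(str(k) for k in orders)
        out[name] = lags.mean(axis=1, skipna=False)
    for lower, upper in interval_list:
        # Intervals include both bounds: (1, 3) means lags 1, 2 and 3.
        lags = pd.concat([df[value_col].shift(k) for k in range(lower, upper + 1)], axis=1)
        out[f"{value_col}_{agg_name}_{lower}_to_{upper}"] = lags.mean(axis=1, skipna=False)
    return out


df = pd.DataFrame({"y": [float(i) for i in range(1, 11)]})
agg_lag_df = agg_lag_sketch(df, "y", orders_list=[[1, 2, 3]], interval_list=[(1, 3)])
print(list(agg_lag_df.columns))  # ['y_avglag_1_2_3', 'y_avglag_1_to_3']
print(agg_lag_df.loc[3, "y_avglag_1_2_3"])  # 2.0, i.e. (3 + 2 + 1) / 3
```

Both columns hold the same values here because the order list [1, 2, 3] and the interval (1, 3) describe the same set of lags.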
- greykite.common.features.timeseries_lags.build_autoreg_df_multi(value_lag_info_dict, series_na_fill_func=<function <lambda>>)[source]
A function which returns a function to build autoregression dataframe for multiple value columns. This function should not be applied to data before CV split is done.
- Parameters
value_lag_info_dict (dict [str, dict]) –
A dictionary whose keys are the target value columns (value_col). For each of these value columns, the value is a dictionary with the following keys:
lag_dict, agg_lag_dict, series_na_fill_func
The value_col and the above three variables are then passed to the following function:
build_autoreg_df(value_col, lag_dict, agg_lag_dict, series_na_fill_func)
Check the greykite.common.features.timeseries_lags.build_autoreg_df docstring for more details on each argument.
series_na_fill_func (callable, (pd.Series -> pd.Series)) – default: lambda s: s.bfill().ffill(). This function is used to fill in missing data. The default works by first back-filling and then forward-filling.
- Returns
A dictionary with the following items:
- “autoreg_func”: callable, (pd.DataFrame -> pd.DataFrame)
A function which can be applied to a dataframe and returns a dataframe which has the lagged values for all the relevant columns.
- ”autoreg_col_names”: List[str]
A list of all the generated columns.
- ”autoreg_orig_col_names”: List[str]
A list of all the original target value columns.
- ”max_order”: int
Maximum lag order for all target value columns.
- ”min_order”: int
Minimum lag order for all target value columns.
- Return type
dict
- class greykite.algo.forecast.silverkite.forecast_silverkite.SilverkiteForecast(constants: ~greykite.algo.forecast.silverkite.constants.silverkite_seasonality.SilverkiteSeasonalityEnumMixin = <greykite.algo.forecast.silverkite.constants.silverkite_constant.SilverkiteConstant object>)[source]
- forecast(df, time_col, value_col, freq=None, origin_for_time_vars=None, extra_pred_cols=None, drop_pred_cols=None, explicit_pred_cols=None, train_test_thresh=None, training_fraction=0.9, fit_algorithm='linear', fit_algorithm_params=None, daily_event_df_dict=None, daily_event_neighbor_impact=None, daily_event_shifted_effect=None, fs_components_df=<DataFrame: name=['tod', 'tow', 'toy'], period=[24.0, 7.0, 1.0], order=[3, 3, 5], seas_names=['daily', 'weekly', 'yearly']>, autoreg_dict=None, past_df=None, lagged_regressor_dict=None, changepoints_dict=None, seasonality_changepoints_dict=None, changepoint_detector=None, min_admissible_value=None, max_admissible_value=None, uncertainty_dict=None, normalize_method=None, adjust_anomalous_dict=None, impute_dict=None, regression_weight_col=None, forecast_horizon=None, simulation_based=False, simulation_num=10, fast_simulation=False, remove_intercept=False)[source]
A function for forecasting. It captures growth, seasonality, holidays and other patterns. See “Capturing the time-dependence in the precipitation process for weather risk assessment” as a reference: https://link.springer.com/article/10.1007/s00477-016-1285-8
- Parameters
df (pandas.DataFrame) – A data frame which includes the timestamp column as well as the value column.
time_col (str) – The column name in df representing time for the time series data. The time column can be anything that can be parsed by pandas DatetimeIndex.
value_col (str) – The column name which has the value of interest to be forecasted.
freq (str, optional, default None) – The intended timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases. If None automatically inferred. This frequency will be passed through this function as a part of the trained model and used at predict time if needed. If data include missing timestamps, and frequency is monthly/annual, user should pass this parameter, as it cannot be inferred.
origin_for_time_vars (float, optional, default None) – The time origin used to create continuous variables for time. If None, uses the first record in df.
extra_pred_cols (list of str, default None) –
Names of the extra predictor columns.
If None, uses [“ct1”], a simple linear growth term.
It can leverage regressors included in df and those generated by the other parameters. The following effects will not be modeled unless specified in extra_pred_cols:
- included in df: e.g. macro-economic factors, related timeseries
- from build_time_features_df: e.g. ct1, ct_sqrt, dow, …
- from daily_event_df_dict: e.g. “events_India”, …
The columns corresponding to the following parameters are included in the model without specification in extra_pred_cols. extra_pred_cols can be used to add interactions with these terms.
- changepoints_dict: e.g. changepoint0, changepoint1, …
- fs_components_df: e.g. sin2_dow, cos4_dow_weekly
- autoreg_dict: e.g. x_lag1, x_avglag_2_3_4, y_avglag_1_to_5
If a regressor is passed in df, it needs to be provided to the associated predict function:
- predict_silverkite: via fut_df or new_external_regressor_df
- silverkite.predict_n(_no_sim: via new_external_regressor_df
drop_pred_cols (list [str] or None, default None) – Names of predictor columns to be dropped from the final model. Ignored if None
explicit_pred_cols (list [str] or None, default None) – Names of the explicit predictor columns which will be the only variables in the final model. Note that this overwrites the generated predictors in the model and may include new terms not appearing in the predictors (e.g. interaction terms). Ignored if None
train_test_thresh (datetime.datetime, optional) – e.g. datetime.datetime(2019, 6, 30). The threshold for the training and testing split. Note that the final returned model is trained using all data. If None, the training split is based on training_fraction.
training_fraction (float, optional) – The fraction of data used for training (0.0 to 1.0). Used only if train_test_thresh is None. If this is also None or 1.0, then we skip testing and train on the entire dataset.
fit_algorithm (str, optional, default “linear”) –
The type of predictive model used in fitting.
See fit_model_via_design_matrix for available options and their parameters.
fit_algorithm_params (dict or None, optional, default None) – Parameters passed to the requested fit_algorithm. If None, uses the defaults in fit_model_via_design_matrix.
daily_event_df_dict (dict or None, optional, default None) –
A dictionary of data frames, each representing events data for the corresponding key. The DataFrame has two columns:
- The first column contains event dates. Must be in a format recognized by pandas.to_datetime. Must be at daily frequency for proper join. It is joined against the time in df, converted to a day: pd.to_datetime(pd.DatetimeIndex(df[time_col]).date).
- The second column contains the event label for each date.
The column order is important; column names are ignored. The event dates must span their occurrences in both the training and future prediction period.
During modeling, each key in the dictionary is mapped to a categorical variable named f"{EVENT_PREFIX}_{key}", whose value at each timestamp is specified by the corresponding DataFrame.
For example, to manually specify a yearly event on September 1 during a training/forecast period that spans 2020-2022:
    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "label": ["is_event", "is_event", "is_event"]
        })
    }
It’s possible to specify multiple events in the same df. Two events, "sep" and "oct", are specified below for 2020-2021:
    daily_event_df_dict = {
        "custom_event": pd.DataFrame({
            "date": ["2020-09-01", "2020-10-01", "2021-09-01", "2021-10-01"],
            "event_name": ["sep", "oct", "sep", "oct"]
        })
    }
Use multiple keys if two events may fall on the same date. These events must be in separate DataFrames:
    daily_event_df_dict = {
        "fixed_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01", "2022-09-01"],
            "event_name": "fixed_event"
        }),
        "moving_event": pd.DataFrame({
            "date": ["2020-09-01", "2021-08-28", "2022-09-03"],
            "event_name": "moving_event"
        }),
    }
The multiple event specification can be used even if events never overlap. An equivalent specification to the second example:
    daily_event_df_dict = {
        "sep": pd.DataFrame({
            "date": ["2020-09-01", "2021-09-01"],
            "event_name": "is_event"
        }),
        "oct": pd.DataFrame({
            "date": ["2020-10-01", "2021-10-01"],
            "event_name": "is_event"
        }),
    }
Note
The events you want to use must be specified in extra_pred_cols. These take the form f"{EVENT_PREFIX}_{key}", where EVENT_PREFIX is the constant.
Do not use EVENT_DEFAULT in the second column. This is reserved to indicate dates that do not correspond to an event.
daily_event_neighbor_impact (int, list [int], callable or None, default None) –
The impact of neighboring timestamps of the events in event_df_dict. This is for daily events, so the units below are all in days.
For example, if the data is weekly (“W-SUN”) and an event is daily, it may not fall exactly on the weekly date. But you can specify that New Year’s Day on 1/1 affects all dates in the week, e.g. 12/31, 1/1, …, 1/6, so that it is mapped to the weekly date. In this case you may want to map a daily event’s date to a few dates, and can specify neighbor_impact=lambda x: [x-timedelta(days=x.isocalendar()[2]-1) + timedelta(days=i) for i in range(7)].
Another example is rolling 7-day daily data, where a holiday may affect the dates t, t+1, …, t+6. You can specify neighbor_impact=7.
If input is int, the mapping is t, t+1, …, t+neighbor_impact-1. If input is list, the mapping is [t+x for x in neighbor_impact]. If input is a function, it maps each daily event’s date to a list of dates.
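The three input forms (int, list, callable) can be illustrated with the standard library alone. The helper below is a hypothetical sketch of the mapping rule as stated above, not greykite's internal code; it expands a single event date into the dates it affects.

```python
from datetime import date, timedelta


def expand_event_dates_sketch(event_date, neighbor_impact):
    """Hypothetical helper: map one daily event date to the dates it impacts."""
    if neighbor_impact is None:
        return [event_date]
    if isinstance(neighbor_impact, int):
        # t, t+1, ..., t+neighbor_impact-1
        return [event_date + timedelta(days=i) for i in range(neighbor_impact)]
    if callable(neighbor_impact):
        # The function itself returns the list of impacted dates.
        return list(neighbor_impact(event_date))
    # Otherwise treat the input as a list of day offsets.
    return [event_date + timedelta(days=x) for x in neighbor_impact]


new_year = date(2021, 1, 1)  # a Friday

rolling = expand_event_dates_sketch(new_year, 7)
offsets = expand_event_dates_sketch(new_year, [-1, 0, 1])
# The lambda from the text: all seven days of the ISO week (Monday to Sunday).
whole_week = expand_event_dates_sketch(
    new_year,
    lambda x: [x - timedelta(days=x.isocalendar()[2] - 1) + timedelta(days=i)
               for i in range(7)])

print(rolling[0], rolling[-1])        # 2021-01-01 2021-01-07
print(offsets)
print(whole_week[0], whole_week[-1])  # 2020-12-28 2021-01-03
```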
daily_event_shifted_effect (list [str] or None, default None) – Additional neighbor events based on given events. For example, passing [“-1D”, “7D”] will add extra daily events which are 1 day before and 7 days after the given events. Offset format is {d}{freq} with any integer plus a frequency string. Must be parsable by pandas to_offset. The new events’ names will be the current events’ names with suffix “{offset}_before” or “{offset}_after”. For example, if we have an event named “US_Christmas Day”, a “7D” shift will have name “US_Christmas Day_7D_after”. This is useful when you expect that an offset of the current holidays also has an impact on the time series, or you want to interact the lagged terms with autoregression. If daily_event_neighbor_impact is also specified, this will be applied after adding neighboring days.
fs_components_df (pandas.DataFrame or None, optional) –
A dataframe with information about Fourier series generation. Must contain columns with the following names:
- ”name”: name of the timeseries feature, e.g. “tod”, “tow” etc.
- “period”: period of the Fourier series, optional, default 1.0
- “order”: order of the Fourier series, optional, default 1.0
- “seas_names”: season names corresponding to the name (e.g. “daily”, “weekly” etc.), optional.
Default includes daily, weekly, yearly seasonality.
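As a sketch, the default value shown in the method signature can be written out as an explicit DataFrame. The values below are read off that signature; building the frame this way is only an illustration of the expected layout.

```python
import pandas as pd

# One row per seasonality: time feature, Fourier period, Fourier order, label.
fs_components_df = pd.DataFrame({
    "name": ["tod", "tow", "toy"],
    "period": [24.0, 7.0, 1.0],
    "order": [3, 3, 5],
    "seas_names": ["daily", "weekly", "yearly"],
})
print(fs_components_df)
```

Dropping a row removes that seasonality from the specification, and raising "order" requests more Fourier terms for it.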
autoreg_dict (dict or str or None, optional, default None) –
If a dict: A dictionary with arguments for build_autoreg_df. That function’s parameter value_col is inferred from the input of the current function self.forecast. Other keys are:
- "lag_dict": dict or None
- "agg_lag_dict": dict or None
- "series_na_fill_func": callable
If a str: The string represents a method, and a dictionary is constructed using that str. Currently the only implemented method is “auto”, which uses __get_default_autoreg_dict to create a dictionary. See more details for the above parameters in build_autoreg_df.
past_df (pandas.DataFrame or None, default None) –
The past df used for building autoregression features. This is not strictly needed since imputation is possible. However, it is recommended to provide past_df for more accurate autoregression features and faster training (by skipping imputation). The columns are:
- time_col: pandas.Timestamp or str
The timestamps.
- value_col: float
The past values.
- addition_regressor_cols: float
Any additional regressors.
Note that this past_df is assumed to immediately precede df without gaps, otherwise an error will be raised.
lagged_regressor_dict (dict or None, default None) –
A dictionary with arguments for build_autoreg_df_multi. The keys of the dictionary are the target lagged regressor column names. It can leverage the regressors included in df. The value of each key is either a dict or str. If dict, it has the following keys:
- "lag_dict": dict or None
- "agg_lag_dict": dict or None
- "series_na_fill_func": callable
If str, it represents a method, and a dictionary is constructed using that str. Currently the only implemented method is “auto”, which uses __get_default_lagged_regressor_dict to create a dictionary for each lagged regressor. An example:
    lagged_regressor_dict = {
        "regressor1": {
            "lag_dict": {"orders": [1, 2, 3]},
            "agg_lag_dict": {
                "orders_list": [[7, 7 * 2, 7 * 3]],
                "interval_list": [(8, 7 * 2)]},
            "series_na_fill_func": lambda s: s.bfill().ffill()},
        "regressor2": "auto"}
Check the docstring of build_autoreg_df_multi for more details for each argument.
changepoints_dict (dict or None, optional, default None) –
Specifies the changepoint configuration.
- ”method”: str
The method to locate changepoints. Valid options:
- ”uniform”. Places n_changepoints evenly spaced changepoints to allow growth to change.
- ”custom”. Places changepoints at the specified dates.
- ”auto”. Automatically detects change points. For configuration, see find_trend_changepoints.
Additional keys to provide parameters for each particular method are described below.
- ”continuous_time_col”: str, optional
Column to apply growth_func to, to generate changepoint features. Typically, this should match the growth term in the model.
- ”growth_func”: Optional[func]
Growth function (scalar -> scalar). Changepoint features are created by applying growth_func to continuous_time_col with offsets. If None, uses the identity function to use continuous_time_col directly as the growth term.
If changepoints_dict[“method”] == “uniform”, this other key is required:
- "n_changepoints": int
Number of changepoints to evenly space across the training period.
If changepoints_dict[“method”] == “custom”, this other key is required:
- "dates": Iterable[Union[int, float, str, datetime]]
Changepoint dates. Must be parsable by pd.to_datetime. Changepoints are set at the closest time on or after these dates in the dataset.
If changepoints_dict[“method”] == “auto”, the keys that match the parameters in find_trend_changepoints, except df, time_col and value_col, are optional. Extra keys also include “dates”, “combine_changepoint_min_distance” and “keep_detected” to specify additional custom trend changepoints. These three parameters correspond to the three parameters “custom_changepoint_dates”, “min_distance” and “keep_detected” in combine_detected_and_custom_trend_changepoints.
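The three methods can be summarized as plain dictionaries. This is a configuration sketch that uses only the keys named above; the specific dates, counts and the "ct1" column choice are illustrative values, and the parameters of find_trend_changepoints are deliberately left out of the "auto" example.

```python
# "uniform": evenly spaced changepoints; "n_changepoints" is required.
uniform_changepoints = {
    "method": "uniform",
    "n_changepoints": 20,
    "continuous_time_col": "ct1",   # illustrative: match the model's growth term
}

# "custom": changepoints at user-specified dates; "dates" is required.
custom_changepoints = {
    "method": "custom",
    "dates": ["2020-03-15", "2021-01-01"],
}

# "auto": detected changepoints, optionally combined with custom dates.
auto_changepoints = {
    "method": "auto",
    "dates": ["2020-03-15"],                    # additional custom changepoints
    "combine_changepoint_min_distance": "5D",   # passed on as min_distance
    "keep_detected": False,                     # prefer the custom date on conflict
}

for cfg in (uniform_changepoints, custom_changepoints, auto_changepoints):
    print(cfg["method"], sorted(k for k in cfg if k != "method"))
```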
seasonality_changepoints_dict (dict or None, default None) – The parameter dictionary for seasonality change point detection. Parameters are in find_seasonality_changepoints. Note that df, time_col, value_col and trend_changepoints are auto-populated, and do not need to be provided.
changepoint_detector (ChangepointDetector or None, default None) – The ChangepointDetector class ChangepointDetector. This is specifically for forecast_simple_silverkite to pass the ChangepointDetector class for plotting purposes, in case users use forecast_simple_silverkite with changepoints_dict["method"] == "auto". The trend change point detection has to be run there to include possible interaction terms, so we need to pass the detection result from there to include in the output.
min_admissible_value (float or None, optional, default None) – The minimum admissible value to return during prediction. If None, no limit is applied.
max_admissible_value (float or None, optional, default None) – The maximum admissible value to return during prediction. If None, no limit is applied.
uncertainty_dict (dict or None, optional, default None) –
How to fit the uncertainty model. A dictionary with keys:
- "uncertainty_method": str
The title of the method. Only “simple_conditional_residuals” is implemented in fit_ml_model, which calculates CIs using residuals.
- "params": dict
A dictionary of parameters needed for the requested uncertainty_method. For example, for uncertainty_method="simple_conditional_residuals", see the parameters of conf_interval:
- "conditional_cols"
- "quantiles"
- "quantile_estimation_method"
- "sample_size_thresh"
- "small_sample_size_method"
- "small_sample_size_quantile"
If None, no uncertainty intervals are calculated.
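A configuration along these lines might look as follows. The key names come from the list above; every value is an illustrative assumption (in particular "dow", "normal_fit", the threshold and the quantile) and should be checked against the conf_interval documentation before use.

```python
uncertainty_dict = {
    "uncertainty_method": "simple_conditional_residuals",
    "params": {
        "conditional_cols": ["dow"],                  # assumed: condition residuals on day of week
        "quantiles": [0.025, 0.975],                  # bounds of a 95% interval
        "quantile_estimation_method": "normal_fit",   # assumed option name
        "sample_size_thresh": 5,                      # assumed threshold for a "small" slice
        "small_sample_size_method": "std_quantiles",
        "small_sample_size_quantile": 0.98,           # assumed value
    },
}

# Nominal coverage implied by the two quantiles.
lower, upper = uncertainty_dict["params"]["quantiles"]
coverage = round(upper - lower, 3)
print(coverage)  # 0.95
```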
normalize_method (str or None, default None) – If a string is provided, it will be used as the normalization method in normalize_df, passed via the argument method. Available options are: “zero_to_one”, “statistical”, “minus_half_to_half”, “zero_at_origin”. If None, no normalization will be performed. See that function for more details.
adjust_anomalous_dict (dict or None, default None) –
If not None, a dictionary with following items:
- ”func”: callable
A function to perform adjustment of anomalous data with the following signature:
    adjust_anomalous_dict["func"](
        df=df,
        time_col=time_col,
        value_col=value_col,
        **params) -> {"adjusted_df": adjusted_df, ...}
- ”params”: dict
The extra parameters to be passed to the function above.
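A function matching this signature could be as simple as masking a known bad period. The sketch below is hypothetical: `mask_period_sketch` and its `start`/`end` parameters are made up for illustration and are not part of greykite; only the call shape and the "adjusted_df" return key follow the description above.

```python
import numpy as np
import pandas as pd


def mask_period_sketch(df, time_col, value_col, start, end):
    """Hypothetical adjustment: set values inside [start, end] to NaN."""
    adjusted_df = df.copy()                     # leave the caller's df untouched
    ts = pd.to_datetime(adjusted_df[time_col])
    in_period = (ts >= pd.Timestamp(start)) & (ts <= pd.Timestamp(end))
    adjusted_df.loc[in_period, value_col] = np.nan
    return {"adjusted_df": adjusted_df}


adjust_anomalous_dict = {
    "func": mask_period_sketch,
    "params": {"start": "2020-01-02", "end": "2020-01-03"},
}

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=4, freq="D"),
    "y": [1.0, 2.0, 3.0, 4.0],
})
# Called the same way the signature above describes.
adjusted = adjust_anomalous_dict["func"](
    df=df, time_col="ts", value_col="y", **adjust_anomalous_dict["params"])
print(adjusted["adjusted_df"])
```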
impute_dict (dict or None, default None) –
If not None, a dictionary with following items:
- ”func”: callable
A function to perform imputations with the following signature:
    impute_dict["func"](
        df=df,
        value_col=value_col,
        **impute_dict["params"]) -> {"df": imputed_df, ...}
- ”params”: dict
The extra parameters to be passed to the function above.
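As a sketch of a compatible imputation function, linear interpolation with pandas fits the described call shape. `interpolate_sketch` is a hypothetical name, and the choice of interpolation is an assumption for illustration, not the library's default behaviour.

```python
import numpy as np
import pandas as pd


def interpolate_sketch(df, value_col, method="linear"):
    """Hypothetical imputation: interpolate missing values in value_col."""
    imputed_df = df.copy()                      # leave the caller's df untouched
    imputed_df[value_col] = imputed_df[value_col].interpolate(method=method)
    return {"df": imputed_df}


impute_dict = {"func": interpolate_sketch, "params": {"method": "linear"}}

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=4, freq="D"),
    "y": [1.0, np.nan, 3.0, 4.0],
})
result = impute_dict["func"](df=df, value_col="y", **impute_dict["params"])
print(result["df"]["y"].tolist())  # [1.0, 2.0, 3.0, 4.0]
```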
regression_weight_col (str or None, default None) – The column name for the weights to be used in weighted regression version of applicable machine-learning models.
forecast_horizon (int or None, default None) – The number of periods for which a forecast is needed. Note that this is only used in deciding what parameters should be used for certain components, e.g. autoregression, if automatic methods are requested. While the forecast horizon at prediction time could differ from this value, ideally they should be the same.
simulation_based (bool, default False) – Boolean to specify if the future predictions are to be using simulations or not. Note that this is only used in deciding what parameters should be used for certain components e.g. autoregression, if automatic methods are requested. However, the auto-settings and the prediction settings regarding using simulations should match.
simulation_num (int, default 10) – The number of simulations for when simulations are used for generating forecasts and prediction intervals.
fast_simulation (bool, default False) – Determines whether fast simulations are used. This only impacts models which include autoregression. This method generates only one simulation without any error added, and then adds the error using the volatility model. The advantage is a major boost in speed during inference, and the disadvantage is potentially less accurate prediction intervals.
remove_intercept (bool, default False) – Whether to remove explicit and implicit intercepts. By default, patsy will always make the design matrix full rank. It will always include an intercept term unless we specify “-1” or “+0”. However, if there are categorical variables, even if we specify “-1” or “+0”, it will include an implicit intercept by adding all levels of a categorical variable into the design matrix. Sometimes we don’t want this to happen. Setting this parameter to True will remove both explicit and implicit intercepts.
- Returns
trained_model – A dictionary that includes the fitted model from the function fit_ml_model_with_evaluation. The keys are:
- df_dropna: pandas.DataFrame
The df with NAs dropped.
- df: pandas.DataFrame
The original df.
- num_training_points: int
The number of training points.
- features_df: pandas.DataFrame
The df with augmented time features.
- min_timestamp: pandas.Timestamp
The minimum timestamp in data.
- max_timestamp: pandas.Timestamp
The maximum timestamp in data.
- freq: str
The data frequency.
- inferred_freq: str
The data frequency inferred from data.
- inferred_freq_in_secs: float
The data frequency inferred from data in seconds.
- inferred_freq_in_days: float
The data frequency inferred from data in days.
- time_col: str
The time column name.
- value_col: str
The value column name.
- origin_for_time_vars: float
The first time stamp converted to a float number.
- fs_components_df: pandas.DataFrame
The dataframe that specifies the seasonality Fourier configuration.
- autoreg_dict: dict
The dictionary that specifies the autoregression configuration.
- lagged_regressor_dict: dict
The dictionary that specifies the lagged regressors configuration.
- lagged_regressor_cols: list [str]
List of regressor column names used for lagged regressor
- normalize_method: str
The normalization method. See the function input parameter normalize_method.
- daily_event_df_dict: dict
The dictionary that specifies daily events configuration.
- changepoints_dict: dict
The dictionary that specifies changepoints configuration.
- changepoint_values: list [float]
The list of changepoints in continuous time values.
- normalized_changepoint_values: list [float]
The list of changepoints in continuous time values normalized to 0 to 1.
- continuous_time_col: str
The continuous time column name in features_df.
- growth_func: func
The growth function used in changepoints. None means a linear function.
- fs_func: func
The function used to generate Fourier series for seasonality.
- has_autoreg_structure: bool
Whether the model has autoregression structure.
- autoreg_func: func
The function to generate autoregression columns.
- min_lag_order: int
Minimal lag order in autoregression.
- max_lag_order: int
Maximal lag order in autoregression.
- has_lagged_regressor_structure: bool
Whether the model has lagged regressor structure.
- lagged_regressor_func: func
The function to generate lagged regressor columns.
- min_lagged_regressor_order: int
Minimal lag order in lagged regressors.
- max_lagged_regressor_order: int
Maximal lag order in lagged regressors.
- uncertainty_dict: dict
The dictionary that specifies uncertainty model configuration.
- pred_cols: list [str]
List of predictor names.
- last_date_for_fit: str or pandas.Timestamp
The last timestamp used for fitting.
- trend_changepoint_dates: list [pandas.Timestamp]
List of trend changepoints.
- changepoint_detector: class
The ChangepointDetector class used to detect trend changepoints.
- seasonality_changepoint_dates: list [pandas.Timestamp]
List of seasonality changepoints.
- seasonality_changepoint_result: dict
The seasonality changepoint detection results.
- fit_algorithm: str
The algorithm used to fit the model.
- fit_algorithm_params: dict
The dictionary of parameters for fit_algorithm.
- adjust_anomalous_info: dict
A dictionary that has anomaly adjustment results.
- impute_info: dict
A dictionary that has the imputation results.
- forecast_horizon: int
The forecast horizon in steps.
- forecast_horizon_in_days: float
The forecast horizon in days.
- forecast_horizon_in_timedelta: datetime.timedelta
The forecast horizon in timedelta.
- simulation_based: bool
Whether to use simulation in prediction with autoregression terms.
- simulation_num: int, default 10
The number of simulations for when simulations are used for generating forecasts and prediction intervals.
- train_df: pandas.DataFrame
The past dataframe used to generate AR terms. It includes the concatenation of past_df and df if past_df is provided, otherwise it is the df itself.
- drop_intercept_col: str or None
The intercept column, explicit or implicit, to be dropped.
- Return type
dict
- predict_no_sim(fut_df, trained_model, past_df=None, new_external_regressor_df=None, time_features_ready=False, regressors_ready=False)[source]
Performs predictions for the dates in fut_df. If extra_pred_cols refers to a column in df, either fut_df or new_external_regressor_df must contain the regressors and the columns needed for lagged regressors.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
- predict_n_no_sim(fut_time_num, trained_model, freq, new_external_regressor_df=None, time_features_ready=False, regressors_ready=False)[source]
Forecasts fut_time_num future values. It accepts extra regressors (extra_pred_cols) originally in df via new_external_regressor_df.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str) – Frequency of future predictions. Accepts any valid frequency for pd.date_range.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
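The future timestamps implied by fut_time_num and freq can be illustrated with pd.date_range. This is a minimal sketch, not the library's implementation: the helper name make_future_timestamps and the last_train_ts argument are hypothetical, and the sketch assumes the last training timestamp lies on the freq grid.

```python
import pandas as pd


def make_future_timestamps(last_train_ts, fut_time_num, freq):
    """Return the `fut_time_num` timestamps that follow `last_train_ts`.

    Hypothetical helper for illustration only.
    """
    # Generate one extra period starting at the last training timestamp,
    # then drop that first element so only future timestamps remain.
    full = pd.date_range(start=last_train_ts, periods=fut_time_num + 1, freq=freq)
    return full[1:]


future_index = make_future_timestamps(
    pd.Timestamp("2024-01-31"), fut_time_num=3, freq="D")
print(list(future_index))
```

Any frequency string accepted by pd.date_range can be passed as freq.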
- simulate(fut_df, trained_model, past_df=None, new_external_regressor_df=None, include_err=True, time_features_ready=False, regressors_ready=False)[source]
A function to simulate future series. If the fitted model supports uncertainty, e.g. via uncertainty_dict, errors are incorporated into the simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
include_err (bool) – Boolean to determine if errors are to be incorporated in the simulations.
time_features_ready (bool) – Boolean to denote if time features are already given in df or not.
regressors_ready (bool) – Boolean to denote if regressors are already added to data (fut_df).
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
- simulate_multi(fut_df, trained_model, simulation_num=10, past_df=None, new_external_regressor_df=None, include_err=None)[source]
A function to simulate future series. If the fitted model supports uncertainty, e.g. via uncertainty_dict, errors are incorporated into the simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and any regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
simulation_num (int) – The number of simulated series (each of which has the same number of rows as fut_df) to be stacked up row-wise. This number must be larger than zero.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with the following items:
- "fut_df_sim": pandas.DataFrame
Row-wise concatenation of dataframes, each being the same as the input dataframe (fut_df) with an added column for the response and a new column, "sim_label", to differentiate the simulations. The number of rows of the returned dataframe is simulation_num times the number of rows of fut_df. If value_col already appears in fut_df, it will be over-written.
- "x_mat": pandas.DataFrame
simulation_num copies of the design matrix of the predictive machine-learning model, concatenated. An extra index column ("original_row_index") is also added for aggregation when needed. Note that all copies will be the same except when auto-regression is utilized.
- Return type
dict
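The shape of "fut_df_sim" described above can be sketched with plain pandas. This is an illustration under stated assumptions rather than the library's code: stack_simulations and toy_simulate_once are hypothetical names, the simulated values are random noise, and integer "sim_label" values are an assumption (the source does not state the label format).

```python
import numpy as np
import pandas as pd


def stack_simulations(fut_df, simulate_once, simulation_num=10):
    """Stack `simulation_num` simulated copies of `fut_df` row-wise."""
    if simulation_num <= 0:
        raise ValueError("simulation_num must be larger than zero")
    frames = []
    for i in range(simulation_num):
        sim_df = simulate_once(fut_df).copy()
        # Label each simulated series so the copies can be told apart later.
        sim_df["sim_label"] = i
        frames.append(sim_df)
    return pd.concat(frames, axis=0, ignore_index=True)


rng = np.random.default_rng(0)


def toy_simulate_once(fut_df):
    # Hypothetical stand-in for a single simulation: adds a random response.
    out = fut_df.copy()
    out["y"] = rng.normal(size=len(out))
    return out


fut_df = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=4, freq="D")})
fut_df_sim = stack_simulations(fut_df, toy_simulate_once, simulation_num=3)
print(fut_df_sim.shape)
```

The stacked frame has simulation_num times as many rows as fut_df, matching the row count stated for "fut_df_sim".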
- predict_via_sim(fut_df, trained_model, past_df=None, new_external_regressor_df=None, simulation_num=10, include_err=None)[source]
Performs predictions and calculates uncertainty using multiple simulations.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame, optional) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame, optional) – Contains the regressors not already included in fut_df.
simulation_num (int, optional, default 10) – The number of simulated series to be used in prediction.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
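One way to turn stacked simulations into a point prediction with uncertainty is to aggregate per timestamp. The sketch below assumes a mean for the point forecast, empirical quantiles for the interval and a sample standard deviation for the error; the source does not specify the aggregation used, and summarize_simulations and its output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_simulations(fut_df_sim, time_col, value_col, quantiles=(0.025, 0.975)):
    """Aggregate stacked simulations into one row per timestamp."""
    grouped = fut_df_sim.groupby(time_col)[value_col]
    # Point prediction: mean across simulations at each timestamp.
    summary = grouped.mean().to_frame(value_col)
    # Empirical quantiles across simulations (assumed interval construction).
    for q in quantiles:
        summary[f"q_{q}"] = grouped.quantile(q)
    # Spread of the simulated values at each timestamp.
    summary["err_std"] = grouped.std()
    return summary.reset_index()


rng = np.random.default_rng(1)
ts = pd.date_range("2024-01-01", periods=3, freq="D")
fut_df_sim = pd.DataFrame({
    "ts": np.tile(ts.values, 50),                       # 50 simulated series
    "y": rng.normal(loc=5.0, scale=1.0, size=150),      # toy simulated responses
})
summary = summarize_simulations(fut_df_sim, time_col="ts", value_col="y")
print(summary)
```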
- predict_via_sim_fast(fut_df, trained_model, past_df=None, new_external_regressor_df=None)[source]
Performs predictions and calculates uncertainty using one simulation of the future, calculating the error separately (not relying on multiple simulations). Because of this, the prediction intervals well into the future will be narrower than those of predict_via_sim and therefore less accurate. However, there is a major speed gain, which might be important in some use cases.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
past_df (pandas.DataFrame or None, default None) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py.
new_external_regressor_df (pandas.DataFrame or None, default None) – Contains the regressors not already included in fut_df.
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- "features_df": pandas.DataFrame
The features dataframe used for prediction.
- Return type
dict
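Why a single simulation with separately added error tends to give narrower intervals far into the future can be seen on a toy AR(1) process. This sketch is not Silverkite's volatility model: the process, the parameters phi and sigma, and the constant "fast" error are all assumptions made for illustration.

```python
import numpy as np


def simulate_ar1_paths(phi, sigma, horizon, simulation_num, rng):
    """Simulate AR(1) paths y_t = phi * y_{t-1} + e_t, starting from zero."""
    paths = np.zeros((simulation_num, horizon))
    prev = np.zeros(simulation_num)
    for h in range(horizon):
        # Each step feeds the previous simulated value (with its error) forward.
        prev = phi * prev + rng.normal(scale=sigma, size=simulation_num)
        paths[:, h] = prev
    return paths


rng = np.random.default_rng(42)
phi, sigma, horizon = 0.8, 1.0, 10
paths = simulate_ar1_paths(phi, sigma, horizon, simulation_num=2000, rng=rng)

# Multiple simulations: the spread grows with the horizon as errors accumulate.
multi_sim_std = paths.std(axis=0)
# Assumed "fast" alternative: one error-free path plus a fixed one-step error.
fast_std = np.full(horizon, sigma)

print(np.round(multi_sim_std, 2))
print(fast_std)
```

At the first step both approaches give roughly the same spread; at later steps the simulated spread exceeds the fixed one-step error, which is the accuracy cost traded for speed.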
- predict_n_via_sim(fut_time_num, trained_model, freq, new_external_regressor_df=None, simulation_num=10, fast_simulation=False, include_err=None)[source]
This is the forecast function which can be used to forecast. This function's predictions are constructed using simulations from the fitted series. This supports both predict_silverkite_via_sim and predict_silverkite_via_sim_fast, depending on the value of the passed argument fast_simulation. The past_df is set to be the training data, which is available in trained_model. It accepts extra regressors (extra_pred_cols) originally in df via new_external_regressor_df.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str) – Frequency of future predictions. Accepts any valid frequency for pd.date_range.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
simulation_num (int, optional, default 10) – The number of simulated series to be used in prediction.
fast_simulation (bool, default False) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
- predict(fut_df, trained_model, freq=None, past_df=None, new_external_regressor_df=None, include_err=None, force_no_sim=False, simulation_num=None, fast_simulation=None, na_fill_func=<function SilverkiteForecast.<lambda>>)[source]
Performs predictions using a silverkite model. It determines whether the prediction should be simulation-based and then predicts using that setting. Here is the logic for determining if simulations are needed:
If the model is not autoregressive, then no simulations are needed.
If the model is autoregressive but the minimum lag appearing in the model is at least as large as the forecast horizon (min_lag_order >= forecast_horizon), then simulations are not needed. This is because the lags can be calculated fully without predicting the future.
The user can override the above behavior and force no simulations using the force_no_sim argument, in which case some lags will be imputed. This option should not be used by most users. Scenarios where an advanced user might want to use it are: (a) min_lag_order >= forecast_horizon does not hold strictly but is close to holding; (b) the user wants to predict fast and the autoregression lags are normalized. In that case the predictions returned could correspond to an approximation of a model without autoregression.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str, optional, default None) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed strings. If None, it is extracted from the trained_model input.
past_df (pandas.DataFrame or None, default None) – A data frame with past values if autoregressive methods are called via the autoreg_dict parameter of greykite.algo.forecast.silverkite.SilverkiteForecast.py. Note that this past_df can cover any time before the training end timestamp, but cannot exceed it.
new_external_regressor_df (pandas.DataFrame or None, default None) – Contains the regressors not already included in fut_df.
include_err (bool, optional, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
force_no_sim (bool, default False) – If True, prediction with no simulations is forced. This can be useful when speed is of concern or for validation purposes. In this case, the potential non-available lags will be imputed. Most users should not set this to True as the consequences could be hard to quantify.
simulation_num (int or None, default None) – The number of simulations for when simulations are used for generating forecasts and prediction intervals. If None, it will be inferred from the model (trained_model).
fast_simulation (bool or None, default None) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. If None, it will be inferred from the model (trained_model).
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps in fut_df, i.e. when the user wants to predict a period which does not immediately follow the training period. In that case, the regressors need to be interpolated/filled to fill in the gaps. The default works by first interpolating the continuous variables. Then it uses back-filling and then forward-filling for categorical variables.
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
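The decision rule described for predict (no simulation unless the model is autoregressive and its minimum lag is smaller than the forecast horizon, with force_no_sim as an override) can be written as a small predicate. This is a sketch of the stated logic only: use_simulation and its arguments are hypothetical names, not part of the library.

```python
def use_simulation(is_autoregressive, min_lag_order, forecast_horizon, force_no_sim=False):
    """Return True if simulation-based prediction is needed under the stated rules."""
    if force_no_sim:
        # The user override wins; unavailable lags would have to be imputed.
        return False
    if not is_autoregressive:
        # No lagged response terms, so nothing depends on future predictions.
        return False
    if min_lag_order >= forecast_horizon:
        # Every lag needed over the horizon is already observed.
        return False
    return True


print(use_simulation(False, None, 7))                  # non-autoregressive model
print(use_simulation(True, 7, 7))                      # lags fully observed
print(use_simulation(True, 1, 7))                      # lags depend on predictions
print(use_simulation(True, 1, 7, force_no_sim=True))   # forced off by the user
```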
- predict_n(fut_time_num, trained_model, freq=None, past_df=None, new_external_regressor_df=None, include_err=None, force_no_sim=False, simulation_num=None, fast_simulation=None, na_fill_func=<function SilverkiteForecast.<lambda>>)[source]
This is the forecast function which can be used to forecast a number of periods into the future. It determines whether the prediction should be simulation-based and then predicts using that setting. Currently, if the silverkite model uses autoregression, simulation-based predictions/CIs are used.
- Parameters
fut_time_num (int) – Number of needed future values.
trained_model (dict) – A fitted silverkite model which is the output of self.forecast.
freq (str, optional, default None) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed frequencies. If None, it is extracted from the trained_model input.
new_external_regressor_df (pandas.DataFrame or None) – Contains the extra regressors if specified.
simulation_num (int or None, default None) – The number of simulated series to be used in prediction.
fast_simulation (bool or None, default None) – Determines if fast simulations are to be used. This only impacts models which include auto-regression. This method will only generate one simulation without any error being added and then add the error using the volatility model. The advantage is a major boost in speed during inference and the disadvantage is potentially less accurate prediction intervals. If None, it will be inferred from the model (trained_model).
include_err (bool or None, default None) – Boolean to determine if errors are to be incorporated in the simulations. If None, it will be set to True if uncertainty is passed to the model and otherwise will be set to False.
force_no_sim (bool, default False) – If True, prediction with no simulations is forced. This can be useful when speed is of concern or for validation purposes.
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps. In that case, the regressors need to be interpolated/filled to fill in the gaps. The default works by first interpolating the continuous variables. Then it uses back-filling and then forward-filling for categorical variables.
- Returns
result – A dictionary with the following items:
- "fut_df": pandas.DataFrame
The same as input dataframe with an added column for the response. If value_col already appears in fut_df, it will be over-written. If uncertainty_dict is provided as input, it will also contain a QUANTILE_SUMMARY_COL column. Here are the expected columns:
  - A time column with the column name being trained_model["time_col"].
  - The predicted response in the value_col column.
  - Quantile summary response in the QUANTILE_SUMMARY_COL column. This column only appears if the model includes uncertainty.
  - Error std in the ERR_STD_COL column. This column only appears if the model includes uncertainty.
- "x_mat": pandas.DataFrame
Design matrix of the predictive machine-learning model.
- Return type
dict
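The default na_fill_func quoted in the parameter list can be tried on a small frame to see what each step contributes. The regressor column name and values below are made up for illustration, and only a numeric column is used.

```python
import numpy as np
import pandas as pd

# The default documented for na_fill_func.
na_fill_func = lambda df: df.interpolate().bfill().ffill()

# Hypothetical regressor with a leading, an interior and a trailing gap.
df = pd.DataFrame({"temperature": [np.nan, 10.0, np.nan, 14.0, np.nan]})

filled = na_fill_func(df)
print(filled["temperature"].tolist())
# -> [10.0, 10.0, 12.0, 14.0, 14.0]
```

Linear interpolation fills the interior gap (and carries the last valid value forward), back-filling covers the leading gap, and the final forward-fill catches anything still missing.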
- partition_fut_df(fut_df, trained_model, freq, na_fill_func=<function SilverkiteForecast.<lambda>>)[source]
This function takes a dataframe fut_df which includes the timestamps to forecast and a trained_model returned by forecast, and decomposes fut_df into various dataframes which reflect whether the timestamps are before, during or after the training period. It also determines whether the future timestamps after the training period immediately follow the last training period or whether there is some extra gap. In that case, this function creates an expanded dataframe which includes the missing timestamps as well. If fut_df also includes extra columns (they could be regressor columns), this function will interpolate the extra regressor columns.
- Parameters
fut_df (pandas.DataFrame) – The data frame which includes the timestamps for prediction and possibly regressors. Note that the timestamp column in fut_df must be the same as trained_model["time_col"]. We assume fut_df[time_col] is pandas.datetime64 type.
trained_model (dict) – A fitted silverkite model which is the output of forecast.
freq (str) – Timeseries frequency, DateOffset alias. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for the allowed frequencies.
na_fill_func (callable (pd.DataFrame -> pd.DataFrame)) – default: lambda df: df.interpolate().bfill().ffill()
A function which interpolates missing values in a dataframe. It is mainly used when there is a gap between the timestamps. In that case, the regressors need to be interpolated/filled to fill in the gaps. The default works by first interpolating the continuous variables. Then it uses back-filling and then forward-filling for categorical variables.
- Returns
result – A dictionary with the following items:
- "fut_freq_in_secs": float
The inferred frequency in fut_df.
- "training_freq_in_secs": float
The inferred frequency in training data.
- "index_before_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is before the training start.
- "index_within_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is during the training period.
- "index_after_training": list [bool]
A boolean list to determine which rows of fut_df include a time which is after the training end date.
- "fut_df_before_training": pandas.DataFrame
A partition of fut_df with timestamps before the training start date.
- "fut_df_within_training": pandas.DataFrame
A partition of fut_df with timestamps during the training period.
- "fut_df_after_training": pandas.DataFrame
A partition of fut_df with timestamps after the training end date.
- "fut_df_gap": pandas.DataFrame or None
If there is a gap between the training end date and the first timestamp after the training end date in fut_df, this dataframe can fill the gap between the two. In case fut_df includes extra columns as well, the values for those columns will be filled using na_fill_func.
- "fut_df_after_training_expanded": pandas.DataFrame
If there is a gap between the training end date and the first timestamp after the training end date in fut_df, this dataframe will include the data for the gaps (fut_df_gap) as well as fut_df_after_training.
- "index_after_training_original": list [bool]
A boolean list to determine which rows of fut_df_after_training_expanded correspond to raw data passed by the user which are after the training end date, appearing in fut_df. Note that this partition corresponds to fut_df_after_training, which is the subset of data in fut_df provided by the user and also returned by this function.
- "missing_periods_num": int
Number of missing timestamps between the last date of training and the first date in fut_df appearing after the training end date.
- "inferred_forecast_horizon": int
This is the inferred forecast horizon from fut_df. It is defined to be the distance between the training end date and the last date appearing in fut_df. Note that this value can be smaller or larger than the number of rows of fut_df. It is calculated by adding the number of potentially missing timestamps and the number of time periods appearing after the training end point. Also note that if there are no timestamps after the training end point in fut_df, this value will be zero.
- "forecast_partition_summary": dict
A dictionary which includes the size of various partitions of fut_df as well as the missing timestamps if needed. The dictionary keys are as follows:
  - "len_before_training": the number of time periods before training start.
  - "len_within_training": the number of time periods within training.
  - "len_after_training": the number of time periods after training.
  - "len_gap": the number of missing time periods between training data and future timestamps in fut_df.
- Return type
dict
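The partitioning described above can be approximated in a few lines of pandas. This is a simplified sketch under stated assumptions, not the library's implementation: partition_by_training_period is a hypothetical helper, the training start and end are passed explicitly instead of being read from trained_model, and only a subset of the documented keys is produced.

```python
import pandas as pd


def partition_by_training_period(fut_df, time_col, train_start, train_end, freq):
    """Split `fut_df` rows into before / within / after training and find the gap."""
    ts = fut_df[time_col]
    before = ts < train_start
    within = (ts >= train_start) & (ts <= train_end)
    after = ts > train_end

    gap_index = pd.DatetimeIndex([])
    if after.any():
        first_after = ts[after].min()
        # Timestamps on the training grid strictly between the training end
        # and the first requested future timestamp are the missing periods.
        grid = pd.date_range(start=train_end, end=first_after, freq=freq)
        gap_index = grid[(grid > train_end) & (grid < first_after)]

    return {
        "index_before_training": before.tolist(),
        "index_within_training": within.tolist(),
        "index_after_training": after.tolist(),
        "fut_df_gap": pd.DataFrame({time_col: gap_index}),
        "missing_periods_num": len(gap_index),
        # Missing periods plus the periods requested after training.
        "inferred_forecast_horizon": len(gap_index) + int(after.sum()),
    }


fut_df = pd.DataFrame({"ts": pd.to_datetime(
    ["2023-12-31", "2024-01-05", "2024-01-13", "2024-01-14"])})
result = partition_by_training_period(
    fut_df,
    time_col="ts",
    train_start=pd.Timestamp("2024-01-01"),
    train_end=pd.Timestamp("2024-01-10"),
    freq="D")
print(result["missing_periods_num"], result["inferred_forecast_horizon"])
```

In this toy case two days (January 11 and 12) are missing between the training end and the first future timestamp, so the inferred horizon is four even though only two rows of fut_df fall after training.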