Examine Input Data

Expected format

Your input df should be a pandas DataFrame with a time column and a value column.

  • The time column can have any format recognized by pandas.to_datetime

  • The value column should be numeric and non-negative. Missing values are allowed.

Include regressors as columns in df. Forecasts will start after the date of the last non-null observation in the value column. The regressors must be available past this date to create a forecast.

Your df will look like this:

(x) data point
(-) missing data

time_col    value_col    regressor1_col    regressor2_col
x           x            x                 x
x           -            x                 x       <- missing values okay; let Greykite handle imputation
x           x            -                 x       <- missing values okay; let Greykite handle imputation
x           x            x                 -       <- missing values okay; let Greykite handle imputation
x           x            x                 x
x           x            x                 x
-           -            -                 -       <- missing values okay; let Greykite handle imputation
x           x            x                 x
x           x            x                 x
x           -            x                 x       <- forecast start date, continue to provide regressors
x           -            x                 x
x           -            -                 x       <- Greykite will impute regressor1
x           -            x                 x
x           -            x                 -       <- Greykite will impute regressor2
x           -            x                 x
x           -            -                 -       <- Greykite will impute regressor1 and regressor2
x           -            x                 x
x           -            x                 x       <- last date for prediction (no regressors after this point)
x           -            -                 -       <- no prediction
x           -            -                 -       <- no prediction
x           -            -                 -       <- no prediction

note: for clarity, this diagram shows time_col sorted in ascending order. This is
not required for your input data.

Note

The input data can have any frequency: hourly, daily, weekly, monthly, every 6 hours, etc.
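
For example, each of these produces a valid time column (a quick sketch; the frequency aliases are standard pandas offset aliases):

import pandas as pd

pd.date_range(start="2020-01-01", periods=3, freq="D")   # daily
pd.date_range(start="2020-01-01", periods=3, freq="W")   # weekly
pd.date_range(start="2020-01-01", periods=3, freq="6H")  # every 6 hours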

Note

Greykite handles missing values and even missing timestamps. You should not impute the missing values on your own.

Greykite’s imputation methods prevent leakage of future information into the past during time-series cross validation and backtesting. They allow imputation based on future values, but only within each training set. See Pre-processing, Selective Grid Search.

Note

As a rule of thumb, provide at least twice as much training data as you intend to forecast.

A more nuanced answer considers seasonality: you need a few full seasonality cycles to properly model the seasonality patterns and distinguish them from other terms (e.g. growth). For example, if yearly seasonality is important to your problem, then provide at least 2 years of data.

You can still create a forecast with fewer data points, e.g. 1 year for yearly seasonality, but it may be more challenging to train a good model or do proper historical validation.
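
To make the arithmetic concrete (a sketch with hypothetical numbers):

# Rule of thumb: training size >= 2 * forecast horizon.
forecast_horizon = 30                        # days to forecast (hypothetical)
min_training_points = 2 * forecast_horizon   # at least 60 daily observations
# If yearly seasonality matters, prefer >= 2 full cycles (~730 daily points).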

Tip

Sometimes you may have a dataframe with additional columns not relevant to the forecast.

When creating a forecast, subset the df to the relevant columns: df[[time_col, value_col, regressor_col1, regressor_col2, ...]].
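
For instance, a minimal sketch with hypothetical column names:

# "ts", "y", and the regressor names are placeholders for your own columns.
relevant_cols = ["ts", "y", "regressor1", "regressor2"]
df = df[relevant_cols]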

Examples:

import numpy as np
import pandas as pd

# no regressors
df = pd.DataFrame({
    "ts": pd.date_range(start="2018-01-03", periods=400, freq="D"),
    "value": np.random.normal(size=400)
})

# with regressors
df = pd.DataFrame({
    "ts": pd.date_range(start="2018-01-03-00", periods=5, freq="H"),
    "value": [1.0, 2.0, 3.0, None, None],
    "regressor1": [0.19, None, 0.14, 0.16, 0.17],
    "regressor2": [1.18, 1.12, 1.14, 1.16, None],
    "regressor3": [2.17, 2.12, 2.14, 2.16, 2.17]
})

Inspect data

We will use UnivariateTimeSeries to inspect the data.

Note

While the following steps are not necessary to create a forecast, it's always helpful to know what your data looks like.

Greykite provides functions to visualize your input timeseries and examine its trend, seasonality, and holiday effects.

Load data

Make sure your data loads correctly. First, check the printed logs of load_data.

from greykite.framework.input.univariate_time_series import UnivariateTimeSeries

ts = UnivariateTimeSeries()
ts.load_data(
    df=df,
    time_col="ts",
    value_col="y",
    freq="D")  # optional, but recommended if you have missing data points
               # W for weekly, D for daily, H for hourly, etc. See ``pd.date_range``

Here is some example logging info for hourly data. The loaded data spans 2017-10-11 to 2020-02-23. 11 missing dates were added.

INFO:root:Added 11 missing dates. There were 20773 values originally.
INFO:root:Input time stats:
INFO:root:  data points: 20784
INFO:root:  avg increment (sec): 3600.00
INFO:root:  start date: 2017-10-11 00:00:00
INFO:root:  end date: 2020-02-23 23:00:00
INFO:root:Input value stats:
INFO:root:count     20773.000000
mean     234249.356472
std       30072.193941
min        9169.000000
25%      191494.000000
50%      234046.000000
75%      242572.000000
max      344832.000000
Name: y, dtype: float64
INFO:root:  last date for fit: 2020-02-23 23:00:00
INFO:root:  columns available to use as regressors: []
INFO:root:  last date for regressors:

Alternatively, if you already have a forecast from Forecaster, the time series is included in the result.

from greykite.framework.templates.autogen.forecast_config import ForecastConfig
from greykite.framework.templates.autogen.forecast_config import MetadataParam
from greykite.framework.templates.forecaster import Forecaster
from greykite.framework.templates.model_templates import ModelTemplateEnum

metadata = MetadataParam(
    time_col="ts",
    value_col="y",
    freq="D"
)
forecaster = Forecaster()
result = forecaster.run_forecast_config(
    df=df,  # input data
    config=ForecastConfig(
        model_template=ModelTemplateEnum.AUTO.name,
        metadata_param=metadata,
        forecast_horizon=30,
        coverage=0.95
    )
)
ts = result.timeseries  # a `UnivariateTimeSeries`

You can also check the information programmatically:

print(ts.time_stats)         # time statistics
print(ts.value_stats)        # value statistics
print(ts.freq)               # frequency
print(ts.regressor_cols)     # available regressors
print(ts.last_date_for_fit)  # last date with value_col
print(ts.last_date_for_reg)  # last date for any regressor
print(ts.df.head())          # the standardized dataset for forecasting
print(ts.fit_df.head())      # the standardized dataset for fitting and historical evaluation

Simple plot

The best way to check your data is to plot it. You can do this interactively in a Jupyter notebook.

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)   # for generating offline graphs within Jupyter Notebook

fig = ts.plot()
iplot(fig)

Anomalies

An anomaly is a deviation in the metric that is not expected to occur again in the future. Check for anomalies using ts.plot() and label them before forecasting.

You may label anomalies by passing anomaly_info to load_data(). An anomaly in a timeseries is defined by its time period (start, end). If you are able to estimate the hypothetical value had the anomaly not occurred, you may specify an adjustment to get this corrected value. Otherwise, the values during the anomalous period will simply be masked and properly handled when forecasting. It is important to provide the anomaly information, rather than correcting the data yourself.

  • ts.df contains the values after adjustments, and ts.df_before_adjustment contains the values before adjustment.

  • The plot function has an option to show the anomaly adjustment (show_anomaly_adjustment=True).

  • The same anomaly_info can be used in the forecast configuration. See Anomaly Configuration.

For example:

import numpy as np
import pandas as pd

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)   # for generating offline graphs within Jupyter Notebook

import greykite.common.constants as cst
from greykite.framework.input.univariate_time_series import UnivariateTimeSeries

# Suppose 30.0 is an anomaly in "y" and we know the value should be lowered by 27.
# Suppose 20.17 and 20.12 are anomalies in "regressor3" and we don't know the true values,
# so they should be replaced with np.nan.
df = pd.DataFrame({
    "ts": ["2018-07-13", "2018-07-14", "2018-07-15", "2018-07-16", "2018-07-17"],
    "y": [1.0, 2.0, 30.0, None, None],
    "regressor1": [0.19, None, 0.14, 0.16, 0.17],
    "regressor2": [1.18, 1.12, 1.14, 1.16, None],
    "regressor3": [20.17, 20.12, 2.14, 2.16, 2.17]
})
# The corrected df should look like this:
# df_adjusted = pd.DataFrame({
#     "ts": ["2018-07-13", "2018-07-14", "2018-07-15", "2018-07-16", "2018-07-17"],
#     "y": [1.0, 2.0, 3.0, None, None],
#     "regressor1": [0.19, None, 0.14, 0.16, 0.17],
#     "regressor2": [1.18, 1.12, 1.14, 1.16, None],
#     "regressor3": [None, None, 2.14, 2.16, 2.17]
# })

# Specify anomalies using ``anomaly_df``.
# Each row corresponds to an anomaly. The start date, end date,
# and impact (if known) are provided. Extra columns can be
# used to annotate information such as which metrics the
# anomaly applies to.
anomaly_df = pd.DataFrame({
    # start and end date are inclusive
    cst.START_TIME_COL: ["2018-07-15", "2018-07-13"],  # inclusive
    cst.END_TIME_COL: ["2018-07-15", "2018-07-14"],    # inclusive
    cst.ADJUSTMENT_DELTA_COL: [-27, np.nan],
    cst.METRIC_COL: ["y", "regressor3"]
})
# ``anomaly_info`` dictates which columns
# in ``df`` to correct (``value_col`` below), and which rows
# in ``anomaly_df`` to use to correct them.
# Rows are filtered using ``filter_by_dict``.
anomaly_info = [
    {
        "value_col": "y",
        "anomaly_df": anomaly_df,
        "adjustment_delta_col": cst.ADJUSTMENT_DELTA_COL,
        "filter_by_dict": {cst.METRIC_COL: "y"},
    },
    {
        "value_col": "regressor3",
        "anomaly_df": anomaly_df,
        "adjustment_delta_col": cst.ADJUSTMENT_DELTA_COL,
        "filter_by_dict": {cst.METRIC_COL: "regressor3"},
    },
]

# Pass ``anomaly_info`` to ``load_data``.
# Since our dataset has regressors, we pass ``regressor_cols`` as well.
ts = UnivariateTimeSeries()
ts.load_data(
    df=df,
    time_col="ts",
    value_col="y",
    freq="D",
    regressor_cols=["regressor1", "regressor2", "regressor3"],
    anomaly_info=anomaly_info)

# Plots the dataset after correction
fig = ts.plot()
iplot(fig)
# Set show_anomaly_adjustment=True to show the dataset before correction
fig = ts.plot(show_anomaly_adjustment=True)
iplot(fig)
# The results are stored as attributes.
ts.df                    # dataset after correction (same as ``df_adjusted`` above)
ts.df_before_adjustment  # dataset before correction (same as ``df`` above)

Check trend

Plot your data over time to see how it trends.

If you have daily or hourly data, it helps to aggregate. For example, look at weekly averages.

import numpy as np

value_col = "y"  # name of the value column, used in the plot titles below

# aggregate daily data to weekly
fig = ts.plot_grouping_evaluation(
    aggregation_func=np.mean,  # any aggregation function you want
    aggregation_func_name="mean",
    groupby_time_feature=None,
    groupby_sliding_window_size=7,  # any aggregation window you want
                                    # (7*24 for weekly aggregation of hourly data)
    groupby_custom_column=None,
    title=f"Weekly average of {value_col}")
iplot(fig)

For a more detailed examination, including automatic changepoint detection, see Changepoint Detection.
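
If you want a quick preview, here is a minimal sketch using Greykite's ChangepointDetector with default parameters. It assumes the standardized column names "ts" and "y" produced by load_data; treat it as an illustration rather than a tuned configuration:

from greykite.algo.changepoint.adalasso.changepoint_detector import ChangepointDetector

model = ChangepointDetector()
res = model.find_trend_changepoints(
    df=ts.df,       # standardized dataset from ``UnivariateTimeSeries``
    time_col="ts",
    value_col="y")
print(res["trend_changepoints"])  # detected trend changepoint dates
fig = model.plot(plot=False)      # plotly figure showing the changepoints
iplot(fig)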

Check seasonality

Look for cyclical patterns in your data (i.e. seasonality).

For example, daily seasonality is a pattern that repeats once per day. To check daily seasonality, aggregate by hour of day and plot the average:

fig = ts.plot_grouping_evaluation(
    aggregation_func=np.mean,
    aggregation_func_name="mean",
    groupby_time_feature="hour",  # hour of day
    groupby_sliding_window_size=None,
    groupby_custom_column=None,
    title=f"daily seasonality: mean of {value_col}")
iplot(fig)

To check weekly seasonality, group by day of week.

fig = ts.plot_grouping_evaluation(
    aggregation_func=np.mean,
    aggregation_func_name="mean",
    groupby_time_feature="str_dow",  # day of week
    groupby_sliding_window_size=None,
    groupby_custom_column=None,
    title=f"weekly seasonality: mean of {value_col}")
iplot(fig)

To check yearly seasonality, group by week of year.

fig = ts.plot_grouping_evaluation(
    aggregation_func=np.mean,
    aggregation_func_name="mean",
    groupby_time_feature="woy",  # week of year
    groupby_sliding_window_size=None,
    groupby_custom_column=None,
    title=f"yearly seasonality: mean of {value_col}")
iplot(fig)

For other features to group by, see build_time_features_df.
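
To preview the available features, you can build the feature table directly (a sketch; conti_year_origin anchors the continuous-year feature, and 2018 is an arbitrary choice here):

from greykite.common.features.timeseries_features import build_time_features_df

# Builds a DataFrame of time features ("hour", "str_dow", "woy", etc.)
# from the standardized time column.
time_features = build_time_features_df(
    dt=ts.df["ts"],
    conti_year_origin=2018)
print(time_features.columns)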

For a more detailed examination using a more powerful plotting function, see Seasonality Plots.