The purpose of this vignette is to provide a quick overview of forecasting with multiple time-series in forecastML
. The benefits to modeling multiple time-series in one go with a single model or ensemble of models include (a) modeling simplicity, (b) potentially more robust results from pooling data across time-series, and (c) solving the cold-start problem when few data points are available for a given time-series.
We won’t concern ourselves with error metrics or hyperparameter evaluation which can be assessed with the return_error()
and return_hyper()
functions. This functionality is covered in the package overview vignette.
To forecast with multiple/grouped/hierarchical time-series in forecastML
, your data needs the following characteristics:
The same outcome is being forecasted across time-series.
Data are in a long format with a single outcome column–i.e., time-series are stacked on top of each other in a data.frame.
There are 1 or more grouping columns.
There may be 1 or more static features that are constant through time but differ between time-series–e.g., a fixed location, store square footage, species of animal etc.
The time-series are regularly spaced and have no missing rows or gaps in time. Irregular or sparse time-series with many NA
s can be modeled in this framework, but missing rows will result in incorrect feature lags when using create_lagged_df()
which is the first step in the forecastML
workflow. To fix any gaps in data collection, use the fill_gaps()
function; handling the resulting missing values in the target being forecasted and the dynamic features can be done (a) prior to create_lagged_df()
or (b) in the user-defined model training function.
To illustrate forecasting with multiple time-series, we’ll use the data_buoy
dataset that comes with the forecastML
package. This dataset consists of daily sensor measurements of several environmental conditions collected by 14 buoys in Lake Michigan from 2012 through 2018. The data were obtained from NOAA’s National Buoy Data Center available at https://www.ndbc.noaa.gov/ using the rnoaa
package.
Outcome: Average daily wind speed in Lake Michigan.
Forecast horizon: Daily, 1 to 30 days into the future which is essentially January 2019 for this dataset.
Time-series: 14 outcome time-series collected from buoys throughout Lake Michigan.
Model: A single gradient boosted tree model with xgboost
.
data_buoy_gaps
consists of:
date
: A date column which will be removed for modeling.
wind_spd
: The outcome.
lat
and lon
: Latitude and longitude which are features that are static or unchanging through time.
day
and year
: Dynamic features encoding time.
air_temperature
and sea_surface_temperature
: Data collected from the buoys through time.
The wind speed data has some gaps in it: Some buoys collected data throughout the year, others only during the summer months. These gaps in data collection would result in incorrect feature lags in create_lagged_df()
as the previous row in the dataset for a given buoy–a lag of 1–may be several months in the past.
To fix this problem, we’ll run fill_gaps()
to fill in the rows for the missing dates. The added rows will appear between min(date)
for each buoy and and max(date)
across all buoys. For example, buoy 45186 that only started data collection in 2018 won’t have additional rows with NA
s for 2012 through 2017; only gaps since the start of data collection in 2018 to the most recent date will be filled.
After running fill_gaps()
, the following columns have been filled in and have no NA
s: date
, buoy_id
, lat
, and lon
.
After running fill_gaps()
, the following columns now have additional NA
s: our wind_spd
target and the dynamic features.
Notice that the input dataset and the returned dataset have the same columns in the same order with the same data types.
data <- forecastML::fill_gaps(data_buoy_gaps, date_col = 1, frequency = '1 day',
groups = 'buoy_id', static_features = c('lat', 'lon'))
print(list(paste0("The original dataset with gaps in data collection is ", nrow(data_buoy_gaps), " rows."),
paste0("The modified dataset with no gaps in data collection from fill_gaps() is ", nrow(data), " rows.")))
## [[1]]
## [1] "The original dataset with gaps in data collection is 23646 rows."
##
## [[2]]
## [1] "The modified dataset with no gaps in data collection from fill_gaps() is 31225 rows."
xgboost
model handle these NA
s.p <- ggplot(data, aes(x = date, y = wind_spd, color = ordered(buoy_id), group = year))
p <- p + geom_line()
p <- p + facet_wrap(~ ordered(buoy_id), scales = "fixed")
p <- p + theme_bw() + theme(
legend.position = "none"
) + xlab(NULL)
p
We’ll simply and incorrectly set our grouping column, buoy_id
, to numeric to work smoothly with xgboost
. Better alternatives include feature embedding, target encoding (available in the R
package catboost
), or mixed effects Random Forests.
To be clear, buoy_id
is both (a) used to identify a specific time-series for creating lagged features and (b) used as a feature in the model.
create_lagged_df()
.outcome_col <- 1 # The column position of our 'wind_spd' outcome.
horizons <- c(1, 7, 30) # Forecast 1, 1:7, and 1:30 days into the future.
lookback <- c(1:30, 360:370) # Features from 1 to 30 days in the past and annually.
dates <- data$date # Grouped time series forecasting requires dates.
data$date <- NULL # Dates, however, don't need to be in the input data.
frequency <- "1 day" # A string that works in base::seq(..., by = "frequency").
dynamic_features <- c("day", "year") # Features that change through time but which we will not lag.
groups <- "buoy_id" # 1 forecast for each group or buoy.
static_features <- c("lat", "lon") # Features that do not change through time.
type <- "train" # Create a model-training dataset.
data_train <- forecastML::create_lagged_df(data, type = type, outcome_col = outcome_col,
horizons = horizons, lookback = lookback,
dates = dates, frequency = frequency,
dynamic_features = dynamic_features,
groups = groups, static_features = static_features,
use_future = FALSE)
print(paste0("The class of `data_train` is ", class(data_train)))
## [1] "The class of `data_train` is grouped_lagged_df"
## [2] "The class of `data_train` is lagged_df"
## [3] "The class of `data_train` is list"
p <- plot(data_train) # plot.lagged_df() returns a ggplot object.
p <- p + geom_tile(NULL) # Remove the gray border for a cleaner plot.
p
skip = 730
argument to skip 2 years between validation datasets.windows <- forecastML::create_windows(data_train, window_length = 365, skip = 730,
include_partial_window = FALSE)
p <- plot(windows, data_train) + theme(legend.position = "none")
p
group_filter = "buoy_id == 1"
argument to get a closer look at 1 of our 14 time-series. The user-supplied filter is passed to dplyr::filter()
internally.create_lagged_df(..., type = "train")
(e.g., my_lagged_df$horizon_h),train_model()
predict()
function.Any data transformations, hyperparameter tuning, or inner loop cross-validation procedures should take place within this function, with the limitation that it ultimately needs to return()
a model suitable for the user-defined predict()
function; a list can be returned to capture meta-data such as hyperparameter results.
xgboost
-specific input datasets are created within this wrapper function.# The value of outcome_col can also be set in train_model() with train_model(outcome_col = 1).
model_function <- function(data, outcome_col = 1) {
# xgboost cannot handle missing outcomes data so we'll remove it.
data <- data[!is.na(data[, outcome_col]), ]
# 1 fixed validation dataset for early stopping. Ideally, we'd have many.
# We'll use an 80/20 split.
indices <- 1:nrow(data)
set.seed(224)
train_indices <- sample(1:nrow(data), ceiling(nrow(data) * .8), replace = FALSE)
test_indices <- indices[!(indices %in% train_indices)]
data_train <- xgboost::xgb.DMatrix(data = as.matrix(data[train_indices,
-(outcome_col), drop = FALSE]),
label = as.matrix(data[train_indices,
outcome_col, drop = FALSE]))
data_test <- xgboost::xgb.DMatrix(data = as.matrix(data[test_indices,
-(outcome_col), drop = FALSE]),
label = as.matrix(data[test_indices,
outcome_col, drop = FALSE]))
params <- list("objective" = "reg:linear")
watchlist <- list(train = data_train, test = data_test)
set.seed(224)
model <- xgboost::xgb.train(data = data_train, params = params,
max.depth = 8, nthread = 2, nrounds = 30,
metrics = "rmse", verbose = 0,
early_stopping_rounds = 5,
watchlist = watchlist)
return(model)
}
This should take ~1 minute to train our ‘3 forecast horizons’ * ‘3 validation datasets’ = 9 models.
The user-defined modeling wrapper function could be much more elaborate, in which case many more models could potentially be trained here.
These models could be trained in parallel on any OS with the very flexible future
package by uncommenting the code below and setting use_future = TRUE
. To avoid nested parallelization, models are either trained in parallel across forecast horizons or validation windows, whichever is longer (when equal, the default is parallel across forecast horizons).
#future::plan(future::multiprocess) # Multi-core or multi-session parallel training.
model_results_cv <- forecastML::train_model(lagged_df = data_train,
windows = windows,
model_name = "xgboost",
model_function = model_function,
use_future = FALSE)
## [1] "The class of `model_results_cv` is forecast_model"
## [2] "The class of `model_results_cv` is list"
xgboost
model for any horizon or validation window. Here, we show a summary()
of the model for the first validation window which is 2012.## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 308719 -none- raw
## best_iteration 1 -none- numeric
## best_ntreelimit 1 -none- numeric
## best_score 1 -none- numeric
## niter 1 -none- numeric
## evaluation_log 3 data.table list
## call 10 -none- call
## params 5 -none- list
## callbacks 2 -none- list
## feature_names 128 -none- character
## nfeatures 1 -none- numeric
First, we’ll visually evaluate our model’s performance across our validation datasets.
Then we’ll forecast with each of our 9 models to get a sense of the stability of the forecasts produced from models trained on different subsets of our historical data.
The following user-defined prediction function is needed for each model:
data.frame
of the model features from forecastML::create_lagged_df(..., type = "train")
.data.frame
of predictions with 1 or 3 columns. A 1-column data.frame will produce point forecasts, and a 3-column data.frame can be used to return point, lower, and upper forecasts (column names and order do not matter).prediction_function <- function(model, data_features) {
x <- xgboost::xgb.DMatrix(data = as.matrix(data_features))
data_pred <- data.frame("y_pred" = predict(model, x))
return(data_pred)
}
prediction_function <- list(prediction_function)
data_pred_cv <- predict(model_results_cv, prediction_function = prediction_function, data = data_train)
print(paste0("The class of `data_pred_cv` is ", class(data_pred_cv)))
## [1] "The class of `data_pred_cv` is training_results"
## [2] "The class of `data_pred_cv` is forecast_model"
## [3] "The class of `data_pred_cv` is data.frame"
The predictions are solid lines; the actuals are dashed lines.
We’ll filter this plot for closer inspection below.
It’s somewhat difficult to see how we’ve done here, so we’ll use the group_filter
argument again to focus on specific results. Note that the plot()
function returns a ggplot
object so that we can easily modify our plot.
Notice that during the first half of 2015 there were no wind speed measurements recorded; however, we’re getting some interesting variability in predictions here because the day of the month and year dynamic features have captured some seasonality.
p <- plot(data_pred_cv, group_filter = "buoy_id %in% c(1, 2, 3)", horizons = 7)
p <- p + theme(legend.position = "none")
p <- p + scale_x_date(limits = c(as.Date("2015-01-01"), as.Date("2015-12-31")))
p <- p + facet_grid(horizon ~ buoy_id) + ggtitle(paste0(p$labels$title, " and buoy ID"))
p <- p + theme(axis.text.x = element_text(angle = 90))
p
type <- "forecast" # Create a forecasting dataset for our predict() function.
data_forecast <- forecastML::create_lagged_df(data, type = type, outcome_col = outcome_col,
horizons = horizons, lookback = lookback,
dates = dates, frequency = frequency,
dynamic_features = dynamic_features,
groups = groups, static_features = static_features,
use_future = FALSE)
print(paste0("The class of `data_forecast` is ", class(data_forecast)))
## [1] "The class of `data_forecast` is grouped_lagged_df"
## [2] "The class of `data_forecast` is lagged_df"
## [3] "The class of `data_forecast` is list"
forecastML
to autofill the future values of dynamic, non-lagged features so we’ll simply do it manually below.Now we’ll forecast 1, 1:7, and 1:30 days into the future with predict(..., data = data_forecast)
.
The first time step into the future is max(dates) + 1 * frequency
. Here, this is 12-31-2018 + 1 * ‘1 day’ or 1-1-2019.
data_forecasts <- predict(model_results_cv, prediction_function = prediction_function,
data = data_forecast)
print(paste0("The class of `data_forecasts` is ", class(data_forecasts)))
## [1] "The class of `data_forecasts` is forecast_results"
## [2] "The class of `data_forecasts` is forecast_model"
## [3] "The class of `data_forecasts` is data.frame"
xgboost
here–and time horizon. The separate lines are for the combination of buoy_id
* validation window.buoy_id == 1
. In this case, a validation window shows how the model would forecast if it were not trained on data in a given time frame.The modeling steps are more or less the same as in the nested cross-validation modeling described above so we’ll skip the explanations from here on out.
Notice that by this point in the modeling process, the optimal hyperparameters that gave the best performance on the outer-loop validation datasets have already been identified (see the package overview vignette). Incorporating the optimal hyperparameters in a final model would occur in a new user-defined modeling wrapper function.
Train across all data by setting window_length = 0
.
windows <- forecastML::create_windows(data_train, window_length = 0)
p <- plot(windows, data_train) + theme(legend.position = "none")
p
# Un-comment the code below and set 'use_future' to TRUE.
#future::plan(future::multiprocess)
model_results_no_cv <- forecastML::train_model(lagged_df = data_train,
windows = windows,
model_name = "xgboost",
model_function = model_function,
use_future = FALSE)
## [1] "The class of `model_results_no_cv` is forecast_model"
## [2] "The class of `model_results_no_cv` is list"
data_forecasts <- predict(model_results_no_cv, prediction_function = prediction_function,
data = data_forecast)
print(paste0("The class of `data_forecasts` is ", class(data_forecasts)))
## [1] "The class of `data_forecasts` is forecast_results"
## [2] "The class of `data_forecasts` is forecast_model"
## [3] "The class of `data_forecasts` is data.frame"