Dynamic linear regression models are an extension of basic linear regression models: instead of constant but unknown regression coefficients, the underlying coefficients are assumed to vary over “time” according to a random walk. These types of models allow robust modelling of phenomena where the effects of the predictor variables on the response variable can vary during the period of the study. The R [@R] package walker provides an efficient method for fully Bayesian inference of such models, where the main computations are performed using state-of-the-art Markov chain Monte Carlo (MCMC) algorithms provided by Stan [@stan; @rstan]. This also allows the straightforward use of many diagnostic and graphical tools provided by several Stan-related R packages such as ShinyStan [@shinystan].
More specifically, the dynamic regression model is defined as
\[
\begin{aligned}
y_t &= x'_t \beta_t + \epsilon_t, \quad t = 1,\ldots, n,\\
\beta_{t+1} &= \beta_t + \eta_t,
\end{aligned}
\]
where \(y_t\) is the observation at time \(t\), \(x_t\) contains the corresponding predictor variables, \(\beta_t\) is a \(k\)-dimensional vector of regression coefficients at time \(t\), \(\epsilon_t \sim N(0, \sigma^2_{\epsilon})\), and \(\eta_t \sim N(0, D)\), with \(D\) being a \(k \times k\) diagonal matrix with diagonal elements \(\sigma^2_{i,\eta}\), \(i=1,\ldots,k\). Denote the unknown parameters of the model by \(\beta = (\beta_1, \ldots, \beta_n)\) and \(\sigma = (\sigma_{\epsilon}, \sigma_{1, \eta}, \ldots, \sigma_{k, \eta})\). We define the prior for the first state \(\beta_1\) as \(N(\mu_{\beta_1}, \sigma_{\beta_1})\), and, with a slight abuse of notation, the priors for the standard deviations as \(\sigma_i \sim N(\mu_{\sigma_i}, \sigma_{\sigma_i})\), \(i=1,\ldots,k+1\), truncated to positive values.
Although in principle writing the dynamic regression model above in the Stan language is straightforward, the most intuitive implementations are computationally inefficient and prone to severe convergence problems in the underlying MCMC algorithm. The approach used by walker is instead based on marginalizing out the regression coefficients \(\beta\) during the MCMC sampling by using the Kalman filter, which provides fast and accurate inference of the marginal posterior \(p(\sigma | y)\). The corresponding joint posterior \(p(\sigma, \beta | y) = p(\beta | \sigma, y)p(\sigma | y)\) can then be obtained by simulating the regression coefficients given the sampled standard deviations, using a Kalman smoothing based simulation algorithm such as that of [@durbin-koopman2002]. Note that we have opted to sample the \(\beta\) parameters given the \(\sigma\)'s, but it is also possible to obtain somewhat more accurate summary statistics, such as the mean and variance of these parameters, by using the standard Kalman smoother for the computation of \(\textrm{E}(\beta| \sigma, y)\) and \(\textrm{Var}(\beta| \sigma, y)\) and applying the law of total expectation.
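To illustrate the idea, the following is a minimal R sketch of a Kalman filter evaluating the marginal log-likelihood \(\log p(y | \sigma)\) of the model above with the \(\beta\)'s integrated out. This is only an illustration of the technique; walker's actual implementation is written in Stan, and the function and argument names here are hypothetical:
# Marginal log-likelihood log p(y | sigma) of the dynamic regression model,
# with the time-varying coefficients beta integrated out via the Kalman filter.
# (Illustrative sketch only, not walker's internal implementation.)
kalman_loglik <- function(y, x, sigma_y, sigma_b, beta1_mean = 0, beta1_sd = 5) {
  n <- length(y)
  k <- ncol(x)
  a <- rep(beta1_mean, k)       # prior mean of beta_1
  P <- diag(beta1_sd^2, k)      # prior covariance of beta_1
  D <- diag(sigma_b^2, k)       # covariance of the random walk noise eta_t
  loglik <- 0
  for (t in seq_len(n)) {
    xt <- x[t, ]
    v <- y[t] - sum(xt * a)                  # one-step prediction error
    Ft <- drop(xt %*% P %*% xt) + sigma_y^2  # prediction error variance
    K <- drop(P %*% xt) / Ft                 # Kalman gain
    loglik <- loglik - 0.5 * (log(2 * pi) + log(Ft) + v^2 / Ft)
    a <- a + K * v                           # predicted mean of beta_{t+1}
    P <- P - outer(K, K) * Ft + D            # predicted covariance of beta_{t+1}
  }
  loglik
}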
Let us consider observations \(y\) of length \(n=100\), generated by a random walk (i.e. a time-varying intercept) and two predictors. This is a rather small problem, but it was chosen in order to make comparisons with the “naive” implementation possible. For larger problems (in terms of the number of observations and especially the number of predictors) it is very difficult to get the naive implementation to work at all: even after tweaking several parameters of the underlying MCMC sampler, one typically ends up with divergent transitions or low BFMI, meaning that the results are not to be trusted.
First we simulate the coefficients and the predictors:
set.seed(1)
n <- 100
# Time-varying regression coefficients as random walks:
beta1 <- cumsum(c(0.5, rnorm(n - 1, 0, sd = 0.05)))
beta2 <- cumsum(c(-1, rnorm(n - 1, 0, sd = 0.15)))
# Predictors:
x1 <- n:1 / 10
x2 <- cos(1:n)
# Random walk acting as a time-varying intercept:
u <- cumsum(rnorm(n, 0, 0.5))
ts.plot(cbind(u, beta1 * x1, beta2 * x2), col = 1:3)
# Observations are the signal plus Gaussian noise:
signal <- u + beta1 * x1 + beta2 * x2
y <- rnorm(n, signal, 0.5)
ts.plot(signal)
lines(y, col = 2)
Then we can call the function walker. The model is defined via a formula as in lm, and we can pass several arguments to the sampling method of rstan, such as the number of iterations iter and the number of chains chains (the default values for these are 2000 and 4). In addition to these, we use the arguments beta_prior and sigma_prior, which define the prior distributions for \(\beta\) and \(\sigma\), respectively. These arguments should be two-column matrices, where the first column defines the prior means and the second column the prior standard deviations.
kalman_walker <- walker(y ~ x1 + x2, refresh = 0, chains = 2,
beta_prior = cbind(0, rep(5, 3)), sigma_prior = cbind(0, rep(2, 4)))
##
## Gradient evaluation took 0.000426 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 4.26 seconds.
## Adjust your expectations accordingly!
##
##
##
## Elapsed Time: 4.38233 seconds (Warm-up)
## 6.12298 seconds (Sampling)
## 10.5053 seconds (Total)
## The following numerical problems occurred the indicated number of times on chain 1
## count
## Exception thrown at line -1: Exception thrown at line -1: multiply: A[1] is -nan, but must not be na 1
## When a numerical problem occurs, the Hamiltonian proposal gets rejected.
## See http://mc-stan.org/misc/warnings.html#exception-hamiltonian-proposal-rejected
## If the number in the 'count' column is small, there is no need to ask about this message on stan-users.
##
## Gradient evaluation took 0.000379 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 3.79 seconds.
## Adjust your expectations accordingly!
##
##
##
## Elapsed Time: 4.32755 seconds (Warm-up)
## 3.72021 seconds (Sampling)
## 8.04776 seconds (Total)
print(kalman_walker, pars = c("sigma_y", "sigma_b"))
## Inference for Stan model: rw_model.
## 2 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## sigma_y 0.55 0.00 0.09 0.39 0.50 0.55 0.60 0.72 704 1
## sigma_b[1] 0.35 0.01 0.15 0.07 0.25 0.35 0.45 0.66 513 1
## sigma_b[2] 0.06 0.00 0.03 0.00 0.05 0.06 0.08 0.11 429 1
## sigma_b[3] 0.21 0.00 0.09 0.08 0.14 0.19 0.26 0.43 1081 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jun 15 15:09:12 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
library("rstan")
stan_plot(kalman_walker, pars = c("sigma_y", "sigma_b"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
We often get a few (typically one) warning messages about numerical problems while the sampling algorithm warms up, but this is nothing to be concerned about (if more errors occur, then a GitHub issue for the walker package is more than welcome).
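As noted in the introduction, the fitted model can also be explored interactively with ShinyStan; a minimal example, assuming the shinystan package is installed:
library("shinystan")
# Opens an interactive browser-based interface for MCMC diagnostics:
launch_shinystan(kalman_walker)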
Using the summary method we can pick up the posterior summaries corresponding to the \(\beta\)'s, and we can for example plot the posterior mean paths and 95% intervals with ts.plot (the solid lines correspond to the true values, the dashed lines to the estimates):
betas <- summary(kalman_walker, "beta")$summary
ts.plot(cbind(u, beta1, beta2,
matrix(betas[, c("mean", "2.5%", "97.5%")], ncol = 9)),
col = c(1:3, rep(1:3, 3)), lty = rep(1:2, times = c(3, 9)))
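If the raw posterior draws are needed instead of summaries, they can be picked up with the extract method of rstan (a small sketch; the dimension ordering of the returned array depends on how beta is declared in the underlying Stan model):
library("rstan")
# Raw posterior draws of the time-varying coefficient paths:
beta_samples <- extract(kalman_walker, pars = "beta")$beta
str(beta_samples)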
We can perform the same analysis with the naive implementation by setting the argument naive to TRUE:
naive_walker <- walker(y ~ x1 + x2, seed = 1, refresh = 0, chains = 2,
beta_prior = cbind(0, rep(5, 3)), sigma_prior = cbind(0, rep(2, 4)),
naive = TRUE, control = list(adapt_delta = 0.9, max_treedepth = 15))
##
## Gradient evaluation took 5.5e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.55 seconds.
## Adjust your expectations accordingly!
##
##
##
## Elapsed Time: 36.7643 seconds (Warm-up)
## 13.8379 seconds (Sampling)
## 50.6022 seconds (Total)
## The following numerical problems occurred the indicated number of times on chain 1
## count
## Exception thrown at line -1: normal_lpdf: Location parameter[2] is inf, but must be finite! 4
## When a numerical problem occurs, the Hamiltonian proposal gets rejected.
## See http://mc-stan.org/misc/warnings.html#exception-hamiltonian-proposal-rejected
## If the number in the 'count' column is small, there is no need to ask about this message on stan-users.
##
## Gradient evaluation took 6.5e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.65 seconds.
## Adjust your expectations accordingly!
##
##
##
## Elapsed Time: 41.5331 seconds (Warm-up)
## 29.035 seconds (Sampling)
## 70.5681 seconds (Total)
## The following numerical problems occurred the indicated number of times on chain 2
## count
## Exception thrown at line -1: normal_lpdf: Location parameter[2] is inf, but must be finite! 6
## When a numerical problem occurs, the Hamiltonian proposal gets rejected.
## See http://mc-stan.org/misc/warnings.html#exception-hamiltonian-proposal-rejected
## If the number in the 'count' column is small, there is no need to ask about this message on stan-users.
## Warning: There were 23 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
print(naive_walker, pars = c("sigma_y", "sigma_b"))
## Inference for Stan model: rw_model_naive.
## 2 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## sigma_y 0.55 0.01 0.09 0.39 0.49 0.55 0.60 0.72 250 1.00
## sigma_b[1] 0.35 0.01 0.15 0.06 0.25 0.36 0.46 0.64 125 1.03
## sigma_b[2] 0.06 0.00 0.03 0.01 0.05 0.06 0.08 0.11 150 1.03
## sigma_b[3] 0.21 0.00 0.09 0.08 0.15 0.19 0.26 0.44 436 1.00
##
## Samples were drawn using NUTS(diag_e) at Thu Jun 15 15:11:15 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
sum(get_elapsed_time(kalman_walker))
## [1] 18.55307
sum(get_elapsed_time(naive_walker))
## [1] 121.1703
With the naive implementation we get smaller effective sample sizes and a much higher computation time, as well as some indications of divergence problems, even with an adjusted step size (argument adapt_delta).
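To summarize the efficiency difference in a single number, we can also relate the effective sample sizes to the computation times (a sketch using the fitted objects above; higher values are better):
# Effective samples per second of computation for each standard deviation parameter:
ess_per_sec <- function(fit) {
  summary(fit, pars = c("sigma_y", "sigma_b"))$summary[, "n_eff"] /
    sum(get_elapsed_time(fit))
}
ess_per_sec(kalman_walker)
ess_per_sec(naive_walker)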
The walker function also returns samples from the posterior predictive distribution \(p(y^{\textrm{rep}} | y) = \int p(y^{\textrm{rep}} | \beta, \sigma, y) p(\beta, \sigma | y) \textrm{d}\beta\textrm{d}\sigma\). This can be used, for example, in assessing the model's “fit” to the data. By comparing the replicated series (mean and 95% quantiles in black) and the original observations (in red) we see a very good overlap, which is not that surprising given that we know the correct model:
y_rep <- summary(kalman_walker, "y_rep")$summary
ts.plot(y_rep[, c("mean", "2.5%", "97.5%")], lty = c(1, 2, 2))
lines(y, col = 2)
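Beyond visual inspection, the replicated series can also be used for simple numerical checks, for example a posterior predictive p-value for the standard deviation of the series (a sketch using the extract method on the y_rep samples shown above):
# Each row of y_rep_draws is one replicated series of length n:
y_rep_draws <- extract(kalman_walker, pars = "y_rep")$y_rep
# Proportion of replications with larger sd than the data;
# values very close to 0 or 1 would indicate model misfit:
mean(apply(y_rep_draws, 1, sd) > sd(y))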
It is also possible to perform actual predictions given new covariates \(x^{new}\) (currently we need to run the whole MCMC procedure for this, but a separate function which uses the output of walker will likely be added in the future):
original_data <- data.frame(y = head(y, 95), x1 = head(x1, 95), x2 = head(x2, 95))
new_data <- data.frame(x1 = tail(x1, 5), x2 = tail(x2, 5))
walker_predict <- walker(y ~ x1 + x2, data = original_data, newdata = new_data,
iter = 2000, chains = 1, seed = 1, refresh = 0,
beta_prior = cbind(0, rep(2, 3)), sigma_prior = cbind(0, rep(2, 4)))
##
## Gradient evaluation took 0.000366 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 3.66 seconds.
## Adjust your expectations accordingly!
##
##
##
## Elapsed Time: 4.43246 seconds (Warm-up)
## 5.47436 seconds (Sampling)
## 9.90682 seconds (Total)
intervals <- summary(walker_predict, pars = "y_new")$summary[, c("mean", "2.5%", "97.5%")]
ts.plot(ts(y), ts(intervals, start = 96),
col = c(1, 2, 2, 2), lty = c(1, 1, 2, 2))
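Since the last five observations were held out from the model fit, we can also compare the predictions against them numerically; a crude check given only five held-out points (a sketch reusing the intervals object computed above):
y_test <- tail(y, 5)
# Root mean squared error of the point predictions:
sqrt(mean((y_test - intervals[, "mean"])^2))
# Empirical coverage of the 95% prediction intervals:
mean(y_test > intervals[, "2.5%"] & y_test < intervals[, "97.5%"])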
In this vignette we illustrated the benefits of marginalization in the context of dynamic regression models. The underlying idea is not new; this approach is typical especially in classic Metropolis-type algorithms for linear-Gaussian state space models, where the marginal likelihood \(p(y | \theta)\) (with \(\theta\) denoting the hyperparameters, i.e. not the latent states such as the \(\beta\)'s in the current context) is used in the computation of the acceptance probability. Here, instead of building specific MCMC machinery, we rely on the readily available Hamiltonian Monte Carlo based Stan software, thus allowing us to enjoy the benefits of the diverse tools of the Stan community. Due to the restricted class of models considered, we can also simplify the underlying Kalman filter considerably, thus enabling efficient likelihood evaluation.
Although the package is already fully functional, some manual processing of the results is currently needed in order to extract and plot, for example, fitted values and prediction intervals from the output of walker. In the future, more straightforward methods for these tasks should be implemented. From a methodological perspective, one possible extension is support for generalized linear dynamic regression. Although in that case the Kalman filter recursions are not directly applicable, efficient Laplace approximations for these types of models are available, and the possible bias can be efficiently corrected in a post-processing step using an importance sampling type correction [@vihola-helske-franks].