%%\VignetteIndexEntry{User guide to msm with worked examples}
%\VignettePackage{msm}
\documentclass{article}
%% Need to modify Sweave.sty to pass pdftex option to graphicx.
%\usepackage{Sweave-local}
\usepackage{graphics}
\RequirePackage[T1]{fontenc}
%% Check if we are compiling under latex or pdflatex
\ifx\pdftexversion\undefined
\RequirePackage[dvips]{graphicx}
\else
\RequirePackage[pdftex]{graphicx}
\RequirePackage{epstopdf}
\fi
\RequirePackage{ae,fancyvrb}
\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
\setkeys{Gin}{width=0.8\textwidth}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl}
\newenvironment{Schunk}{}{}
\usepackage{times}
\usepackage{url}
\addtolength{\textwidth}{2cm}
\newcommand{\Exp}{\mathop{\mathrm{Exp}}}
\newcommand{\etal}{{\textit{et al.}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
%% version <- 1.3
<<echo=FALSE>>=
version <- gsub("Version: +", "",
                packageDescription("msm", lib.loc=c("../..",.libPaths()))$Version)
@
\title{Multi-state modelling with R: the {\tt msm} package
\vskip 0.2in
\large{Version
%
<<echo=FALSE,results=tex>>=
cat(version)
@
\vskip 0.1in
<<echo=FALSE,results=tex>>=
cat(format(Sys.time(), "%d %B, %Y"))
@
}}
\author{Christopher Jackson\\MRC Biostatistics Unit\\Cambridge, U.K.\\
\texttt{chris.jackson@mrc-bsu.cam.ac.uk}}
\date{} %% label with date when Rnw is compiled, not when tex is compiled. should be release date of package
\bibliographystyle{unsrt}

\begin{document}
\maketitle

\begin{abstract}
The multi-state Markov model is a useful way of describing a process in which
an individual moves through a series of states in continuous time.  The
\Rpackage{msm} package for R allows a general multi-state model to be fitted
to longitudinal data.  Data often consist of observations of the process at
arbitrary times, so that the exact times when the state changes are
unobserved.  For example, the progression of chronic diseases is often
described by stages of severity, and the state of the patient may only be
known at doctor or hospital visits.  Features of \Rpackage{msm} include the
ability to model transition rates and hidden Markov output models in terms of
covariates, and the ability to model data with a variety of observation
schemes, including censored states.

Hidden Markov models, in which the true path through states is only observed
through some error-prone marker, can also be fitted.  The observation is
generated, conditionally on the underlying states, via some distribution.  An
example is a screening misclassification model in which states are observed
with error.  More generally, hidden Markov models can have a continuous
response, with some arbitrary distribution, conditionally on the underlying
state.

This manual introduces the theory behind multi-state Markov and hidden Markov
models, and gives a tutorial in the typical use of the \Rpackage{msm} package,
illustrated by some typical applications to modelling chronic diseases.

Much of the material in this manual is published, in a more concise form, in
Journal of Statistical Software (2011) 38(8):1--29,
\url{http://www.jstatsoft.org/v38/i08/}
\end{abstract}

\section{Multi-state models}
\label{sec:multistate}

\subsection{Introduction}
\label{sec:intro}

Figure \ref{fig:multi} illustrates a multi-state model in continuous time.
Its four states are labelled {\bf 1, 2, 3, 4}.  At a time $t$ the individual
is in state $S(t)$.  The arrows show which transitions are possible between
states.  The next state to which the individual moves, and the time of the
change, are governed by a set of {\em transition intensities} $q_{rs}(t,
z(t))$ for each pair of states $r$ and $s$.  The intensities may also depend
on the time of the process $t$, or more generally a set of individual-specific
or time-varying explanatory variables $z(t)$.  The intensity represents the
instantaneous risk of moving from state $r$ to state $s$:
\begin{equation}
\label{eq:multi:intensity}
q_{rs}(t, z(t)) = \lim_{\delta t \rightarrow 0} P (S(t+\delta t) = s | S(t) = r) / \delta t
\end{equation}
The intensities form a matrix $Q$ whose rows sum to zero, so that the diagonal
entries are defined by $q_{rr} = - \sum_{s \neq r} q_{rs}$.  To fit a
multi-state model to data, we estimate this transition intensity matrix.  We
concentrate on {\em Markov} models here.  The Markov assumption is that future
evolution only depends on the current state.  That is, $q_{rs}(t, z(t),
\mathcal{F}_t)$ is independent of $\mathcal{F}_t$, the observation history of
the process up to the time preceding $t$.  See, for example, Cox and
Miller\cite{cox:miller} for a thorough introduction to the theory of
continuous-time Markov chains.

\begin{figure}[p]
  \begin{center}
    \scalebox{0.4}{\includegraphics{figures/multistate}}
\[ Q =
\left(
\begin{array}{llll}
q_{11} & q_{12} & q_{13} & q_{14}\\
q_{21} & q_{22} & q_{23} & q_{24}\\
q_{31} & q_{32} & q_{33} & q_{34}\\
q_{41} & q_{42} & q_{43} & q_{44}\\
\end{array}
\right )
\]
    \caption{\label{fig:multi}General multi-state model.}
  \end{center}
\end{figure}

In a time-homogeneous continuous-time Markov model, a single period of
occupancy (or \emph{sojourn time}) in state $r$ has an exponential
distribution, with rate $-q_{rr}$ (and mean $-1 / q_{rr}$).  The remaining
elements of the $r$th row of $Q$ are \emph{proportional to} the probabilities
governing the next state after $r$ to which the individual makes a transition.
The probability that the individual's next move from state $r$ is to state $s$
is $-q_{rs} / q_{rr}$.

\subsection{Disease progression models}

The development of the \Rpackage{msm} package was motivated by applications to
disease modelling.  Many chronic diseases have a natural interpretation in
terms of staged progression.  Multi-state Markov models in continuous time are
often used to model the course of diseases.  A commonly-used model is
illustrated in Figure \ref{fig:disease}.  This represents a series of
successively more severe disease stages, and an `absorbing' state, often
death.  The patient may advance into or recover from adjacent disease stages,
or die at any disease stage.  Observations of the state $S_i(t)$ are made on a
number of individuals $i$ at arbitrary times $t$, which may vary between
individuals.  The stages of disease may be modelled as a homogeneous
continuous-time Markov process, with a transition intensity matrix $Q$ of the
form shown in Figure \ref{fig:disease}.

A commonly-used model is the \emph{illness-death} model, with three states
representing health, illness and death (Figure \ref{fig:multi:illdeath}).
Transitions are permitted from health to illness, illness to death and health
to death.  Recovery from illness to health is sometimes also considered.
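For intuition, the following small sketch in R (with hypothetical intensities
for the illness-death model with recovery, just described) computes the mean
sojourn times and next-state probabilities implied by a given $Q$:
\begin{Scode}
## Hypothetical intensities (per year) for an illness-death model
## with recovery: states 1 = health, 2 = illness, 3 = death
Q <- rbind(c(-0.15,  0.10,  0.05),
           c( 0.20, -0.30,  0.10),
           c( 0.00,  0.00,  0.00))  # death is absorbing: row of zeros
-1 / diag(Q)[1:2]     # mean sojourn times in states 1 and 2
-Q[1, -1] / Q[1, 1]   # P(next state is 2 or 3 | leaving state 1)
\end{Scode}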
A wide range of medical situations has been modelled using multi-state
methods, for example, screening for abdominal aortic aneurysms (Jackson
\etal\cite{jackson:sharples:2003}), problems following lung transplantation
(Jackson and Sharples\cite{jackson:sharples:2002}), problems following heart
transplantation (Sharples\cite{sharp:gibbs}, Klotz and
Sharples\cite{klotz:est}), hepatic cancer (Kay\cite{kay:mark}), HIV infection
and AIDS (Longini \etal\cite{long1}, Satten and Longini\cite{sattlong},
Guihenneuc-Jouyaux \etal\cite{rich}, Gentleman \etal\cite{gentlaw}), diabetic
complications (Marshall and Jones\cite{retino2}, Andersen\cite{andersen88}),
breast cancer screening (Duffy and Chen\cite{duffy95}, Chen
\etal\cite{duffy96}), cervical cancer screening (Kirby and
Spiegelhalter\cite{kirby:spiegelhalter}) and liver cirrhosis (Andersen
\etal\cite{ander:proth}).  Many of these references also describe the
mathematical theory, which will be reviewed in the following sections.

\begin{figure}[p]
  \centering
  \vskip 1cm
  \scalebox{1.0}{\includegraphics{figures/general}}
\[ Q =
\left(
\begin{array}{llllll}
q_{11} & q_{12} & 0 & 0 & \ldots & q_{1n}\\
q_{21} & q_{22} & q_{23} & 0 & \ldots & q_{2n}\\
0 & q_{32} & q_{33} & q_{34} & \ddots & q_{3n}\\
0 & 0 & q_{43} & q_{44} & \ddots & q_{4n}\\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots\\
0 & 0 & 0 & 0 & \ldots & 0\\
\end{array}
\right )
\]
  \caption{\label{fig:disease}General model for disease progression.}
\end{figure}

\begin{figure}[p]
  \begin{center}
    \scalebox{0.4}{\includegraphics{figures/illdeath}}
    \caption{Illness-death model.}
    \label{fig:multi:illdeath}
  \end{center}
\end{figure}

\subsection{Arbitrary observation times}
\label{sec:arbitr-observ-times}

Longitudinal data from monitoring disease progression are often incomplete in
some way.  Usually patients are seen at intermittent follow-up visits, at
which monitoring information is collected, but information from the periods
between visits is not available.  Often the exact time of disease onset is
unknown.  Thus, the changes of state in a multi-state model usually occur at
unknown times.  Also a subject may only be followed up for a portion of their
disease history.  A fixed observation schedule may be specified in advance,
but in practice times of visits may vary due to patient and hospital
pressures.  The states of disease progression models often include death.
Death times are commonly recorded to within a day.  Also observations may be
censored.  For example, at the end of a study, an individual may be known only
to be alive, and in an unknown state.

A typical sampling situation is illustrated in Figure
\ref{fig:multi:sampling}.  The individual is observed on four occasions over
10 months.  The final occasion is the death date, which is recorded to within
a day.  The only other information available is the occupation of states 2, 2
and 1 at times 1.5, 3.5 and 5 months respectively.  The times of movement
between states and the state occupancy in between the observation times are
unknown.  Although the patient was in state 3 between 7 and 9 months, this was
not observed at all.

\paragraph{Informative sampling times} To fit a model to longitudinal data
with arbitrary sampling times we must consider the reasons why observations
were made at the given times.  This is analogous to the problem of missing
data, where the fact that a particular observation is missing may implicitly
give information about the value of that observation.  Possible observation
schemes include:
\begin{itemize}
\item {\em fixed}.
  Each patient is observed at fixed intervals specified in advance.
\item {\em random}.  The sampling times vary randomly, independently of the
  current state of the disease.
\item {\em doctor's care}.  More severely ill patients are monitored more
  closely.  The next sampling time is chosen on the basis of the current
  disease state.
\item patient {\em self-selection}.  A patient may decide to visit the doctor
  on occasions when they are in a poor condition.
\end{itemize}

Gr\"uger \etal \cite{gruger:valid} discussed conditions under which sampling
times are \emph{informative}.  If a multi-state model is fitted, ignoring the
information available in the sampling times, then inference may be biased.
Mathematically, because the sampling times are often themselves random, they
should be modelled along with the observation process.  However, the ideal
situation is where the joint likelihood for the times and the process is
proportional to the likelihood obtained if the sampling times were fixed in
advance.  Then the parameters of the process can be estimated independently of
the parameters of the sampling scheme.  In particular, they showed that fixed,
random and doctor's care observation policies are not informative, whereas
patient self-selection is informative.  Note that \Rpackage{msm} does not deal
with informative sampling times.  See, e.g., \cite{sweeting:inform:msm} for
some methods in this case, which require specialised programming.

\begin{figure}[p]
  \begin{center}
    \scalebox{0.6}{\rotatebox{270}{\includegraphics{figures/sampling}}}
    \caption{Evolution of a multi-state model.  The process is observed on
      four occasions.}
    \label{fig:multi:sampling}
  \end{center}
\end{figure}

\subsection{Likelihood for the multi-state model}
\label{sec:multi:likelihood}

Kalbfleisch and Lawless\cite{kalbfleisch:lawless} and later Kay
\cite{kay:mark} described a general method for evaluating the likelihood for a
multi-state model in continuous time, applicable to any form of transition
matrix.  The only available information is the observed state at a set of
times, as in Figure \ref{fig:multi:sampling}.  The sampling times are assumed
to be non-informative.

\paragraph{Transition probability matrix} The likelihood is calculated from
the \emph{transition probability matrix} $P(t)$.  For a time-homogeneous
process, the $(r,s)$ entry of $P(t)$, $p_{rs}(t)$, is the probability of being
in state $s$ at a time $t+u$ in the future, given the state at time $u$ is
$r$.  It does not say anything about the time of transition from $r$ to $s$;
indeed the process may have entered other states between times $u$ and $t+u$.
$P(t)$ can be calculated by taking the matrix exponential of the scaled
transition intensity matrix (see, for example, Cox and Miller
\cite{cox:miller}).
\begin{equation}
\label{eq:exptq}
P(t) = \Exp(tQ)
\end{equation}
The matrix exponential $\Exp$ is different from a scalar exponential.  It is
defined by the same power series $\Exp(X) = I + X + X^2/2! + X^3/3! + \ldots$
as the scalar exponential, where $I$ is the identity matrix, except that each
power $X^k$ is defined by matrix products, not element-wise scalar
multiplication.  It is notoriously difficult to calculate reliably, as
discussed by Moler and van Loan \cite{matrixexp}.

For simpler models, it is feasible to calculate an analytic expression for
each element of $P(t)$ in terms of $Q$.  This is generally faster and avoids
the potential numerical instability of calculating the matrix exponential.
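Numerically, equation \ref{eq:exptq} can be checked with the
\Rfunction{MatrixExp} function supplied with \Rpackage{msm}, here reusing the
hypothetical illness-death $Q$ shown earlier:
\begin{Scode}
library(msm)
P2 <- MatrixExp(Q, t=2)  # P(2) = Exp(2Q): transition probabilities over 2 years
rowSums(P2)              # each row of P(t) sums to 1
\end{Scode}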
Symbolic algebra software, such as Mathematica, can be helpful for obtaining
these analytic expressions.  For example, the three-state illness-death model
with no recovery has a transition intensity matrix of
\[ Q =
\left(
\begin{array}{lll}
-(q_{12} + q_{13}) & q_{12} & q_{13}\\
0 & - q_{23} & q_{23}\\
0 & 0 & 0\\
\end{array}
\right )
\]
The corresponding time $t$ transition probabilities are
\begin{eqnarray*}
p_{11}(t) & = & e^{-(q_{12} + q_{13})t}\\
p_{12}(t) & = & \left\{
\begin{array}{ll}
\frac{ q_{12} }{q_{12} + q_{13} - q_{23} } (e^{-q_{23} t} - e^{-(q_{12} + q_{13})t}) & (q_{12} + q_{13} \neq q_{23})\\
q_{12}te^{-(q_{12} + q_{13})t} & (q_{12} + q_{13} = q_{23})
\end{array}
\right. \\
p_{13}(t) & = & \left\{
\begin{array}{ll}
1 - e^{-(q_{12} + q_{13})t} - \frac{ q_{12} }{q_{12} + q_{13} - q_{23} } (e^{-q_{23} t} - e^{-(q_{12} + q_{13})t}) & (q_{12} + q_{13} \neq q_{23})\\
(-1 + e^{(q_{12} + q_{13})t} - q_{12}t)e^{-(q_{12} + q_{13})t} & (q_{12} + q_{13} = q_{23})
\end{array}
\right. \\
p_{21}(t) & = & 0\\
p_{22}(t) & = & e^{-q_{23}t}\\
p_{23}(t) & = & 1 - e^{-q_{23}t}\\
p_{31}(t) & = & 0\\
p_{32}(t) & = & 0\\
p_{33}(t) & = & 1\\
\end{eqnarray*}
The \Rpackage{msm} package calculates $P(t)$ analytically for selected 2, 3, 4
and 5-state models, illustrated in Figures \ref{fig:anp2}--\ref{fig:anp5}.
For other models, which can have any transition structure on any number of
states in principle, $P(t)$ is determined from the matrix exponential.  This
is calculated using eigensystem decomposition (if eigenvalues are distinct) or
a method based on Pad\'e approximants with scaling and squaring
\cite{matrixexp} (if there are repeated eigenvalues).

Notice that the states are not labelled in these figures.  Each graph can
correspond to several different $Q$ matrices, depending on how the states are
labelled.  For example, Figure~\ref{fig:anp2} a) illustrates the model defined
by either
\( Q =
\left(
\begin{array}{ll}
- q_{12} & q_{12} \\
0 & 0
\end{array}
\right)
\)
or
\( Q =
\left(
\begin{array}{ll}
0 & 0\\
q_{21} & - q_{21}
\end{array}
\right)
\).

\begin{figure}
  \begin{center}
    a) \includegraphics[width=3cm]{figures/p2q1}
    \hskip 3cm
    b) \includegraphics[width=3cm]{figures/p2q12}
  \end{center}
  \caption{\label{fig:anp2}Two-state models fitted using analytic $P(t)$
    matrices in \Rpackage{msm}.  Implemented for all permutations of state
    labels 1, 2.}
\end{figure}

\begin{figure}
  \begin{center}
    a) \includegraphics[width=3cm]{figures/p3q12} \hskip 3cm
    b) \includegraphics[width=5cm]{figures/p3q14} \vskip 1cm
    c) \includegraphics[width=3cm]{figures/p3q16} \hskip 3cm
    d) \includegraphics[width=3cm]{figures/p3q124} \vskip 1cm
    e) \includegraphics[width=3cm]{figures/p3q135} \hskip 3cm
    f) \includegraphics[width=3cm]{figures/p3q1246}
  \end{center}
  \caption{\label{fig:anp3}Three-state models fitted using analytic $P(t)$
    matrices in \Rpackage{msm}.  Implemented for all permutations of state
    labels 1, 2, 3.}
\end{figure}

\begin{figure}
  \begin{center}
    a) \includegraphics[width=7cm]{figures/p4q159} \vskip 1cm
    b) \includegraphics[width=7cm]{figures/p4q13569}
  \end{center}
  \caption{\label{fig:anp4}Four-state models fitted using analytic $P(t)$
    matrices in \Rpackage{msm}.
    Implemented for all permutations of state labels 1, 2, 3, 4.}
\end{figure}

\begin{figure}
  \begin{center}
    a) \includegraphics[width=9cm]{figures/p5q1_6_11_16} \vskip 1cm
    b) \includegraphics[width=9cm]{figures/p5q1_4_6_8_11_12_16} \vskip 1cm
    c) \includegraphics[width=5cm]{figures/p5q1_6_7_11_12}
  \end{center}
  \caption{\label{fig:anp5}Five-state models fitted using analytic $P(t)$
    matrices in \Rpackage{msm}.  Implemented for all permutations of state
    labels 1, 2, 3, 4, 5.}
\end{figure}

\paragraph{Likelihood for intermittently-observed processes} Suppose $i$
indexes $M$ individuals.  The data for individual $i$ consist of a series of
times $(t_{i1}, \ldots, t_{in_i})$ and corresponding states $(S(t_{i1}),
\ldots, S(t_{in_i}))$.  Consider a general multi-state model, with a pair of
successive observed disease states $S(t_j), S(t_{j+1})$ at times $t_j,
t_{j+1}$.  The contribution to the likelihood from this pair of states is
\begin{equation}
\label{eq:multi:lik:contrib}
L_{i, j} = p_{S(t_j)S(t_{j+1})}(t_{j+1} - t_j)
\end{equation}
This is the entry of the transition matrix $P(t)$ at the $S(t_j)$th row and
$S(t_{j+1})$th column, evaluated at $t = t_{j+1} - t_j$.

The full likelihood $L(Q)$ is the product of all such terms $L_{i,j}$ over all
individuals and all transitions.  It depends on the unknown transition matrix
$Q$, which was used to determine $P(t)$.

\paragraph{Exactly-observed death times} In observational studies of chronic
diseases, it is common that the time of death is known, but the state at the
instant before death is unknown.  If $S(t_{j+1}) = D$ is such a death state,
then the contribution to the likelihood is summed over the unknown state $m$
at the instant before death:
\begin{equation}
\label{eq:multi:lik:death}
L_{i, j} = \sum_{m \neq D} p_{S(t_j),m}(t_{j+1} - t_j) q_{m, D}
\end{equation}
The sum is taken over all possible states $m$ which can be visited between
$S(t_j)$ and $D$.

\paragraph{Exactly-observed transition times} If the times $(t_{i1}, \ldots,
t_{in_i})$ had been the {\em exact} transition times between the states, with
no transitions between the observation times, then the contributions would be
of the form
\begin{equation}
\label{eq:multi:lik:exact}
L_{i, j} = \exp(q_{S(t_j)S(t_{j})}(t_{j+1} - t_j)) q_{S(t_j)S(t_{j+1})}
\end{equation}
since the state is assumed to be $S(t_j)$ throughout the interval between
$t_j$ and $t_{j+1}$, with a known transition to state $S(t_{j+1})$ at
$t_{j+1}$.  \Rpackage{msm} is restricted to Markov models, but much richer
models are possible for this type of data.  For example, Putter \etal
\cite{putter:mstate} discussed the \Rpackage{mstate} software for
semi-parametric multi-state models with non-parametric baseline hazards and
Cox regression.  The Markov assumption is restrictive but necessary in general
to compute a likelihood for intermittently-observed processes.

\paragraph{Censored states} A censored quantity is one whose exact value is
unknown, but known to be in a certain interval.  For example, in survival
analysis, a death time is \emph{right-censored} if the study ends and the
patient is still alive, since the death time is known to be greater than the
end time.  In multi-state models for intermittently-observed processes, the
times of changes of state are usually \emph{interval censored}, known to be
within bounded intervals.  This leads to a likelihood based on equation
\ref{eq:multi:lik:contrib}.  In some circumstances, \emph{states} may be
censored as well as \emph{event times}.
For example, at the end of some chronic disease studies, patients are known to
be alive but in an \emph{unknown state}.  For such a censored observation
$S(t_{j+1})$ ($j+1 = n_i$), known only to be a state in the set $C$, the
equivalent contribution to the likelihood is
\begin{equation}
\label{eq:multi:lik:deathcens}
L_{i, j} = \sum_{m \in C} p_{S(t_j),m}(t_{j+1} - t_j)
\end{equation}
Note that this special likelihood is not needed if the state is known at the
end of the study.  In this case, likelihood \ref{eq:multi:lik:contrib}
applies.  Although the \emph{survival time} is censored, the \emph{state} at
the end of the study is not censored.

More generally, suppose every observation from a particular individual is
censored.  Observations $1, 2, \ldots, n_i$ are known only to be in the sets
$C_1, C_2, \ldots, C_{n_i}$ respectively.  The likelihood for this individual
is a sum of the likelihoods of all possible paths through the unobserved
states.
\begin{equation}
\label{eq:multi:lik:cens}
L_i = \sum_{s_{n_i} \in C_{n_i}} \ldots \sum_{s_2 \in C_2} \sum_{s_1 \in C_1}
p_{s_1 s_2}(t_2 - t_1) p_{s_2 s_3} (t_3 - t_2) \ldots
p_{s_{n_i-1} s_{n_i}} (t_{n_i} - t_{n_i-1})
\end{equation}
Suppose the states comprising the set $C_j$ are $c^{(j)}_1, \ldots,
c^{(j)}_{m_j}$.  This likelihood can also be written as a matrix product, say,
\begin{equation}
\label{eq:multi:lik:cens:matrix}
L_i = \mathbf{1}^T P^{1,2} P^{2,3} \ldots P^{n_i-1, n_i} \mathbf{1}
\end{equation}
where $P^{j-1, j}$ is an $m_{j-1} \times m_j$ matrix with $(r,s)$ entry
$p_{c^{(j-1)}_r c^{(j)}_s}(t_j - t_{j-1})$, and $\mathbf{1}$ is the vector of
ones.

The \Rpackage{msm} package allows multi-state models to be fitted to data from
processes with arbitrary observation times (panel data), exactly-observed
transition times, exact death times and censored states, or a mixture of these
schemes.

\subsection{Covariates}
\label{sec:multi:covariates}

The relation of constant or time-varying characteristics of individuals to
their transition rates is often of interest in a multi-state model.
Explanatory variables for a particular transition intensity can be
investigated by modelling the intensity as a function of these variables.
Marshall and Jones \cite{retino2} described a form of a {\em proportional
  hazards} model, where the transition intensity matrix elements $q_{rs}$
which are of interest can be replaced by
\[ q_{rs}(z(t)) = q_{rs}^{(0)} \exp(\beta_{rs}^T z(t)) \]
The new $Q$ is then used to determine the likelihood.  If the covariates
$z(t)$ are time dependent, the contributions to the likelihood of the form
$p_{rs} (t - u)$ are replaced by
\[ p_{rs}(t - u, z(u)) \]
although this requires that the value of the covariate is known at every
observation time $u$.  Sometimes covariates are observed at different times to
the main response, for example recurrent disease events or other biological
markers.  In some of these cases it could be assumed that the covariate is a
step function which remains constant between its observation times.  If the
main response (the state of the Markov process) is not observed at the times
when the covariate changes, it could be considered as a ``censored'' state (as
in Section \ref{sec:multi:likelihood}).

The \Rpackage{msm} package allows individual-specific or time-dependent
covariates to be fitted to transition intensities.  In order to calculate
transition probabilities $P(t)$ on which the likelihood depends,
time-dependent covariates are assumed to be piecewise-constant.
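As a small numeric illustration of the proportional-intensities model above
(with hypothetical values):
\begin{Scode}
q0   <- 0.25     # hypothetical baseline intensity q_rs^(0)
beta <- log(2)   # log hazard ratio: this covariate doubles the rate
q0 * exp(beta * c(0, 1))  # intensity at covariate values z = 0 and z = 1
\end{Scode}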
Models whose intensities change with time are called
\emph{time-inhomogeneous}.  An important special case handled by
\Rpackage{msm} is the model in which intensities change at a series of times
common to each individual.

Marshall and Jones \cite{retino2} described likelihood ratio and Wald tests
for covariate selection and testing hypotheses, for example whether the effect
of a covariate is the same for all forward transitions in a disease
progression model, or whether the effect on backward transitions is equal to
minus the effect on forward transitions.

\subsection{Hidden Markov models}
\label{sec:hmm}

In a {\em hidden Markov model} (HMM) the states of the Markov chain are not
observed.  The observed data are governed by some probability distribution
(the \emph{emission} distribution) conditionally on the unobserved state.  The
evolution of the underlying Markov chain is governed by a transition intensity
matrix $Q$ as before (Figure \ref{fig:multi:hidden}).  Hidden Markov models
are mixture models, where observations are generated from a certain number of
unknown distributions.  However, the distribution changes through time
according to the states of a hidden Markov chain.  This class of model is
commonly used in areas such as speech and signal processing
\cite{juang:rabiner} and the analysis of biological sequence data
\cite{biolog:seq}.  In engineering and biological sequencing applications, the
Markov process usually evolves over an equally-spaced, discrete `time' space.
Therefore most of the theory of HMM estimation was developed for discrete-time
models.

HMMs have less frequently been used in medicine, where continuous-time
processes are often more suitable.  A disease process evolves in continuous
time, and patients are often monitored at irregular and differing intervals.
These models are suitable for estimating population quantities for chronic
diseases which have a natural staged interpretation, but which can only be
diagnosed by an error-prone marker.  The \Rpackage{msm} package can fit
continuous-time hidden Markov models with a variety of emission
distributions.

\begin{figure}[htbp]
  \begin{center}
    \scalebox{0.6}{\rotatebox{270}{\includegraphics{figures/hidden}}}
    \caption{A hidden Markov model in continuous time.  Observed states are
      generated conditionally on an underlying Markov process.}
    \label{fig:multi:hidden}
  \end{center}
\end{figure}

\subsubsection{Misclassification models}

An example of a hidden Markov model is a multi-state model with
misclassification.  Here the observed data are states, assumed to be
misclassifications of the true, underlying states.

For example, consider a disease progression model with at least a disease-free
and a disease state.  Screening for the presence of the disease can sometimes
be subject to error.  Then the Markov disease process $S_i(t)$ for individual
$i$ is not observed directly, but through realisations $O_i(t)$.  The quality
of a diagnostic test is often measured by the probabilities that the true and
observed states are equal, $Pr(O_i(t) = r | S_i(t) = r)$.  Where $r$
represents a `positive' disease state, this is the {\em sensitivity}, or the
probability that a true positive is detected by the test.  Where $r$
represents a `negative' or disease-free state, this represents the {\em
  specificity}, or the probability that, given the condition of interest is
absent, the test produces a negative result.
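To fix ideas before the formal definition in the next section, sensitivity and
specificity form the diagonal of a matrix of classification probabilities
(a sketch with hypothetical values):
\begin{Scode}
## Rows = true state, columns = observed state (hypothetical values):
## state 1 = disease-free, state 2 = diseased
E <- rbind(c(0.95, 0.05),   # specificity 0.95 for the disease-free state
           c(0.10, 0.90))   # sensitivity 0.90 for the disease state
rowSums(E)                  # each row sums to 1
\end{Scode}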
As an extension to the simple multi-state model described in section
\ref{sec:multistate}, the \Rpackage{msm} package can fit a general multi-state
model with misclassification.  For patient $i$, observation time $t_{ij}$,
observed states $O_{ij}$ are generated conditionally on true states $S_{ij}$
according to a {\em misclassification matrix} $E$.  This is an $n \times n$
matrix, whose $(r,s)$ entry is
\begin{equation}
\label{eq:misc}
e_{rs} = Pr(O(t_{ij}) = s | S(t_{ij}) = r),
\end{equation}
which we first assume to be independent of time $t$.  Analogously to the
entries of $Q$, some of the $e_{rs}$ may be fixed to reflect knowledge of the
diagnosis process.  For example, the probability of misclassification may be
negligibly small for non-adjacent states of disease.  Thus the progression
through underlying states is governed by the transition intensity matrix $Q$,
while the observation process of the underlying states is governed by the
misclassification matrix $E$.

To investigate explanatory variables $w(t)$ for the probability $e_{rs}$ of
misclassification as state $s$ given underlying state $r$, a multinomial
logistic regression model can be used:
\begin{equation}
\label{eq:misccovs}
\log \frac{e_{rs}(t)}{e_{rs_0}(t)} = \gamma_{rs}^T w(t).
\end{equation}
where $s_0$ is some baseline state, usually chosen as the underlying state, or
the state with the highest probability (for numerical stability).

\subsubsection{General hidden Markov model}
\label{sec:hmm:general}

Consider now a general hidden Markov model in continuous time.  The true state
of the model $S_{ij}$ evolves as an unobserved Markov process.  Observed data
$y_{ij}$ are generated conditionally on the true states $S_{ij} = 1, 2,
\ldots, n$ according to a set of distributions $f_1(y | \theta_1, \gamma_1)$,
$f_2(y | \theta_2, \gamma_2)$, $\ldots$, $f_n(y | \theta_n, \gamma_n)$,
respectively.  $\theta_r$ is a vector of parameters for the state $r$
distribution.  One or more of these parameters may depend on explanatory
variables through a link-transformed linear model with coefficients
$\gamma_r$.

\subsubsection{Likelihood for general hidden Markov models}

A type of EM algorithm known as the {\em Baum-Welch} or {\em forward-backward}
algorithm \cite{baum:petrie66, baum:petrie70} is commonly used for hidden
Markov model estimation in discrete-time applications.  See, for example,
Durbin \etal \cite{biolog:seq}, Albert \cite{albert99}.  A generalisation of
this algorithm to continuous time was described by Bureau \etal
\cite{bureau:hughes:shiboski:00}.

The \Rpackage{msm} package uses a direct method of calculating likelihoods for
continuous-time models based on matrix products.  This type of method has been
described by Macdonald and Zucchini \cite[pp.\ 77--79]{macdonald:zucchini},
Lindsey \cite[p.\ 73]{lindsey:rm} and Guttorp \cite{guttorp}.  Satten and
Longini \cite{sattlong} used this method to calculate likelihoods for a hidden
Markov model in continuous time with observations of a continuous marker
generated conditionally on underlying discrete states.

Patient $i$'s contribution to the likelihood is
\begin{eqnarray}
\label{eq:multi:hiddencontrib}
L_i & = & Pr(y_{i1}, \ldots, y_{in_i}) \\
& = & \sum Pr(y_{i1}, \ldots, y_{in_i} | S_{i1}, \ldots, S_{in_i})
Pr(S_{i1}, \ldots, S_{in_i}) \nonumber
\end{eqnarray}
where the sum is taken over all possible paths of underlying states $S_{i1},
\ldots, S_{in_i}$.  Assume that the observed states are conditionally
independent given the values of the underlying states.
Also assume the Markov property, $Pr(S_{ij}|S_{i,j-1}, \ldots, S_{i1}) =
Pr(S_{ij}|S_{i,j-1})$.

Then the contribution $L_i$ can be written as a product of matrices, as
follows.  To derive this matrix product, decompose the overall sum in equation
\ref{eq:multi:hiddencontrib} into sums over each underlying state.  The sum is
accumulated over the unknown first state, the unknown second state, and so on
until the unknown final state:
\begin{eqnarray}
\label{eq:multi:hiddenlik}
L_i & = & \sum_{S_{i1}} Pr(y_{i1}|S_{i1})Pr(S_{i1}) \sum_{S_{i2}}
Pr(y_{i2}|S_{i2})Pr(S_{i2}|S_{i1}) \sum_{S_{i3}} Pr(y_{i3}|S_{i3})
Pr(S_{i3}|S_{i2}) \nonumber \\
& & \ldots \sum_{S_{in_i}} Pr(y_{in_i}|S_{in_i}) Pr(S_{in_i}|S_{i,n_i-1})
\end{eqnarray}
where $Pr(y_{ij}|S_{ij})$ is the emission probability density.  For
misclassification models, this is the misclassification probability
$e_{S_{ij} O_{ij}}$.  For general hidden Markov models, this is the
probability density $f_{S_{ij}}(y_{ij}|\theta_{S_{ij}},\gamma_{S_{ij}})$.
$Pr(S_{i,j+1}|S_{ij})$ is the $(S_{ij}, S_{i,j+1})$ entry of the Markov chain
transition matrix $P(t)$ evaluated at $t = t_{i,j+1} - t_{ij}$.

Let $f$ be the vector whose $r$th element is the product of the initial state
occupation probability $Pr(S_{i1}=r)$ and $Pr(y_{i1}| r)$, and let $\mathbf 1$
be a column vector consisting of ones.  For $j = 2, \ldots, n_i$ let $T_{ij}$
be the $R \times R$ matrix (where $R$ is the number of states) with $(r,s)$
entry
\[ Pr(y_{ij}| s) p_{rs} (t_{ij} - t_{i,j-1}) \]
Then subject $i$'s likelihood contribution is
\begin{equation}
\label{eq:multi:hidden:matprod}
L_i = f T_{i2} T_{i3} \ldots T_{in_i} \mathbf 1
\end{equation}
If $S(t_{j}) = D$ is an absorbing state such as death, measured without error,
whose entry time is known exactly, then the contribution to the likelihood is
summed over the unknown state at the previous instant before death.  The
$(r,s)$ entry of $T_{ij}$ is then
\[ p_{rs}(t_{j} - t_{j-1}) q_{s, D} \]

Section \ref{sec:fitting:hmm:misc} describes how to fit multi-state models
with misclassification using the \Rpackage{msm} package, and Section
\ref{sec:fitting:hmm:general} describes how to apply general hidden Markov
models.

\subsubsection{Example of a general hidden Markov model}
\label{sec:hmm:example:fev}

Jackson and Sharples \cite{jackson:sharples:2002} described a model for
FEV$_1$ (forced expiratory volume in 1 second) in recipients of lung
transplants.  These patients were at risk of BOS (bronchiolitis obliterans
syndrome), a progressive, chronic deterioration in lung function.  In this
example, BOS was modelled as a discrete, staged process, a model of the form
illustrated in Figure \ref{fig:disease}, with 4 states.  State 1 represents
absence of BOS.  State 1 BOS is roughly defined as a sustained drop below
80\% of baseline FEV$_1$, while state 2 BOS is a sustained drop below 65\% of
baseline.  FEV$_1$ is measured as a percentage of a baseline value for each
individual, determined in the first six months after transplant, and assumed
to be 100\% of baseline at six months.

As FEV$_1$ is subject to high short-term variability due to acute events and
natural fluctuations, the exact BOS state at each observation time is
difficult to determine.  Therefore, a hidden Markov model for FEV$_1$,
conditionally on underlying BOS states, was used to model the natural history
of the disease.  Discrete states are appropriate as onset is often sudden.
\paragraph{Model 1} Jackson \cite{my:phd} considered models for these data
where FEV$_1$ was Normally distributed, with an unknown mean and variance
conditionally on each state (\ref{eq:fev:normal}).  This model seeks the most
likely location for the within-state FEV$_1$ means.
\begin{equation}
\label{eq:fev:normal}
y_{ij} | \{ S_{ij} = k\} \sim N(\mu_k + \beta x_{ij}, \sigma^2_k)
\end{equation}

\paragraph{Model 2} Jackson and Sharples \cite{jackson:sharples:2002} used a
more complex two-level model for FEV$_1$ measurements.  Level 1
(\ref{eq:fev:level1}) represents the short-term fluctuation error of the
marker around its underlying continuous value $y^{hid}_{ij}$.  Level 2
(\ref{eq:fev:level2}) represents the distribution of $y^{hid}_{ij}$
conditionally on each underlying state, as follows.
\begin{equation}
\label{eq:fev:level1}
y_{ij} | y^{hid}_{ij} \qquad \sim N ( y^{hid}_{ij} + \beta x_{ij} , \sigma^2_\epsilon)
\end{equation}
\begin{equation}
\label{eq:fev:level2}
y^{hid}_{ij} | S_{ij} \quad \sim \quad \left\{
\begin{array}{cll}
\mbox{State} & \mbox{Three state model} & \mbox{Four state model} \\
S_{ij} = 0 & N(\mu_0, \sigma^2_0)I_{[80, \infty)} & N(\mu_0, \sigma^2_0)I_{[80, \infty)} \\
S_{ij} = 1 & N(\mu_1, \sigma^2_1)I_{(0, 80)} & Uniform(65, 80) \\
S_{ij} = 2 & \mbox{(death)} & N(\mu_2, \sigma^2_2)I_{(0, 65)} \\
S_{ij} = 3 & & \mbox{(death)}
\end{array}
\right .
\end{equation}
Integrating over $y^{hid}_{ij}$ gives an explicit distribution for $y_{ij}$
conditionally on each underlying state (given in Section
\ref{sec:fitting:hmm:general}, Table \ref{tab:hmm:dists}).  Similar
distributions were originally applied by Satten and Longini \cite{sattlong} to
modelling the progression through discrete, unobserved HIV disease states
using continuous CD4 cell counts.  The \Rpackage{msm} package includes
density, quantile, cumulative distribution and random number generation
functions for these distributions.

In both models 1 and 2, the term $\beta x_{ij}$ models the short-term
fluctuation of the marker in terms of acute events.  $x_{ij}$ is an indicator
for the occurrence of an acute rejection or infection episode within 14 days
of the observation of FEV$_1$.

Section \ref{sec:fitting:hmm:general} describes how these and more general
hidden Markov models can be fitted with the \Rpackage{msm} package.

\clearpage

\section{Fitting multi-state models with {\tt msm}}

<<echo=FALSE>>=
options(width = 60)
@

\Rpackage{msm} is a package of functions for multi-state modelling using the R
statistical software.  The \Rfunction{msm} function itself implements
maximum-likelihood estimation for general multi-state Markov or hidden Markov
models in continuous time.  We illustrate its use with a set of data from
monitoring heart transplant patients.  Throughout this section
``\textsl{\texttt{>}}'' indicates the R command prompt,
\textsl{\texttt{slanted typewriter}} text indicates R commands, and
\texttt{typewriter} text R output.

\subsection{Installing {\tt msm}}
\label{sec:installing}

The easiest way to install the \Rpackage{msm} package on a computer connected
to the Internet is to run the R command:
\begin{Scode}
install.packages("msm")
\end{Scode}
This downloads \Rpackage{msm} from the CRAN archive of contributed R packages
(\texttt{cran.r-project.org} or one of its mirrors) and installs it to the
default R system library.
To install to a different location, for example if you are a normal user with
no administrative privileges, create a directory in which R packages are to be
stored, say, \texttt{/your/library/dir}, and run
\begin{Scode}
install.packages("msm", lib='/your/library/dir')
\end{Scode}
After \Rpackage{msm} has been installed, its functions can be made visible in
an R session by
<<>>=
library(msm)
@
or, if it has been installed into a non-default library,
\begin{Scode}
library(msm, lib.loc='/your/library/dir')
\end{Scode}

\subsection{Getting the data in}
\label{sec:datain}

The data are specified as a series of observations, grouped by patient.  At
minimum there should be a data frame with variables indicating
\begin{itemize}
\item the time of the observation,
\item the observed state of the process.
\end{itemize}
If the data do not also contain
\begin{itemize}
\item the subject identification number,
\end{itemize}
then all the observations are assumed to be from the same subject.  The
subject ID does not need to be numeric, but data must be grouped by subject,
and observations must be ordered by time within subjects.  If the model
includes variables with missing values, then the corresponding observations
are omitted by \Rfunction{msm} with a warning.  As in any statistical
analysis, if you have missing data, it is worth checking that omitting them
will not bias the results.

An example data set, taken from monitoring a set of heart transplant
recipients, is provided with \Rpackage{msm}.  (Note: since \Rpackage{msm}
version 1.3, the command \Rfunction{data(cav)} is no longer needed to load the
data --- it is now ``lazy-loaded'' when required).  Sharples \etal
\cite{my:cav} studied the progression of coronary allograft vasculopathy
(CAV), a post-transplant deterioration of the arterial walls, using these
data.  Risk factors and the accuracy of the screening test were investigated
using multi-state Markov and hidden Markov models.

The first three patient histories are shown below.  There are 622 patients in
all.  \Robject{PTNUM} is the subject identifier.  Approximately each year
after transplant, each patient has an angiogram, at which CAV can be
diagnosed.  The result of the test is in the variable \Robject{state}, with
possible values 1, 2, 3 representing CAV-free, mild CAV and moderate or severe
CAV respectively.  A value of 4 is recorded at the date of death.
\Robject{years} gives the time of the test in years since the heart
transplant.  Other variables include \Robject{age} (age at screen),
\Robject{dage} (donor age), \Robject{sex} (0=male, 1=female), \Robject{pdiag}
(primary diagnosis, or reason for transplant --- IHD represents ischaemic
heart disease, IDC represents idiopathic dilated cardiomyopathy),
\Robject{cumrej} (cumulative number of rejection episodes), and
\Robject{firstobs}, an indicator which is 1 when the observation corresponds
to the patient's transplant (the first observation), and 0 when the
observation corresponds to a later angiogram.

<<>>=
cav[1:21,]
@

A useful way to summarise multi-state data is as a frequency table of pairs of
consecutive states.  This counts, over all individuals, for each state $r$ and
$s$, the number of times an individual had an observation of state $r$
followed by an observation of state $s$.  The function
\Rfunction{statetable.msm} can be used to produce such a table, as follows,
<<>>=
statetable.msm(state, PTNUM, data=cav)
@
Thus there were 148 CAV-free deaths, 48 deaths from state 2, and 55 deaths
from state 3.
On only four occasions was there an observation of severe CAV followed by an
observation of no CAV.

\subsection{Specifying a model}
\label{sec:specifying:model}

We now specify the multi-state model to be fitted to the data.  A model is
governed by a transition intensity matrix $Q$.  For the heart transplant
example, there are four possible states through which the patient can move,
corresponding to CAV-free, mild/moderate CAV, severe CAV and death.  We assume
that the patient can advance or recover from consecutive states while alive,
and die from any state.  Thus the model is illustrated by Figure
\ref{fig:disease} with four states, and we have
\[ Q =
\left(
\begin{array}{llll}
-(q_{12} + q_{14}) & q_{12} & 0 & q_{14}\\
q_{21} & -(q_{21}+q_{23}+q_{24}) & q_{23} & q_{24}\\
0 & q_{32} & -(q_{32}+q_{34}) & q_{34}\\
0 & 0 & 0 & 0 \\
\end{array}
\right )
\]
It is important to remember that this defines which \emph{instantaneous}
transitions can occur in the Markov process, and that the data are
\emph{snapshots} of the process (see section \ref{sec:arbitr-observ-times}).
Although there were 44 occasions on which a patient was observed in state 1
followed by state 3, we can still have $q_{13}=0$.  The underlying model
specifies that the patient must have passed through state 2 in between, rather
than jumping straight from 1 to 3.  If your data represent the exact and
complete transition times of the process, then you must specify
\Rfunarg{exacttimes=TRUE} or \Rfunarg{obstype=2} in the call to
\Rfunction{msm}.

To tell \Rfunction{msm} what the allowed transitions of our model are, we
define a matrix of the same size as $Q$, containing zeroes in the positions
where the entries of $Q$ are zero.  All other positions contain an initial
value for the corresponding transition intensity.  The diagonal entries
supplied in this matrix do not matter, as the diagonal entries of $Q$ are
defined as minus the sum of all the other entries in the row.  This matrix
will eventually be used as the \Rfunarg{qmatrix} argument to the
\Rfunction{msm} function.  For example,
<<>>=
Q <- rbind ( c(0, 0.25, 0, 0.25),
             c(0.166, 0, 0.166, 0.166),
             c(0, 0.25, 0, 0.25),
             c(0, 0, 0, 0) )
@
Fitting the model is a process of finding values of the seven unknown
transition intensities: $q_{12}$, $q_{14}$, $q_{21}$, $q_{23}$, $q_{24}$,
$q_{32}$, $q_{34}$, which maximise the likelihood.

\subsection{Specifying initial values}
\label{sec:inits}

The likelihood is maximised by numerical methods, which need a set of initial
values to start the search for the maximum.  For reassurance that the true
maximum likelihood estimates have been found, models should be run repeatedly
starting from different initial values.  However a sensible choice of initial
values can be important for unstable models with flat or multi-modal
likelihoods.  For example, the transition rates for a model with
misclassification could be initialised at the corresponding estimates for an
approximating model without misclassification.  Initial values for a model
without misclassification could be set by supposing that transitions between
states take place only at the observation times.  If we observe $n_{rs}$
transitions from state $r$ to state $s$, and a total of $n_r$ transitions from
state $r$, then $-q_{rs} / q_{rr}$ can be estimated by $n_{rs} / n_r$.  Then,
given a total of $T_r$ years spent in state $r$, the mean sojourn time $-1 /
q_{rr}$ can be estimated as $T_r / n_r$.  Thus, $n_{rs} / T_r$ is a crude
estimate of $q_{rs}$, as illustrated in the sketch below.
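A small worked version of this arithmetic, with hypothetical counts:
\begin{Scode}
## Hypothetical counts: 20 transitions observed out of state 1,
## of which 15 were to state 2, over 100 person-years in state 1
n_12 <- 15; n_1 <- 20; T_1 <- 100
T_1 / n_1    # crude mean sojourn time in state 1: 5 years
n_12 / T_1   # crude q_12 = (n_12 / n_1) * (n_1 / T_1) = 0.15
\end{Scode}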
Such default initial values, computed in this way from the data, can be used
by supplying \Rfunarg{gen.inits=TRUE} in the call to \Rfunction{msm} below,
along with a \Rfunarg{qmatrix} whose non-zero entries represent the allowed
transitions of the model.  Alternatively the function
\Rfunction{crudeinits.msm} could be used to get this matrix of initial values
explicitly as follows.  These methods are only available for non-hidden Markov
models.
<<>>=
Q.crude <- crudeinits.msm(state ~ years, PTNUM, data=cav, qmatrix=Q)
@
However, if there are many changes of state in between the observation times,
then this crude approach may fail to give sensible initial values.  For the
heart transplant example we could also guess that the mean period in each
state before moving to the next is about 2 years, and there is an equal
probability of progression, recovery or death.  This gives $q_{rr} = - 0.5$
for $r = 1, 2, 3$, and $q_{12} = q_{14} = 0.25$, $q_{21} = q_{23} = q_{24} =
0.166$, $q_{32} = q_{34} = 0.25$, and the initial value matrix \Robject{Q}
shown above, which we now use to fit the model.

\subsection{Running \Rfunction{msm}}
\label{sec:running}

To fit the model, call the \Rfunction{msm} function with the appropriate
arguments.  For our running example, we have defined a data set \Robject{cav},
a matrix \Robject{Q} indicating the allowed transitions, and initial values.
We are ready to run \Rfunction{msm}.

\paragraph{Model 1: simple bidirectional model}
<<>>=
cav.msm <- msm( state ~ years, subject=PTNUM, data = cav,
                qmatrix = Q, deathexact = 4)
@
In this example the day of death is assumed to be recorded exactly, as is
usual in studies of chronic diseases.  At the previous instant before death
the state of the patient is unknown.  Thus we specify \Rfunarg{deathexact =
  4}, to indicate to \Rfunction{msm} that the entry times into state 4 are
observed in this manner.  If the model had five states, and states 4 and 5
were two competing causes of death with times recorded exactly in this way,
then we would specify \Rfunarg{deathexact = c(4,5)}.

By default, the data are assumed to represent snapshots of the process at
arbitrary times.  However, observations can also represent exact times of
transition, ``exact death times'', or a mixture of these.  See the
\Rfunarg{obstype} argument to \Rfunction{msm}.

While the \Rfunction{msm} function runs, it searches for the maximum of the
likelihood of the unknown parameters.  Internally, it uses the R function
\Rfunction{optim} to minimise the minus log-likelihood.  When the data set,
the model, or both, are large, then this may take a long time.  It can then be
useful to see the progress of the optimisation algorithm.  To do this, we can
specify a \Rfunarg{control} argument to \Rfunction{msm}, which is passed
internally to the \Rfunction{optim} function.  For example \texttt{control =
  list(trace=1, REPORT=1)}.  See the help page for \Rfunction{optim},
<<eval=FALSE>>=
help(optim)
@
for more options to control the optimisation.\footnote{Note that since version
  1.3.2, \Rfunarg{method="BFGS"} is the default optimisation algorithm in
  \Rfunction{msm}, since it can use analytic derivatives, which are available
  for most models.}

When completed, the \Rfunction{msm} function returns a value.  This value is a
list of the important results of the model fitting, including the parameter
estimates and their covariances.  To keep these results for post-processing,
we store them in an R object, here called \Robject{cav.msm}.
When running several similar \Rfunction{msm} models, it is recommended to
store the respective results in informatively-named objects.

\subsection{Showing results}

To show the maximum likelihood estimates and 95\% confidence intervals, type
the name of the fitted model object at the R command prompt.\footnote{This is
  equivalent to typing \texttt{print.msm(cav.msm)}.  The function
  \Rfunction{print.msm} formats the important information in the model object
  for printing on the screen.}  The confidence level can be changed using the
\Rfunarg{cl} argument to \Rfunction{msm}.
<<>>=
cav.msm
@
From the estimated intensities, we see patients are three times as likely to
develop symptoms as to die without symptoms (transitions from state 1).  After
disease onset (state 2), progression to severe symptoms (state 3) is 50\% more
likely than recovery.  Once in the severe state, death is more likely than
recovery, and a mean of $1 / 0.44 = 2.3$ years is spent in state 3 before
death or recovery.

Section \ref{sec:extractor} describes various functions that can be used to
obtain summary information from the fitted model.

\subsection{Covariates on the transition rates}
\label{sec:msm:covariates}

We now model the effect of explanatory variables on the rates of transition,
using a proportional intensities model.  Now we have an intensity matrix
$Q(z)$ which depends on a covariate vector $z$.  For each entry of $Q(z)$, the
transition intensity for patient $i$ at observation time $j$ is
$q_{rs}(z_{ij}) = q_{rs}^{(0)} \exp(\beta_{rs}^T z_{ij})$.  The covariates $z$
are specified through the \Rfunarg{covariates} argument to \Rfunction{msm}.
If $z_{ij}$ is time-dependent, we assume it is constant in between the
observation times of the Markov process.  \Rfunction{msm} calculates the
probability for a state transition from times $t_{i,j-1}$ to $t_{ij}$ using
the covariate value at time $t_{i,j-1}$.

We consider a model with just one covariate, female sex.  Out of the 622
transplant recipients, 535 are male and 87 are female.  By default, all linear
covariate effects $\beta_{rs}$ are initialised to zero.  To specify different
initial values, use a \Rfunarg{covinits} argument, as described in
\Rfunction{help(msm)}.  Initial values given in the \Rfunarg{qmatrix}
represent the intensities with covariate values set to their means in the
data.  In the following model, all transition intensities are modelled in
terms of sex.

\paragraph{Model 2: sex as a covariate}
<<>>=
cavsex.msm <- msm( state ~ years, subject=PTNUM, data = cav,
                   qmatrix = Q, deathexact = 4, covariates = ~ sex)
@
Printing the \Robject{msm} object now displays the estimated covariate effects
and their confidence intervals (note that since version 1.3.2 these are
\emph{hazard ratios} $\exp(\beta_{rs})$, not \emph{log hazard ratios}
$\beta_{rs}$ as in previous versions).
<<>>=
cavsex.msm
@
The sizes of the confidence intervals for some of the hazard ratios suggest
there is no information in the data about the corresponding covariate effects,
leading to a likelihood that is a flat function of these parameters, and this
model should be simplified.  The first column shown in the output is the
estimated transition intensity matrix $q_{rs}(z) = q_{rs}^{(0)}
\exp(\beta_{rs}^T z)$ with the covariate $z$ set to its mean value in the
data.  This represents an average intensity matrix for the population of 535
male and 87 female patients.
To extract separate intensity matrices for male and female patients ($z = 0$
and $1$ respectively), use the function \Rfunction{qmatrix.msm}, as shown
below.  This and similar summary functions will be described in more detail in
section \ref{sec:extractor}.
<<>>=
qmatrix.msm(cavsex.msm, covariates=list(sex=0)) # Male
qmatrix.msm(cavsex.msm, covariates=list(sex=1)) # Female
@
Since \Rpackage{msm} version 1.2.3, different transition rates may be easily
modelled on different covariates by specifying a named list of formulae as the
\Rfunarg{covariates} argument.  Each element of the list has a name
identifying the transition.  In the model below, the transition rate from
state 1 to state 2 and the rate from state 1 to state 4 are each modelled on
sex as a covariate, but no other intensities have covariates on them.

\paragraph{Model 2a: transition-specific covariates}
<<eval=FALSE>>=
cavsex.msm <- msm( state ~ years, subject=PTNUM, data = cav,
                   qmatrix = Q, deathexact = 4,
                   covariates = list("1-2" = ~ sex, "1-4" = ~ sex) )
@
We may also want to constrain the effect of a covariate to be equal for
certain transition rates, to reduce the number of parameters in the model, or
to investigate hypotheses on the covariate effects.  A \Rfunarg{constraint}
argument can be used to indicate which of the transition rates have common
covariate effects.

\paragraph{Model 3: constrained covariate effects}
<<eval=FALSE>>=
cav3.msm <- msm( state ~ years, subject=PTNUM, data = cav,
                 qmatrix = Q, deathexact = 4,
                 covariates = ~ sex,
                 constraint = list(sex=c(1,2,3,1,2,3,2)) )
@
This constrains the effect of sex to be equal for the progression rates
$q_{12}, q_{23}$, equal for the death rates $q_{14}, q_{24}, q_{34}$, and
equal for the recovery rates $q_{21}, q_{32}$.  The intensity parameters are
assumed to be ordered by reading across the rows of the transition matrix,
starting at the first row: ($q_{12}, q_{14}, q_{21}, q_{23}, q_{24}, q_{32},
q_{34}$), giving constraint indicators \Rfunarg{(1,2,3,1,2,3,2)}.  Any vector
of increasing numbers can be used for the indicators.  Negative entries can be
used to indicate that some effects are equal to minus others:
\Rfunarg{(1,2,3,-1,2,3,2)} sets the fourth effect to be minus the first.

In a similar manner, we can constrain some of the baseline transition
intensities to be equal to one another, using the \Rfunarg{qconstraint}
argument.  For example, to constrain the rates $q_{12}$ and $q_{23}$ to be
equal, and $q_{24}$ and $q_{34}$ to be equal, specify \Rfunarg{qconstraint =
  c(1,2,3,1,4,5,4)}.

\subsection{Fixing parameters at their initial values}

For exploratory purposes we may want to fit a model assuming that some
parameters are fixed, and estimate the remaining parameters.  This may be
necessary in cases where there is not enough information in the data to be
able to estimate a proposed model, and we have strong prior information about
a certain transition rate.  To do this, use the \Rfunarg{fixedpars} argument
to \Rfunction{msm}.  For model 1, the following statement fixes the parameters
numbered 6, 7, that is, $q_{32}$, $q_{34}$, to their initial values (0.25 and
0.25, respectively).

\paragraph{Model 4: fixed parameters}
<<eval=FALSE>>=
cav4.msm <- msm( state ~ years, subject=PTNUM, data = cav,
                 qmatrix = Q, deathexact = 4,
                 control = list(trace=2, REPORT=1),
                 fixedpars = c(6, 7) )
@
A \Rfunarg{fixedpars} statement can also be used for fixing covariate effect
parameters to zero, that is to assume no effect of a covariate on a certain
transition rate.
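Nested models such as those above can be compared with likelihood ratio tests
using the \Rfunction{lrtest.msm} function.  A sketch, comparing model 1 (no
covariates) with model 2 (sex on all seven transition rates), assuming both
models have been fitted as shown earlier:
\begin{Scode}
## Likelihood ratio test: model 1 nested within model 2 (7 df)
lrtest.msm(cav.msm, cavsex.msm)
\end{Scode}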
\subsection{Extractor functions}
\label{sec:extractor}

We may want to extract some of the information from the \Rfunction{msm} model
fit for post-processing, for example for plotting graphs or generating summary
tables.  A set of functions is provided for extracting interesting features of
the fitted model.

\begin{description}
\item[Intensity matrices] The function \Rfunction{qmatrix.msm} extracts the
  estimated transition intensity matrix and its confidence intervals for a
  given set of covariate values, as shown in section
  \ref{sec:msm:covariates}.  Confidence intervals are calculated from the
  covariance matrix of the estimates by assuming the distribution is symmetric
  on the log scale.

  Standard errors for the intensities are also available from the object
  returned by \Rfunction{qmatrix.msm}.  These are calculated by the delta
  method.  The \Rpackage{msm} package provides a general-purpose function
  \Rfunction{deltamethod} for estimating the variance of a function of a
  random variable $X$ given the expectation and variance of $X$.  See
  \texttt{help(deltamethod)} for further details.  Bootstrap confidence
  intervals are also available for \Rfunction{qmatrix.msm} and for most output
  functions; these are often more accurate, at the cost of computational time.
  For more about bootstrapping in \Rpackage{msm}, see Section \ref{sec:boot}.

\item[Transition probability matrices] The function \Rfunction{pmatrix.msm}
  extracts the estimated transition probability matrix $P(t)$ for a given time
  $t$.  For example, for model 1, the 10-year transition probabilities are
  given by:
<<>>=
pmatrix.msm(cav.msm, t=10)
@
  Thus, a typical person in state 1, disease-free, has a probability of 0.5 of
  being dead ten years from now, a probability of 0.3 of still being
  disease-free, and probabilities of 0.1 each of being alive with
  mild/moderate or severe disease.  This assumes $Q$ is constant within the
  desired time interval.  For non-homogeneous processes, where $Q$ varies with
  time-dependent covariates but can be approximated as piecewise constant,
  there is an equivalent function \Rfunction{pmatrix.piecewise.msm}.  Consult
  its help page for further details.

  If \Rfunarg{ci="norm"} is specified, then a confidence interval is
  calculated based on drawing a random sample (default size 1000) from the
  assumed multivariate normal distribution of the maximum likelihood estimates
  and covariance matrix, and transforming.  If \Rfunarg{ci="boot"} is
  specified, then a bootstrap confidence interval for the transition
  probability matrix is calculated (see Section \ref{sec:boot}).  However,
  both of these are computationally intensive, particularly the bootstrap
  method, so no confidence interval is calculated by default.

\item[Mean sojourn times] The function \Rfunction{sojourn.msm} extracts the
  estimated mean sojourn times in each transient state $r$, for a given set of
  covariate values.  This is calculated as $-1 / \hat q_{rr}$, where $\hat
  q_{rr}$ is the $r$th diagonal entry of the estimated transition intensity
  matrix.
<<>>=
sojourn.msm(cav.msm)
@

\item[Probability that each state is next] The function \Rfunction{pnext.msm}
  extracts the matrix of probabilities $-q_{rs} / q_{rr}$ that the next state
  after state $r$ is state $s$, for each $r$ and $s$.  Together with the mean
  sojourn times, this gives a more intuitive parameterisation of a
  continuous-time Markov model than the raw transition intensities $q_{rs}$.
  Note these are different from the transition probabilities in a given time
  $t$ returned by \Rfunction{pmatrix.msm}.
<<>>=
pnext.msm(cav.msm)
@

\item[Total length of stay] Mean sojourn times describe the average period in a single stay in a state. For processes with successive periods of recovery and relapse, we may want to forecast the total time spent healthy or diseased, before death. The function \Rfunction{totlos.msm} estimates the forecasted total length of time spent in each transient state $s$ between two future time points $t_1$ and $t_2$, for a given set of covariate values. This defaults to the expected amount of time spent in each state between the start of the process (time 0, the present time) and death or a specified future time. This is obtained as
\[ L_s = \int_{t_1}^{t_2} P(t)_{r,s} dt \]
where $r$ is the state at the start of the process, which defaults to 1. This is calculated using numerical integration. For model 1, each patient is forecast to spend 8.8 years disease-free, 2.2 years with mild or moderate disease, and 1.8 years with severe disease. Bootstrap and asymptotic confidence intervals are available, as for \Rfunction{pmatrix.msm}, but are not calculated by default.

<<>>=
totlos.msm(cav.msm)
@

\item[Expected first passage times] The function \Rfunction{efpt.msm} estimates the expected time until the process first enters a given state or set of states, also called the ``hitting time''. See its help page for further details.

\item[Expected number of visits] The function \Rfunction{envisits.msm} estimates the expected number of visits to a state, computed in a similar way to the total length of stay. See its help page for further details.

\item[Ratio of transition intensities] The function \Rfunction{qratio.msm} estimates a ratio of two entries of the transition intensity matrix at a given set of covariate values, together with a confidence interval estimated assuming normality on the log scale and using the delta method. For example, we may want to estimate the ratio of the recovery rate $q_{21}$ to the corresponding rate of progression $q_{12}$ into the first state of disease. In model 1, this ratio is estimated as 1.8, so that recovery is 1.8 times as likely as progression.

<<>>=
qratio.msm(cav.msm, ind1=c(2,1), ind2=c(1,2))
@

\item[Hazard ratios for transition] The function \Rfunction{hazard.msm} gives the estimated hazard ratios corresponding to each covariate effect on the transition intensities. 95\% confidence limits are computed by assuming normality of the log-effect.

<<>>=
hazard.msm(cavsex.msm)
@

\end{description}

\paragraph{Setting covariate values} All of these extractor functions take an argument called \Rfunarg{covariates}. If this argument is omitted, for example,

<>=
qmatrix.msm(cav.msm)
@

then the intensity matrix is evaluated as $Q(\bar x)$ with all covariates set to their mean values $\bar x$ in the data. For factor (categorical) variables, the mean of the 0/1 dummy variable for each factor level is used, representing an average over all values in the data, rather than a specific factor level.

Alternatively, set \Rfunarg{covariates} to 0 to return the result $Q(0)$ with covariates set to zero. This will usually be preferable for categorical covariates, where we wish to see the result for the baseline category.

<>=
qmatrix.msm(cavsex.msm, covariates = 0)
@

Finally, the desired covariate values can be specified explicitly as a list,

<>=
qmatrix.msm(cavsex.msm, covariates = list(sex = 1))
@

Values of categorical covariates must be quoted.
For example, consider a covariate \texttt{smoke}, representing tobacco smoking status, with three levels, \texttt{NON, CURRENT, EX}, representing a non-smoker, current smoker or ex-smoker.

\begin{Scode}
qmatrix.msm(example.msm, covariates = list(age = 60, smoke = "CURRENT"))
\end{Scode}

\subsection{Survival plots}

In studies of chronic disease, an important use of multi-state models is in predicting the probability of survival for patients in increasingly severe states of disease, for some time $t$ in the future. This can be obtained directly from the transition probability matrix $P(t)$. The \Rfunction{plot} method for \Robject{msm} objects produces a plot of the expected probability of survival against time, from each transient state. Survival is defined as not entering the final absorbing state.

<>=
plot(cav.msm, legend.pos=c(8, 1))
@

This shows that the 10-year survival probability with severe CAV is approximately 0.1, as opposed to 0.3 with mild CAV and 0.5 without CAV. With severe CAV, the survival probability diminishes very quickly to around 0.3 in the first five years after transplant. The \Rfunarg{legend.pos} argument adjusts the position of the legend in case of clashes with the plot lines. A \Rfunarg{times} argument can be supplied to indicate the time interval over which to forecast survival.

A more sophisticated analysis of these data might explore competing causes of death from causes related or unrelated to the disease under study.

\subsection{Bootstrapping} \label{sec:boot}

Most of \Rpackage{msm}'s output functions present confidence intervals based on asymptotic standard errors calculated from the Hessian, or transformations of these using the delta method. The asymptotic standard errors are expected to be underestimates of the true standard errors (Cram\'er-Rao lower bound). For some output functions, such as \Rfunction{pmatrix.msm}, and functions based on \Rfunction{pmatrix.msm} such as \Rfunction{totlos.msm} and \Rfunction{prevalence.msm}, the delta method cannot be used at all to obtain standard errors. In these cases, confidence intervals can be calculated by drawing a random sample from the assumed multivariate normal distribution of the maximum likelihood estimates and covariance matrix, and transforming. However, this is still based on potentially inaccurate asymptotic theory.

The \Rpackage{msm} package provides the function \Rfunction{boot.msm} to enable bootstrap refitting of \Rfunction{msm} models, an alternative way to estimate uncertainty. For non-hidden Markov models, a bootstrap dataset is drawn by resampling pairs of consecutive states from the full data, i.e.\ \emph{transitions}. These are assumed to be independent when calculating the likelihood (Section \ref{sec:multi:likelihood}). For hidden Markov models and models with censoring, a bootstrap dataset is drawn by resampling complete series from independent subjects. The bootstrap datasets have the same number of transitions, or subjects, respectively, as the original data.

For most output extractor functions provided with \Rpackage{msm}, the option \Rfunarg{ci="boot"} is available, as a wrapper around \Rfunction{boot.msm}, to enable bootstrap confidence intervals to be calculated. But any user-defined output statistic can be bootstrapped, as follows. The function \Rfunction{boot.msm} is called with the fitted \Rfunction{msm} model as its first argument, and an R function specifying the statistic to be bootstrapped as its second argument, \texttt{stat}.
The return value from \Rfunction{boot.msm} is a list of \texttt{B} replicates (by default, \texttt{B=1000}) of the desired statistic. For example, to bootstrap the transition intensity matrix of the heart transplantation model \Robject{cav.msm}:

\begin{Scode}
q.list <- boot.msm(cav.msm, stat=function(x){qmatrix.msm(x)$estimates})
\end{Scode}

Note that for \Rfunction{boot.msm} to be able to refit the original model that produced \Robject{cav.msm}, all objects used in the original model fit (for example, in this case, \Robject{Q}) must be in the working environment. Otherwise, \Rfunction{boot.msm} will give an ``object not found'' error.

The user can then summarise these replicates by calculating empirical standard deviations or quantile-based intervals. In this example, \Robject{q.list} is a list of 1000 4$\times$4 matrices. The following code calculates the bootstrap standard error as the empirical standard deviation of the 1000 replicates, and a 95\% bootstrap confidence interval from the empirical 2.5\% and 97.5\% quantiles.

\begin{Scode}
q.array <- array(unlist(q.list), dim=c(4,4,1000))
apply(q.array, c(1,2), sd)
apply(q.array, c(1,2), function(x)quantile(x, c(0.025, 0.975)))
\end{Scode}

Note that when bootstrapping, the refits of the model to the resampled datasets may occasionally fail to converge (as discussed in Section \ref{sec:failure}), even if the original model fit did converge. In these cases, a warning is given, but \Rfunction{boot.msm} simply discards the failed dataset and moves on to the next bootstrap iteration. Unless convergence failure occurs for a large proportion of iterations, this should not affect the accuracy of the final bootstrap confidence intervals.

\subsection{Convergence failure} \label{sec:failure}

Inevitably, if over-complex models are applied to insufficient data, the parameters of the model will not be identifiable. This will result in the optimisation algorithm failing to find the maximum of the log-likelihood, or even failing to evaluate the likelihood. For example, it will commonly be inadvisable to include several covariates in a model simultaneously.

In some circumstances, the optimisation may report convergence, but fail to calculate any standard errors. In these cases, the Hessian of the log-likelihood at the reported solution is not positive definite. Thus the reported solution may be a saddle point rather than the maximum likelihood, or it may be close to the maximum.

\begin{description}
\item[Model simplification] Firstly, make sure there are not too many parameters in the model. There may not be enough information in the data on a certain transition rate. It is recommended to count all the pairs of transitions between states in successive observation times, making a frequency table of previous state against current state (function \Rfunction{statetable.msm}; see the example below), and to do this for any subgroups defined by covariates. Although the data are a series of snapshots of a continuous-time process, and the actual transitions take place in between the observation times, this type of table may still be helpful. If there are not many observed `transitions' from state 2 to state 4, for example, then the data may be insufficient to estimate $q_{24}$.

For a staged disease model (Figure \ref{fig:disease}), the number of disease states should be low enough that all transition rates can be estimated. Consecutive states of disease severity should be merged if necessary.
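For example, for the heart transplant data used throughout this section, the frequency table of pairs of consecutive observed states could be produced as follows (a sketch, assuming the \Robject{cav} data are loaded; output not shown).

\begin{Scode}
## frequency table of previous state against current state
statetable.msm(state, PTNUM, data = cav)
\end{Scode}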
If it is realistic, consider applying constraints on the intensities or the covariate effects, so that the parameters are equal for certain transitions, or zero for certain transitions.

Be careful to use an observation scheme and transition matrix appropriate to your data (see Section \ref{sec:arbitr-observ-times}). By default, \Rfunction{msm} assumes that the data represent snapshots of the process, and the true state is unknown between observation times. In such circumstances, it is rarely feasible to estimate an intensity matrix with \emph{instantaneous} transitions allowed between every pair of states. Estimation is easier if the complete course of the process is known (\Rfunarg{exacttimes = TRUE} in the call to \Rfunction{msm}). Understand the difference between \emph{instantaneous} and \emph{interval} transitions: although individuals may be in state 1 at time $t_r$, and state 3 at time $t_{r+1}$, this does not mean that instantaneous transitions from 1 to 3 should be permitted.

\item[Initial values] Make sure that a sensible set of initial values has been chosen. The optimisation may only converge within a limited range of `informative' initial values. It is also sensible to fit the model with several different sets of initial values, to ensure that the estimation has converged to a global rather than a local optimum.

\item[Scaling] It is often necessary to apply a scaling factor to normalise the likelihood (\Rfunarg{fnscale}), or certain individual parameters (\Rfunarg{parscale}). This may prevent overflow or underflow problems within the optimisation. For example, if the value of the $-2 \times$ log-likelihood is around 5000, then the following option leads to a minimisation of the $-2 \times$ log-likelihood on an approximate unit scale: \Rfunarg{control = list(fnscale = 5000)}. Since version 1.4.1, \Rfunarg{fnscale} is applied automatically, set to minus twice the log-likelihood at the initial values, unless the user has already supplied it.

It is also advisable to analyse all variables, including covariates and the time unit, on a roughly normalised scale: for example, working in terms of a time unit of months or years instead of days, when the data range over thousands of days.

\item[Convergence criteria] ``False convergence'', in which \Rfunction{optim()} reports convergence of the optimisation but the Hessian is not positive definite, can sometimes be solved by tightening the criterion for reporting convergence (\Rfunarg{reltol}, which defaults to \texttt{1e-08}). For example, \Rfunarg{control = list(reltol = 1e-16)}. Alternatively, consider using smaller step sizes for the numerical approximation to the gradient used in calculating the Hessian. This is given by the control parameter \Rfunarg{ndeps}. For example, for a model with 5 parameters, \Rfunarg{control = list(ndeps = rep(1e-6, 5))}.

\item[Choice of algorithm] By default, since version 1.3.2, \Rfunction{msm} uses the BFGS method of \Rfunction{optim}, which makes use of analytic derivatives. Analytic derivatives are available for all models in \Rpackage{msm}, apart from hidden Markov models with unknown initial state probabilities, misclassification models with equality constraints on misclassification probabilities, and truncated or measurement-error outcome distributions. This speeds up optimisation.
Alternative algorithms are available, such as \Rfunarg{method = "CG"}, or the \Rfunction{nlm} R function via \Rfunarg{msm(..., opt.method = "nlm", ...)}. Note also that the Fisher scoring method, available for non-hidden Markov models for panel data via \Rfunarg{msm(..., opt.method = "fisher", ...)}, is expected to be faster than the generic methods, but less robust to bad initial values. Since version 1.3.2, \Rpackage{msm} can also use \Rfunarg{method = "bobyqa"} from the \Rpackage{minqa} package, a fast derivative-free method.

\item[\Rfunction{optim} ``function cannot be evaluated at initial parameters''] To diagnose this problem, run \Rfunction{msm} again with \Rfunarg{fixedpars=TRUE} set, to calculate the $-2 \times$ log-likelihood at the initial values. This will probably be \Robject{Inf}. To show the contributions of individual subjects to the overall log-likelihood, call \Rfunction{logLik.msm(x, by.subject=TRUE)}, where \Robject{x} is the fitted model object. If only a few subjects give an infinite log-likelihood, then you can check whether their state histories are particularly unusual and conflict with the model. For example, they might appear to make unusually large jumps between states in short periods of time. For models with misclassification, note that the default true initial state distribution \Rfunarg{initprobs} puts all individuals in true state 1 at their first observation. If someone actually starts in a much higher state, this may result in an infinite log-likelihood, and changing \Rfunarg{initprobs} would be sensible.
\end{description}

\subsection{Model assessment} \label{sec:model-assessment}

\paragraph{Observed and expected prevalence} To compare the relative fit of two nested models, it is easy to compare their likelihoods. However, it is not always easy to determine how well a fitted multi-state model describes an irregularly-observed process. Ideally we would like to compare observed data with fitted or expected data under the model.

If there were times at which all individuals were observed, then the fit of the expected numbers in each state, or {\em prevalences}, could be assessed directly at those times. Otherwise, some approximations are necessary. We could assume that an individual's state at an arbitrary time $t$ was the same as the state at their previous observation time. This might be fairly accurate if observation times are close together. This approach is taken by the function \Rfunction{prevalence.msm}, which constructs a table of observed and expected numbers and percentages of individuals in each state at a set of times.

A set of expected counts can be produced if the process begins at a common time for all individuals. Suppose that at this time each individual is in the same known state $r_0$. Then, given $n(t)$ individuals under observation at time $t$, the expected number of individuals in state $r$ at time $t$ is $n(t) P(t)_{r_0,r}$. If the covariates on which $P(t)$ depends vary between individuals, then this can be averaged over the covariates observed in the data.

For example, we calculate the observed and expected numbers and percentages at two-yearly intervals up to 20 years after transplant, for the heart transplant model \Robject{cav.msm}. The number of individuals still alive and under observation decreases from 622 to 251 at year 20. The observed and expected percentages are plotted against time.
<<>>=
options(digits=3)
prevalence.msm(cav.msm, times=seq(0,20,2))
@
<>=
plot.prevalence.msm(cav.msm, mintime=0, maxtime=20)
@

Comparing the observed and expected percentages in each state, we see that the number of individuals who die (State 4) is under-estimated by the model from about year 8 onwards. Similarly, the number of individuals still alive and free of CAV (State 1) is over-estimated by the model from about year 8 onwards.

Such discrepancies could have many causes. One possibility is that the transition rates vary with the time since the beginning of the process, the age of the patient, or some other omitted covariate, so that the Markov model is {\em non-homogeneous}. This could be accounted for by modelling the intensities as functions of age, for example as piecewise-constant functions. The \Rfunarg{pci} argument to \Rfunction{msm} can be used to automatically construct models with transition intensities which are piecewise-constant in time. In this example, the hazard of death may increase with age, so that the model underestimates the number of deaths when forecasting far into the future.

Another cause of poor model fit may sometimes be the failure of the Markov assumption. That is, the transition intensities may depend on the time spent in the current state (a semi-Markov process) or on other characteristics of the process history. Accounting for the process history is difficult, as the process is only observed through a series of snapshots. Semi-Markov models may in principle be fitted to this type of data using phase-type distributions. Since version 1.4.1, the \Rfunarg{phase.states} option to \Rfunction{msm} can be used to define some phase-type models. See \Rfunction{help(msm)} for further details.

However, if it is known that individuals who died would not have been followed up after a certain time, had they survived to that time, then they should not be included in the observed prevalence of the death state after that time. This can be accounted for by passing a vector of maximum potential follow-up times, one for each individual in the same order as the original data, in the \Rfunarg{censtime} argument to \Rfunction{prevalence.msm}. Ignoring the potential follow-up times is likely to have resulted in overestimates of the number of deaths at later times for the ``observed'' prevalences in the CAV example, though these times are not available in the data supplied with \Rpackage{msm}.

\paragraph{Pearson-type goodness-of-fit test} \label{sec:pearson}

Suppose that the true transition times are unknown, and the data consist of observations of the process at arbitrary times which differ between individuals (panel data). Assessing goodness of fit by prevalence counts then involves estimating the observed prevalence at a series of points by some form of interpolation. This is only advisable if observation times are close together. An alternative method of assessing goodness-of-fit is to construct tables of observed and expected numbers of transitions, as described by Aguirre-Hernandez and Farewell \cite{ahf}. This leads to a formal test of goodness-of-fit, analogous to the classical Pearson $\chi^2$ test for contingency tables. The tables are constructed as follows.
Each pair of successive observations in the data (a \emph{transition}) is classified by
\begin{itemize}
\item the starting state $r$ and finishing state $s$,
\item the time between the start of the process and the first of the pair of observations (indexed by $h$),
\item the time interval between the observations (indexed by $l_h$, within categories $h$),
\item (if there are fitted covariates) the impact of covariates, as summarised by $q_{rr}$ (indexed by $c$),
\item any other grouping of interest for diagnosing lack of fit (indexed by $g$).
\end{itemize}
Groupings of continuous quantities are normally defined by quantiles, so that there are a similar number of observed transitions in each (one-dimensional) category. The observed and expected numbers of transitions in each group are then defined by
\[ o_{hl_h rscg} = \sum I(S(t_{i,j+1}) = s, S(t_{ij}) = r) \]
\[ e_{hl_h rscg} = \sum P(S(t_{i,j+1}) = s | S(t_{ij}) = r) \]
where $I(A)$ is the indicator function for an event $A$ and the summation is over the set of transitions in the category defined by $h,l_h,c,g$, over all individuals $i$. The Pearson-type test statistic is then
\[ T = \sum_{hl_h rscg} \frac{(o_{hl_h rscg} - e_{hl_h rscg})^2}{e_{hl_h rscg}} \]
The classical Pearson test statistic is distributed as $\chi^2_{n-p}$, where $n$ is the number of independent cells in the table and $p$ is the number of estimated parameters. But the null distribution of $T$ is not exactly $\chi^2$, since the time intervals are non-identical, so the observed transitions are realizations from a set of independent but non-identical multinomial distributions. Titman \cite{titman:asympnull} showed that the null distribution of $T$ is asymptotically approximated by a weighted sum of $\chi^2_1$ random variables. Aguirre-Hernandez and Farewell \cite{ahf} also showed that $\chi^2_{n-p}$ is a good approximation if there are no fitted covariates. For models with covariates, the null mean of $T$ is higher than $n - p$, but lower than $n$. Therefore, upper and lower bounds for the true $p$-value of the statistic can be obtained from the $\chi^2_{n-p}$ and $\chi^2_n$ distributions. Aguirre-Hernandez and Farewell \cite{ahf} also described a bootstrap procedure for obtaining an accurate $p$-value.

Titman and Sharples \cite{titman:sharples} described modifications to the test to correct for the biases introduced when, in addition to the panel-data observation scheme:
\begin{itemize}
\item Times of death are known exactly. In this case, transitions ending in death are classified according to the next scheduled observation time after the death, which is estimated by multiple imputation from a Kaplan-Meier estimate of the distribution of time intervals between observations.
\item An individual's final observation is censored, so that they are only known to be alive at that point.
\item States are misclassified.
\end{itemize}

The \Rpackage{msm} package provides the function \Rfunction{pearson.msm} to perform the Pearson-type test. By default, three groups are used for each of $h$, $l_h$ and $c$. Often the number of groups will need to be reduced in cases where the resulting contingency tables are sparse (so that there are several low expected counts and the variance of $T$ is inflated). The test is now performed on the model \Robject{cav.msm} for the heart transplant dataset (a version of which was also analysed by Titman and Sharples \cite{titman:sharples}). The default three interval groups are used, with two groups for the time since the start of the process.
The \Rfunarg{transitions} argument groups the transitions from state 3 to each of states 1, 2 and 3 (the 9th, 10th and 11th transitions) together in the table, since these transitions are infrequent.

<<>>=
options(digits=2)
pearson.msm(cav.msm, timegroups=2, transitions=c(1,2,3,4,5,6,7,8,9,9,9,10))
@

The first two tables in the output show the contingency tables of observed and expected numbers of transitions. Note that the observed number of transitions in certain categories is not a whole number, since these are averaged over multiple imputations of the next scheduled observation time following deaths. The column \texttt{Time} is the group defined by the time since the start of the process, and the column \texttt{Interval} is the group defined by the intervals between observations. The columns indicate the allowed transitions, or pairs of states which can be observed in successive observations. The third table presents the ``deviance'', the value of $\frac{(o_{hl_h rscg} - e_{hl_h rscg})^2}{e_{hl_h rscg}}$ for each cell, multiplied by the sign of $o_{hl_h rscg} - e_{hl_h rscg}$ to indicate whether there were more or fewer transitions than expected in each cell. These can indicate areas of bad fit. For example, systematic changes in deviance by time, or by the time interval between observations, can indicate that a model with time-varying transition intensities is more suitable. Changes in deviance by covariate impact may indicate heterogeneity between individuals which is unexplained by the fitted covariates. Changes in deviance with the length of the interval between observations may also indicate failure of the Markov assumption, and that a semi-Markov model (in which intensities depend on the time spent in the current state) may fit better.

In this example, the test statistic is 100. \Robject{p.upper} is an upper bound for the $p$-value of the statistic, based on an asymptotic $\chi^2_{42}$ distribution. A statistic of 100 is far into the upper tail of this distribution, so even this upper bound is very small, and the model does not fit well. It is not clear from the table of deviances which aspects of the fit are most influential to the test statistic. However, the two-way Markov model itself is not biologically plausible, as discussed in Section \ref{sec:fitting:hmm:misc}.

For non-hidden Markov models for panel data, \Rfunction{pearson.msm} also presents the accurate analytic $p$-value of Titman \cite{titman:asympnull}. For all models, \Rfunction{pearson.msm} provides an option for parametric bootstrapping to obtain an accurate $p$-value.

\subsection{Fitting misclassification models with \Rpackage{msm}} \label{sec:fitting:hmm:misc}

In fact, in the heart transplant example from section \ref{sec:datain}, it is not medically realistic for patients to recover from a diseased state to a healthy state. Progression of coronary artery vasculopathy is thought to be an irreversible process. The angiography scan for CAV is actually subject to error, which leads to some false measurements of CAV states and apparent recoveries. Thus we account for misclassification by fitting a \emph{hidden Markov model} using \Rpackage{msm}. Firstly, we replace the two-way multi-state model by a one-way model with transition intensity matrix
\[ Q = \left( \begin{array}{llll}
-(q_{12} + q_{14}) & q_{12} & 0 & q_{14}\\
0 & -(q_{23}+q_{24}) & q_{23} & q_{24}\\
0 & 0 & -q_{34} & q_{34}\\
0 & 0 & 0 & 0 \\
\end{array} \right ) \]
We also assume that true state 1 (CAV-free) can be classified as state 1 or 2, state 2 (mild/moderate CAV) can be classified as state 1, 2 or 3, while state 3 (severe CAV) can be classified as state 2 or 3.
Recall that state 4 represents death. Thus our matrix of misclassification probabilities is
\[ E = \left( \begin{array}{llll}
1 - e_{12} & e_{12} & 0 & 0 \\
e_{21} & 1 - e_{21} - e_{23} & e_{23} & 0 \\
0 & e_{32} & 1 - e_{32} & 0 \\
0 & 0 & 0 & 0\\
\end{array} \right) \]
with underlying states as rows, and observed states as columns.

To model observed states with misclassification, we define a matrix \Rfunarg{ematrix} indicating the states that can be misclassified. Rows of this matrix correspond to true states, columns to observed states. It should contain zeroes in the positions where misclassification is not permitted. Non-zero entries are initial values for the corresponding misclassification probabilities. We then call \Rfunction{msm} as before, but with this matrix as the \Rfunarg{ematrix} argument. Initial values of 0.1 are assumed for each of the four misclassification probabilities $e_{12}, e_{21}, e_{23}, e_{32}$. Zeroes are given where the elements of $E$ are zero. The diagonal elements supplied in \Rfunarg{ematrix} are ignored, as rows must sum to one. The matrix \Rfunarg{qmatrix}, specifying permitted transition intensities and their initial values, also changes to correspond to the new $Q$ representing the progression-only model for the underlying states.

The true state for every patient at the date of transplant is known to be ``CAV-free'', not misclassified. To indicate this, we use the argument \Rfunarg{obstrue} to \Rfunction{msm}. This is set to be a variable in the dataset, \Rfunarg{firstobs}, indicating where the observed state equals the true state. This takes the value of 1 at the patient's first observation, at the transplant date, and 0 elsewhere.

The default BFGS quasi-Newton optimisation algorithm is used, which is often faster and more robust than the Nelder-Mead simplex-based algorithm.

An optional argument \Rfunarg{initprobs} could also have been given here, representing the vector of the probabilities of occupying each true state at the initial observation (equation \ref{eq:multi:hidden:matprod}). This can also be a matrix with number of rows equal to the number of subjects, if these probabilities are subject-dependent and known. If not given, all individuals are assumed to be in true state 1 at their initial observation. If \Rfunarg{est.initprobs=TRUE} is specified, then these probabilities are estimated as part of the model fit, using the vector \Rfunarg{initprobs} as initial values. Covariate effects on these probabilities can also be estimated using a multinomial logistic regression model, if an \Rfunarg{initcovariates} argument is specified. See \Rfunction{help(msm)} for further details.

\paragraph{Model 5: multi-state model with misclassification}

<<>>=
Qm <- rbind(c(0, 0.148, 0, 0.0171), c(0, 0, 0.202, 0.081),
            c(0, 0, 0, 0.126), c(0, 0, 0, 0))
ematrix <- rbind(c(0, 0.1, 0, 0), c(0.1, 0, 0.1, 0),
                 c(0, 0.1, 0, 0), c(0, 0, 0, 0))
cavmisc.msm <- msm(state ~ years, subject = PTNUM, data = cav,
                   qmatrix = Qm, ematrix = ematrix, deathexact = 4,
                   obstrue = firstobs)
cavmisc.msm
@

Thus there is an estimated probability of about 0.03 that mild/moderate CAV will be diagnosed erroneously in a CAV-free patient, but a rather higher probability of 0.17 that underlying mild/moderate CAV will be diagnosed as CAV-free. Between the two CAV states, the mild state will be misdiagnosed as severe with a probability of 0.06, and the severe state will be misdiagnosed as mild with a probability of 0.12.
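The mean sojourn times in the underlying states, summarised in the next paragraph, can be extracted in the usual way (a sketch; output not shown):

\begin{Scode}
sojourn.msm(cavmisc.msm)
\end{Scode}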
The model also estimates the progression rates through underlying states. An average of 8 years is spent disease-free, an average of about 3 years is spent with mild/moderate disease, and periods of severe disease also last about 3 years on average before death.

\subsection{Effects of covariates on misclassification rates}

We can investigate how the probabilities of misclassification depend on covariates in a similar way to the transition intensities, using a \Rfunarg{misccovariates} argument to \Rfunction{msm}. For example, we now include female sex as a covariate for the misclassification probabilities. The linear effects on the log odds of each misclassified state relative to the true state are initialised to zero by default (but this can be changed with the \Rfunarg{misccovinits} argument).

\paragraph{Model 6: misclassification model with misclassification probabilities modelled on sex}

<<>>=
cavmiscsex.msm <- msm(state ~ years, subject = PTNUM, data = cav,
                      qmatrix = Qm, ematrix = ematrix, deathexact = 4,
                      misccovariates = ~sex, obstrue=firstobs)
@
<<>>=
cavmiscsex.msm
@

The large confidence interval for the odds ratio for 1/2 misclassification suggests there is no information in the data about the difference between genders in the false positive rates for angiography. On the other hand, women have slightly more false negatives.

\subsection{Extractor functions}

As well as the functions described in section \ref{sec:extractor} for extracting useful information from fitted models, there are a number of extractor functions specific to models with misclassification.

\begin{description}
\item[Misclassification matrix] The function \Rfunction{ematrix.msm} gives the estimated misclassification probability matrix at the given covariate values. For illustration, the fitted misclassification probabilities for men and women in model 6 are given by

<<>>=
ematrix.msm(cavmiscsex.msm, covariates=list(sex=0))
ematrix.msm(cavmiscsex.msm, covariates=list(sex=1))
@

The confidence intervals for the estimates for women are wider, since there are only 87 women in this set of 622 patients.

\item[Odds ratios for misclassification] The function \Rfunction{odds.msm} gives the estimated odds ratios corresponding to each covariate effect on the misclassification probabilities.

\begin{Scode}
odds.msm(cavmiscsex.msm)
\end{Scode}

\item[Observed and expected prevalences] The function \Rfunction{prevalence.msm} is intended to assess the goodness of fit of the hidden Markov model for the \emph{observed} states to the data. Tables of observed prevalences of the observed states are calculated as described in section \ref{sec:model-assessment}, by assuming that observed states are retained between observation times. The expected numbers of individuals in each observed state are calculated similarly. Suppose the process begins at a common time for all individuals, and at this time the probability of occupying \emph{true} state $r$ is $f_r$. Then, given $n(t)$ individuals under observation at time $t$, the expected number of individuals in true state $r$ at time $t$ is the $r$th element of the vector $n(t) f P(t)$. Thus the expected number of individuals in \emph{observed} state $r$ is the $r$th element of the vector $n(t) f P(t) E$, where $E$ is the misclassification probability matrix. The expected prevalences (not shown) for this example are similar to those forecast by the model without misclassification, with underestimates of the rates of death from 8 years onwards.
To improve this model's long-term prediction ability, it is probably necessary to account for the natural increase in the hazard of death from any cause as people become older.

\item[Goodness-of-fit test] The Pearson-type goodness-of-fit test is performed as in Section \ref{sec:pearson}. The table of deviances indicates that there are more 1-3 and 1-4 transitions than expected in short intervals, and fewer in long intervals. This may indicate some time-dependence in the transition rates. Indeed, Titman \cite{titman:phd} found that a model with piecewise-constant transition intensities gave a greatly improved fit to these data.

<<>>=
pearson.msm(cavmisc.msm, timegroups=2, transitions=c(1,2,3,4,5,6,7,8,9,9,9,10))
@

\end{description}

\subsection{Recreating the path through underlying states}

In speech recognition and signal processing, {\em decoding} is the procedure of determining the underlying states that are most likely to have given rise to the observations. The most common method of reconstructing the most likely state path is the {\em Viterbi} algorithm. Originally proposed by Viterbi \cite{viterbi}, it is also described by Durbin \etal \cite{biolog:seq} and Macdonald and Zucchini \cite{macdonald:zucchini} for discrete-time hidden Markov chains. For continuous-time models it proceeds as follows. Suppose that a hidden Markov model has been fitted and a Markov transition matrix $P(t)$ and misclassification matrix $E$ are known. Let $v_k(t_i)$ be the probability of the most probable path ending in state $k$ at time $t_i$.

\begin{enumerate}
\item Estimate $v_k(t_1)$ using known or estimated initial-state occupation probabilities (\texttt{initprobs}), each multiplied by the probability of the observed outcome at the initial time, given that the true initial state is $k$.
\item For $i = 2 \ldots N$, calculate $v_l(t_i) = e_{l,O_{t_i}} \max_k v_k(t_{i-1}) P_{kl}(t_{i} - t_{i-1})$. Let $K_i(l)$ be the maximising value of $k$.
\item At the final time point $t_N$, the most likely underlying state $S^*_N$ is the value of $k$ which maximises $v_k(t_N)$.
\item Trace back through the time points, setting $S^*_{i-1} = K_i(S^*_i)$.
\end{enumerate}

The computations should be done in log space to prevent underflow. The \Rpackage{msm} package provides the function \Rfunction{viterbi.msm} to implement this method. For example, the following is an extract from the result of calling \Rfunction{viterbi.msm} to determine the most likely underlying states for all patients. The results are shown for patient 100103, who appeared to `recover' to a less severe state of disease after being observed in state 3. We assume this is not biologically possible for the true states, so we expect that either the observation of state 3 at time 4.98 was an erroneous observation of state 2, or their apparent state 2 at time 5.94 was actually state 3. According to the most likely path constructed using the Viterbi algorithm, it is the observation at time 5.94 which is most probably misclassified.

<<>>=
vit <- viterbi.msm(cavmisc.msm)
vit[vit$subject==100103,]
@

\subsection{Fitting general hidden Markov models with \Rpackage{msm}} \label{sec:fitting:hmm:general}

The \Rpackage{msm} package provides a framework for fitting continuous-time hidden Markov models with general, continuous outcomes. As before, we use the \Rfunction{msm} function itself.
\paragraph{Specifying the hidden Markov model} A hidden Markov model consists of two related components:
\begin{itemize}
\item the model for the evolution of the underlying Markov chain,
\item the set of models for the observed data conditionally on each underlying state.
\end{itemize}
The model for the transitions between underlying states is specified as before, by supplying a \Rfunarg{qmatrix}. The model for the outcomes is specified using the argument \Rfunarg{hmodel} to \Rfunction{msm}. This is a list, with one element for each underlying state, in order. Each element of the list should be an object returned by a hidden Markov model \emph{constructor function}. The HMM constructor functions provided with \Rpackage{msm} are listed in Table \ref{tab:hmm:dists}. There is a separate constructor function for each class of outcome distribution, such as uniform, normal or gamma.

Consider a three-state hidden Markov model, with a transition intensity matrix of
\[ Q = \left( \begin{array}{lll}
-q_{12} & q_{12} & 0 \\
0 & -q_{23} & q_{23}\\
0 & 0 & 0 \\
\end{array} \right ) \]

Suppose the outcome distribution for state 1 is Normal$(\mu_1, \sigma^2_1)$, the distribution for state 2 is Normal$(\mu_2, \sigma^2_2)$, and state 3 is exactly observed. Observations of state 3 are given a label of -9 in the data. Here our \Rfunarg{hmodel} argument should be a list of objects returned by \Rfunction{hmmNorm} and \Rfunction{hmmIdent} constructor functions. We must specify initial values for the parameters as the arguments to the constructor functions. For example, we take initial values of $\mu_1 = 90, \sigma_1 = 8, \mu_2 = 70, \sigma_2 = 8$. Initial values for $q_{12}$ and $q_{23}$ are 0.25 and 0.2. Finally, suppose the observed data are in a variable called \Robject{y}, the measurement times are in \Robject{time}, and subject identifiers are in \Robject{ptnum}.
The call to \Rfunction{msm} to estimate the parameters of this hidden Markov model would then be
\begin{Scode}
msm( y ~ time, subject=ptnum, data = example.df,
     qmatrix = rbind( c(0, 0.25, 0), c(0, 0, 0.2), c(0, 0, 0) ),
     hmodel = list (hmmNorm(mean=90, sd=8),
                    hmmNorm(mean=70, sd=8),
                    hmmIdent(-9)) )
\end{Scode}

\begin{table}[htbp]
\scriptsize
\centering
\begin{tabular}{lp{0.8in}p{0.6in}p{0.6in}p{0.7in}l}
\hline
Function & Distribution & Parameters & & Location (link) & Density for an observation $x$ \\
\hline
\Rfunction{hmmCat} & Categorical & \Rfunarg{prob, basecat} & $p,c_0$ & $p$ (logit) & $p_x$, $x = 1, \ldots, n$ \\
\Rfunction{hmmIdent} & Identity & \Rfunarg{x} & $x_0$ & & $I_{x = x_0}$ \\
\Rfunction{hmmUnif} & Uniform & \Rfunarg{lower, upper} & $l,u$ & & $1 / (u - l)$, $l \leq x \leq u$ \\
\Rfunction{hmmNorm} & Normal & \Rfunarg{mean, sd} & $\mu,\sigma$ & $\mu$ (identity) & $\phi(x, \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-(x - \mu)^2/(2 \sigma^2) )$ \\
\Rfunction{hmmLNorm} & Log-normal & \Rfunarg{meanlog, sdlog} & $\mu,\sigma$ & $\mu$ (identity) & $\frac{1}{x \sqrt{2 \pi \sigma^2}} \exp(-(\log x - \mu)^2 / (2 \sigma^2))$ \\
\Rfunction{hmmExp} & Exponential & \Rfunarg{rate} & $\lambda$ & $\lambda$ (log) & $\lambda e^{- \lambda x}$, $x > 0$ \\
\Rfunction{hmmGamma} & Gamma & \Rfunarg{shape, rate} & $n,\lambda$ & $\lambda$ (log) & $\frac{\lambda^n}{\Gamma(n)}x^{n-1} \exp(-\lambda x)$, $x > 0, n > 0, \lambda > 0$ \\
\Rfunction{hmmWeibull} & Weibull & \Rfunarg{shape, scale} & $a, b$ & $b$ (log) & $\frac{a}{b} (\frac{x}{b})^{a-1} \exp{(-(\frac{x}{b})^a)}$, $x > 0$ \\
\Rfunction{hmmPois} & Poisson & \Rfunarg{rate} & $\lambda$ & $\lambda$ (log) & $\lambda^x \exp(-\lambda)/x!$, $x = 0, 1, 2, \ldots$ \\
\Rfunction{hmmBinom} & Binomial & \Rfunarg{size, prob} & $n,p$ & $p$ (logit) & ${n \choose x} p^x (1-p)^{n-x}$ \\
\Rfunction{hmmNBinom} & Negative binomial & \Rfunarg{disp, prob} & $n,p$ & $p$ (logit) & $\Gamma(x+n)/(\Gamma(n)\,x!)\; p^n (1-p)^x$ \\
\Rfunction{hmmBetaBinom} & Beta-binomial & \Rfunarg{size, meanp, sdp} & $n,\mu,\sigma$ & $\mu$ (logit) & ${n \choose x} Beta(x+a,n-x+b)/Beta(a,b)$ \\
 & & & & & where $a=\mu/\sigma, b=(1-\mu)/\sigma$ \\
\Rfunction{hmmBeta} & Beta & \Rfunarg{shape1,shape2} & $a,b$ & & $ \Gamma(a+b) / (\Gamma(a)\Gamma(b))x^{a-1}(1-x)^{b-1}$ \\
\Rfunction{hmmT} & Student $t$ & \Rfunarg{mean, scale, df} & $\mu,\sigma,k$ & $\mu$ (identity) & $\frac{\Gamma\left((k+1)/2\right)}{\Gamma(k/2)}{\sqrt{\frac{1}{k\pi\sigma^2}}}\left\{1 + \frac{1}{k\sigma^2}(x - \mu)^{2} \right\}^{-(k + 1)/2}$ \\
\Rfunction{hmmTNorm} & Truncated normal & \Rfunarg{mean, sd, lower, upper} & $\mu,\sigma,l,u$ & $\mu$ (identity) & \parbox[t]{2in}{$\phi(x, \mu, \sigma) / \\ (\Phi(u, \mu, \sigma) - \Phi(l, \mu, \sigma))$, \\ where $\Phi(x,\mu,\sigma) = \int_{-\infty}^x \phi(u,\mu,\sigma) du$} \\
\Rfunction{hmmMETNorm} & Normal with truncation and measurement error & \Rfunarg{mean, sd, lower, upper, sderr, meanerr} & \parbox[t]{1in} {$\mu_0,\sigma_0,l,u$, \\ $\sigma_\epsilon,\mu_\epsilon$} & $\mu_\epsilon$ (identity) & \parbox[t]{2in}{$( \Phi(u, \mu_2, \sigma_3) - \Phi(l, \mu_2, \sigma_3)) / $ \\ $(\Phi(u, \mu_0, \sigma_0) - \Phi(l, \mu_0, \sigma_0)) $ \\ $\times \phi(x, \mu_0 + \mu_\epsilon, \sigma_2)$, \\ $\sigma_2^2 = \sigma_0^2 + \sigma_\epsilon^2$, \\ $\sigma_3 = \sigma_0 \sigma_\epsilon / \sigma_2$, \\ $\mu_2 = ((x - \mu_\epsilon) \sigma_0^2 + \mu_0 \sigma_\epsilon^2) / \sigma_2^2$} \\
\Rfunction{hmmMEUnif} & Uniform with measurement error & \Rfunarg{lower, upper, sderr, meanerr} & $l,u,\mu_\epsilon,\sigma_\epsilon$ & $\mu_\epsilon$ (identity) & \parbox[t]{2in}{$(\Phi(x, \mu_\epsilon+l, \sigma_\epsilon) - \Phi(x, \mu_\epsilon+u, \sigma_\epsilon)) / \\ (u - l)$} \\
\hline
\end{tabular}
\caption{Hidden Markov model distributions in \Rpackage{msm}.}
\label{tab:hmm:dists}
\end{table}

\paragraph{Covariates on hidden Markov model parameters} Most of the outcome distributions can be parameterised by covariates, using a link-transformed linear model. For example, an observation $y_{ij}$ may have distribution $f_1$ conditionally on underlying state 1. The link-transformed parameter $\theta_1$ is a linear function of the covariate vector $x_{ij}$ at the same observation time.
\begin{eqnarray*}
\label{eq:hmm:covs}
y_{ij} | S_{ij} & \sim & f_1 (y | \theta_1, \gamma_1)\\
g(\theta_1) & = & \alpha + \beta^T x_{ij}
\end{eqnarray*}
Specifically, parameters named as the ``Location'' parameter in Table \ref{tab:hmm:dists} can be modelled in terms of covariates, with the given link function.

The \Rfunarg{hcovariates} argument to \Rfunction{msm} specifies the model for covariates on the hidden Markov outcome distributions. This is a list of the same length as the \Rfunarg{hmodel} list, with one element for each underlying state. Each element of the list is a formula, in standard R linear model notation, defining the covariates on the distribution for the corresponding state. If there are no covariates for a certain hidden state, then insert a \texttt{NULL} in the corresponding place in the list. For example, in the three-state normal-outcome example above, suppose that the normal means on states 1 and 2 are parameterised by a single covariate $x$.
\[ \mu_1 = \alpha_1 + \beta_1 x_{ij}, \qquad \mu_2 = \alpha_2 + \beta_2 x_{ij}.
\]
The equivalent call to \Rfunction{msm} would be
\begin{Scode}
msm( state ~ time, subject=ptnum, data = example.df,
     qmatrix = rbind( c(0, 0.25, 0), c(0, 0, 0.2), c(0, 0, 0) ),
     hmodel = list (hmmNorm(mean=90, sd=8),
                    hmmNorm(mean=70, sd=8),
                    hmmIdent(-9)),
     hcovariates = list ( ~ x, ~ x, NULL) )
\end{Scode}

\paragraph{Constraints on hidden Markov model parameters} Sometimes it is realistic that parameters are shared between some of the state-specific outcome distributions. For example, the Normally-distributed outcome in the previous example could have a common variance $\sigma^2_1 = \sigma^2_2 = \sigma^2$ between states 1 and 2, but differing means. It would also be realistic for any covariates on the mean to have a common effect $\beta_1 = \beta_2 = \beta$ on the state 1 and 2 outcome distributions.

The argument \Rfunarg{hconstraint} to \Rfunction{msm} specifies which hidden Markov model parameters are constrained to be equal. This is a named list. Each element is a vector of constraints on the named hidden Markov model parameter. The vector has length equal to the number of times that class of parameter appears in the whole model. As for the other constraint arguments such as \Rfunarg{qconstraint}, identical values of this vector indicate parameters constrained to be equal.

For example, consider the three-state hidden Markov model described above, with normally-distributed outcomes for states 1 and 2. To constrain the outcome variance to be equal for states 1 and 2, and to also constrain the effect of \Robject{x} on the outcome mean to be equal for states 1 and 2, specify
\begin{Scode}
hconstraint = list(sd = c(1,1), x = c(1,1))
\end{Scode}

Parameters of the outcome distributions may also be constrained within specific ranges. If chosen carefully, this may improve identifiability of hidden Markov states. For example, to constrain the mean for state 1 to be between 80 and 110, and the mean for state 2 to be between 50 and 80, specify
\begin{Scode}
hranges = list(mean = list(lower = c(80,50), upper = c(110,80)))
\end{Scode}
Maximum likelihood estimation is then performed on the appropriate log or logit-transformed scale so that these constraints are satisfied. See the \Rfunction{msm} help page for further details. Note that initial values should be strictly within the ranges, and not on the range boundary.

\paragraph{FEV$_1$ after lung transplants} Now we give an example of fitting a hidden Markov model to a real dataset. The data on FEV$_1$ measurements from lung transplant recipients, described in Section \ref{sec:hmm:example:fev}, are provided with the \Rpackage{msm} package in a dataset called \Robject{fev}. We fit Models 1 and 2, each with three states and a common $Q$ matrix.

<<>>=
three.q <- rbind(c(0, exp(-6), exp(-9)), c(0, 0, exp(-6)), c(0, 0, 0))
@

The simpler Model 1 is specified as follows. Under this model the FEV$_1$ outcome is Normal, with unknown means and variances differing between BOS state 1 and state 2. \Rfunarg{hcovariates} specifies that the mean of the Normal outcome depends linearly on acute events. Specifically, this covariate is an indicator for the occurrence of an acute event within 14 days of the observation, denoted \texttt{acute} in the data. As an initial guess, we suppose the mean FEV$_1$ is 100\% baseline in state 1, and 54\% baseline in state 2, with corresponding standard deviations 16 and 18, and that FEV$_1$ observations coinciding with acute events are on average 8\% baseline lower.
\Rfunarg{hconstraint} specifies that the acute event effect is equal between state 1 and state 2. Days of death are coded as 999 in the \texttt{fev} outcome variable.

<<>>=
hmodel1 <- list(hmmNorm(mean=100, sd=16), hmmNorm(mean=54, sd=18),
                hmmIdent(999))
fev1.msm <- msm(fev ~ days, subject=ptnum, data=fev, qmatrix=three.q,
                deathexact=3, hmodel=hmodel1,
                hcovariates=list(~acute, ~acute, NULL),
                hcovinits = list(-8, -8, NULL),
                hconstraint = list(acute = c(1,1)))
fev1.msm
sojourn.msm(fev1.msm)
@

Printing the \Rclass{msm} object \Robject{fev1.msm} shows estimates and confidence intervals for the transition intensities and the hidden Markov model parameters. The estimated within-state means of FEV$_1$ are around 98\% and 52\% baseline, respectively. From the estimated transition intensities, individuals spend around 1421 days (3.9 years) before getting BOS, after which they live for an average of 1248 days (3.4 years). FEV$_1$ is lower by an average of 8\% baseline within 14 days of acute events.

Model 2, where the outcome distribution is a more complex two-level model, is specified as follows. We use the distribution defined by equations \ref{eq:fev:level1}--\ref{eq:fev:level2}. The \Rfunction{hmmMETNorm} constructor defines the truncated normal outcome with an additional normal measurement error. The explicit probability density for this distribution is given in Table \ref{tab:hmm:dists}. Our initial values are 90 and 54 for the means of the within-state distribution of \emph{underlying} FEV$_1$, and 16 and 18 for the standard deviations. This time, underlying FEV$_1$ is truncated normal. The truncation limits \Rfunarg{lower} and \Rfunarg{upper} are not estimated. We take an initial measurement error standard deviation of \Rfunarg{sderr=8}. The extra shift \Rfunarg{meanerr} in the measurement error model is fixed to zero and not estimated. The \Rfunarg{hconstraint} specifies that the measurement error variance $\sigma^2_\epsilon$ is equal between responses in states 1 and 2, as is the effect of short-term acute events on the FEV$_1$ response.

The convergence of maximum likelihood estimation in this example is particularly sensitive to the optimisation method and options, the initial values, the unit of the time variable and whether covariates are centered, probably because the likelihood surface is irregular near the true maximum.

\begin{Scode}
hmodel2 <- list(hmmMETNorm(mean=90, sd=16, sderr=8,
                           lower=80, upper=Inf, meanerr=0),
                hmmMETNorm(mean=54, sd=18, sderr=8,
                           lower=0, upper=80, meanerr=0),
                hmmIdent(999))
fev2.msm <- msm(fev ~ days, subject=ptnum, data=fev, qmatrix=three.q,
                deathexact=3, hmodel=hmodel2,
                hcovariates=list(~acute, ~acute, NULL),
                hcovinits = list(-8, -8, NULL),
                hconstraint = list(sderr = c(1,1), acute = c(1,1)),
                control=list(maxit=10000), center=TRUE)
\end{Scode}

Under this model the standard deviation of FEV$_1$ measurements caused by measurement error (more realistically, natural short-term fluctuation) is around 9\% baseline. The estimated effect of acute events on FEV$_1$, and the sojourn times in the BOS-free state and in BOS before death, are similar to Model 1.

The following code will create a plot that illustrates a trajectory of declining FEV$_1$ from the first lung transplant recipient in this dataset. The Viterbi algorithm is used to locate the most likely point at which this individual moved from BOS state 1 to BOS state 2, according to the fitted Model 2. This is illustrated by a vertical dotted line.
This is the point at which the individual's lung function started to remain consistently below 80\% baseline FEV$_1$.

\begin{Scode}
keep <- fev$ptnum==1 & fev$fev<999
plot(fev$days[keep], fev$fev[keep], type="l",
     ylab=expression(paste("% baseline ", FEV[1])),
     xlab="Days after transplant")
vit <- viterbi.msm(fev2.msm)[keep,]
(max1 <- max(vit$time[vit$fitted==1]))
(min2 <- min(vit$time[vit$fitted==2]))
## midpoint between the last time in state 1 and the first in state 2
abline(v = mean(c(max1, min2)), lty=2)
text(max1 - 500, 50, "STATE 1")
text(min2 + 500, 50, "STATE 2")
\end{Scode}
\includegraphics{figures/fev_viterbi}

\paragraph{An alternative way of specifying a misclassification model} This general framework for specifying hidden Markov models can also be used to specify multi-state models with misclassification. A misclassification model is a hidden Markov model with a categorical outcome distribution. So instead of an \Rfunarg{ematrix} argument to \Rfunction{msm}, we can use a \Rfunarg{hmodel} argument with \Rfunction{hmmCat} constructor functions. \Rfunction{hmmCat} takes at least one argument, \Rfunarg{prob}, a vector of probabilities of observing outcomes $1, 2, \ldots, n$ respectively, where $n$ is the length of \Rfunarg{prob}. All outcome probabilities with an initial value of zero are assumed to be fixed at zero. \Rfunarg{prob} is scaled if necessary to sum to one.

The model in section \ref{sec:fitting:hmm:misc} specifies that an individual occupying underlying state 1 can be observed as state 1 or 2, underlying state 2 can be observed as state 1, 2 or 3, state 3 can be observed as state 2 or 3, and underlying state 4 (death) cannot be misclassified. Initial values of 0.1 are given for the 1-2, 2-1, 2-3 and 3-2 misclassification probabilities.

This is equivalent to the model below, specified using a \Rfunarg{hmodel} argument to \Rfunction{msm}. The maximum likelihood estimates should be the same as before (Model 5) if \Rfunarg{obstrue=firstobs} is removed from the \Rfunction{msm()} call. \Rfunarg{obstrue} has a slightly different meaning for models specified with \Rfunarg{hmodel}. If supplied, the variable indicated by \Rfunarg{obstrue} should contain the true state if this is known, and \Robject{NA} otherwise, whereas the \Rfunarg{state} variable contains an observation generated from the HMM outcome model given the (unobserved) true state. For models specified with \Rfunarg{ematrix}, the \Rfunarg{state} variable contains the (observed) true state itself. Thus the \Rfunarg{hmodel} specification is not strictly suitable for the CAV data, since the true state is both \emph{known} and \emph{observed} at the time of transplant.

\begin{Scode}
Qm <- rbind(c(0, 0.148, 0, 0.0171), c(0, 0, 0.202, 0.081),
            c(0, 0, 0, 0.126), c(0, 0, 0, 0))
cavmisc.msm <- msm(state ~ years, subject = PTNUM, data = cav,
                   hmodel = list (hmmCat(c(0.9, 0.1, 0, 0)),
                                  hmmCat(c(0.1, 0.8, 0.1, 0)),
                                  hmmCat(c(0, 0.1, 0.9, 0)),
                                  hmmIdent(4)),
                   qmatrix = Qm, deathexact = 4)
cavmisc.msm
\end{Scode}

\subsubsection{Hidden Markov models with multivariate outcomes}

Since version 1.5.2, \Rpackage{msm} can fit models in which multiple outcomes are generated at each time point, conditionally on a single hidden Markov state. The outcomes must be independent conditionally on the hidden state, but they may be generated from the same or different univariate distributions. See \Rfunction{help(hmmMV)} for detailed documentation and a worked example; a minimal sketch of the interface is given below.
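This sketch assumes a hypothetical data frame \Robject{example.df} containing two outcome variables \texttt{y1} and \texttt{y2} at each observation time, generated conditionally on a two-state hidden process with normally-distributed outcomes; the initial values are arbitrary. Consult \Rfunction{help(hmmMV)} for the authoritative interface.

\begin{Scode}
## combine the two outcomes into a matrix column of the data frame
example.df$y <- cbind(example.df$y1, example.df$y2)
msm( y ~ time, subject=ptnum, data = example.df,
     qmatrix = rbind(c(0, 0.1), c(0, 0)),
     hmodel = list(hmmMV(hmmNorm(mean=90, sd=8), hmmNorm(mean=25, sd=5)),
                   hmmMV(hmmNorm(mean=70, sd=8), hmmNorm(mean=15, sd=5))) )
\end{Scode}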
\subsubsection{Defining a new hidden Markov model distribution}

Suppose the hidden Markov model outcome distributions supplied with \Rpackage{msm} (Table \ref{tab:hmm:dists}) are insufficient. We want to define our own univariate distribution, called \Rfunction{hmmNewDist}, taking two parameters \Robject{location} and \Robject{scale}. Download the source package, for example \texttt{msm-0.7.2.tar.gz} for version 0.7.2, from CRAN and edit the files in it, as follows.

\begin{enumerate}
\item Add an element to \Robject{.msm.HMODELPARS} in the file \texttt{R/constants.R}, naming the parameters of the distribution. For example
\begin{Scode}
newdist = c('location', 'scale')
\end{Scode}
\item Add a corresponding element to the C variable \texttt{HMODELS} in the file \texttt{src/lik.c}. This MUST be in the same position as in the \Robject{.msm.HMODELPARS} list. For example,
\begin{Scode}
hmmfn HMODELS[] = { ..., hmmNewDist };
\end{Scode}
\item The new distribution is allowed to have one parameter which can be modelled in terms of covariates. Add the name of this parameter to the named vector \Robject{.msm.LOCPARS} in \texttt{R/constants.R}, for example \texttt{newdist = 'location'}. Specify \texttt{newdist = NA} if there is no such parameter.
\item Suppose we have specified a parameter with a non-standard name, that is, one which doesn't already appear in \Robject{.msm.HMODELPARS}. Standard names include, for example, \texttt{'mean'}, \texttt{'sd'}, \texttt{'shape'} or \texttt{'scale'}. Then we should add the allowed range of the parameter to \Robject{.msm.PARRANGES}. In this example, we add \texttt{location = c(-Inf, Inf)} to \Robject{.msm.PARRANGES}. This ensures that the optimisation to estimate the parameter takes place on a suitable scale, for example, a log scale for a parameter constrained to be positive. If the parameter should be fixed during maximum likelihood estimation (for example, the denominator of a binomial distribution), then add its name to \Robject{.msm.AUXPARS}.
\item Add an R constructor function for the distribution to \texttt{R/hmm-dists.R}. For a simple univariate distribution, this is of the form
\begin{Scode}
hmmNewDist <- function(location, scale) {
    hmmDIST(label = "newdist",
            link = "identity",
            r = function(n) rnewdist(n, location, scale),
            match.call())
}
\end{Scode}
\begin{itemize}
\item The \texttt{'label'} must be the same as the name you supplied for the new element of \Robject{.msm.HMODELPARS}.
\item \texttt{link} is the link function for modelling the location parameter of the distribution as a linear function of covariates. This should be the quoted name of an R function. A log link is \texttt{'log'} and a logit link is \texttt{'qlogis'}. If using a new link function other than \texttt{'identity'}, \texttt{'log'}, or \texttt{'qlogis'}, you should add its name to the vector \Robject{.msm.LINKFNS} in \texttt{R/constants.R}, and add the name of the corresponding inverse link to \Robject{.msm.INVLINK}. You should also add the names of these functions to the C array \texttt{LINKFNS} in \texttt{src/lik.c}, and write the functions if they do not already exist.
\item \texttt{r} is an R function, of the above format, returning a vector of \texttt{n} random values from the distribution. You should write this if it doesn't already exist.
\end{itemize}
\item Add the name of the new constructor function to the NAMESPACE in the top-level directory of the source package.
\item Write a C function to compute the probability density of the distribution, and put this in \texttt{src/hmm.c}, with a declaration in \texttt{src/hmm.h}. This must be of the form
\begin{Scode}
double hmmNewDist(double x, double *pars)
\end{Scode}
where \texttt{*pars} is a vector of the parameters of the distribution, and the density is evaluated at \texttt{x}.
\item (Optionally) Write a C function to compute the derivatives of the probability density with respect to the parameters, and put this in \texttt{src/hmmderiv.c}, with a declaration in \texttt{src/hmm.h}. Then add the model to \texttt{DHMODELS} in \texttt{src/lik.c} (in the same position as in \texttt{HMODELS}) and to \Robject{.msm.HMODELS.DERIV} in \texttt{R/constants.R}. This will generally double the speed of maximum likelihood estimation, but analytic derivatives will not be available for all distributions.
\item Update the documentation (\texttt{man/hmm-dists.Rd}) and the distribution table in \texttt{inst/doc/msm-manual.Rnw}, if you like.
\item Recompile the package (see the ``Writing R Extensions'' manual).
\end{enumerate}

Your new distribution will be available to use in the \Rfunarg{hmodel} argument to \Rfunction{msm}, as, for example,
\begin{Scode}
hmodel = list(..., hmmNewDist(location = 0, scale = 1), ...)
\end{Scode}

If your distribution may be of interest to others, ask me (\texttt{chris.jackson@mrc-bsu.cam.ac.uk}) to include it in a future release.

\clearpage

\section{\Rpackage{msm} reference guide}

The R help page for \Rfunction{msm} gives details of all the allowed arguments and options to the \Rfunction{msm} function. To view this online in R, type:
<>=
help(msm)
@
Similarly, all the other functions in the package have help pages, which should always be consulted in case of doubt about how to call them. The web-browser based help interface may often be convenient: type
<>=
help.start()
@
and navigate to \textsf{Packages} $\ldots$ \textsf{msm}, which brings up a list of all the functions in the package, with links to their documentation, and a link to this manual in PDF format.

\appendix

\section{Changes in the msm package}

For a detailed list of the changes in recent versions of \Rpackage{msm}, see the \texttt{NEWS} file in the top-level directory of the installed package.

Development versions of \Rpackage{msm} can be found on GitHub at \url{https://github.com/chjackson/msm}. These are often more recent than the version released on CRAN, so if you think you have found a bug, then please check first to see whether it has been fixed on the GitHub version.

\section{Get in touch}

If you use \Rpackage{msm} in your work, whatever your field of application, please let me know, for my own interest! Suggestions for improvement are welcome.

\clearpage
\bibliography{msm}

\end{document}