An S-Curve Method for Estimating Abrupt and Gradual Changepoints
Reading through the first example (“Nile River Data”) is sufficient for a quick start.
The code here implements the method of Jiang et al (2023), whch does changepoint detection/estimation for changes in mean, performed by using an S-curve (logistic function) to approximate a step function. This enables asymptotic standard errors, and associated confidence intervals and tests for changepoint locations and change magnitudes. Both abrupt and gradual changes may be modeled.
Also, changes in slope and intercept in piecewise linear models can be analyzed, again with the ability to form confidence intervals for the various quantities of interest..
The Nile dataset, built-in to R, consists of yearly measurements of the height of the river during 1871-1970. Here is the code:
nile <- data.frame(t=1871:1970, ht=Nile)
nileOut <- fitS(nile,1,2,10) # abrupt change model
nileOut
# [1] "point estimates of the alpha_i"
# postMean preMean changePt
# 849.9696 1097.9298 1898.3814
# [1] "covariance matrix"
# postMean preMean changePt
# postMean 228.8062041 0.7764724 -0.5270315
# preMean 0.7764724 609.8103449 -11.2918541
# changePt -0.5270315 -11.2918541 6.1601174
# [1] "standard error of the difference between pre-changepoint and post-changepoint means"
# [1] 28.93205
Our model concluded that there is an abrupt changepoint in the middle of year 1898. This matches the fact that the British started the construction of the Aswan Low Dam in that year. We modeled an abrupt change here, by setting the S-curve slope to a large value, 10.
Here domain expertise would identify the location of the changepoint, so the main interest is the value of the change. A major drop of water flow is detected around this point, from 1097.75 down to 849.972. An approximate 95% confidence interval for the change in mean would then be
849.972 - 1097.75 ± 1.96 x 28.93205
We can also plot the fit against the raw data:
We next consider applying the S-Curve approach to data collected in on the rates of breast cancer among women in Sweden. There had been speculation that such rates rise with the onset of menopause; see Pawitan (2005; we are grateful to Prof. Pawitan for making his data available.
While that paper considers an abrupt model, the relationship, if one exists, may be gradual. For a given woman the transition to menopause is gradual. And even if it were abrupt, different women experience menopause at different ages, so that the data would follow a mixture of abrupt changes, hence gradual overall. Thus this is a good example use case for our S-curve approach.
data(cancerRates)
crOut <- fitS(cancerRates,1,2)
crOut
# [1] "point estimates of the alpha_i"
# postMean preMean slope changePt
# 8.965952 3.425623 1.178302 43.262817
# [1] "covariance matrix"
# postMean preMean slope changePt
# postMean 0.16734502 -0.1215118 -0.1318708 0.02302526
# preMean -0.12151178 1.1340273 0.5080011 0.44335717
# slope -0.13187077 0.5080011 0.4107214 0.16400967
# changePt 0.02302526 0.4433572 0.1640097 0.29987282
# [1] "standard error of the difference between pre-changepoint and post-changepoint means"
# [1] 1.242737
Here we did not specify an S-curve slope, asking fitS to estimate it for us. This models a gradual change.
We can plot the output:
This dataset came from a study that investigated the impact of Medicare, the US medical insurance program for retired people. One nomimally qualifies at age 65, though this can occur earlier or later. Here we consider 90-day mortality in relation to age.
# get Medicare data of David Card et al
con <- url('https://github.com/108michael/ms_thesis/raw/master/medicare.Rdata')
load(con)
z <- fitS_linear(medicare[,c(4,3)],1,2)
summary(z)
# Formula: y ~ big_linear_guy(b1, h1, s1, c, b2, h2, s2, x = x)
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# b1 0.60490 0.07135 8.478 9.87e-14 ***
# h1 0.78810 0.18587 4.240 4.59e-05 ***
# s1 9.59339 717.41168 0.013 0.98935
# c 65.99895 0.10660 619.109 < 2e-16 ***
# b2 -23.93643 4.40366 -5.436 3.19e-07 ***
# h2 -37.23297 12.69091 -2.934 0.00406 **
# s2 2.89020 1.76829 1.634 0.10495
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6417 on 113 degrees of freedom
#
# Number of iterations to convergence: 91
# Achieved convergence tolerance: 1.49e-08
As we would hope, mortality did decline after people became eligible for Medicare. For instance, the intercept went down from -23.9364322 to -37.2329709. Interestingly, the changepoint is close to 66, indicating that many people opted to start the program a little later.
An approximate 95% confidence interval for the location of the changepoint is 65.99895 ± 1.96 x 0.10660.
There are many R packages for determining changepoints. We will mentipn two here for comparison to changeS, the changepoints and mcp packages.
Types of changes
changeS package is the only one of the three to handle both abrupt and gradual changes;
the other two model only the abrupt case.
Changepoint location domain
changepoints assumes that the location of a changepoint is an integer
changeS and mcp assume the location is a general continuous number
Statistical basis
changepoints offers several different methods, but the general theme is to conduct a series of statistical hypothesis tests at various candiate integer locations
changeS uses nonlinear least-squares estimation
mcp takes the Bayesian estimation philosophy
Uncertainty analysis: changepoint locations and magnitudes
changepoints does not form confidence intervals for these quantities
changeS offers confidence intervals
mcp offers Bayesian credible intervals
An S-Curve Method for Abrupt and Gradual Changepoint Analysis, Lan Jiang, Collin Kennedy, Norman Matloff; SDSS 2023
Encyclopedia of Biostatistics, Change-point Problem Yudi Pawitan, 2005, https://doi.org/10.1002/0470011815.b2a12011