This vignette introduces the following functions from the PHEindicatormethods package and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail. The name of the method applied can optionally be output alongside the statistics and, for a more detailed explanation, please see the references section of the function documentation.
library(PHEindicatormethods)
library(dplyr)
This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.
Function | Type | Description |
---|---|---|
phe_proportion | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_rate | Non-aggregate | Performs a calculation on each row of data (unless data is grouped) |
phe_mean | Aggregate | Performs a calculation on each grouping set |
phe_dsr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_smr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
phe_isr | Aggregate, standardised | Performs a calculation on each grouping set and requires additional reference inputs |
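The following code chunk creates a data frame containing the observed number of events and the population for 4 geographical areas over 2 time periods. It is used later to demonstrate the PHEindicatormethods package functions. Because the values are drawn with sample() and no seed is set, rerunning the chunk will produce different numbers from those printed here: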
df <- data.frame(
area = rep(c("Area1","Area2","Area3","Area4"), 2),
year = rep(2015:2016, each = 4),
obs = sample(100, 2 * 4, replace = TRUE),
pop = sample(100:200, 2 * 4, replace = TRUE))
df
#> area year obs pop
#> 1 Area1 2015 22 118
#> 2 Area2 2015 54 123
#> 3 Area3 2015 66 183
#> 4 Area4 2015 89 117
#> 5 Area1 2016 62 199
#> 6 Area2 2016 69 113
#> 7 Area3 2016 100 145
#> 8 Area4 2016 22 195
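If reproducible test data is wanted, the random draws can be fixed with a seed. The following is a minimal sketch using base R only; the helper name make_df and the seed value 2024 are arbitrary choices made for this illustration and are not part of the package. The values produced will not match those printed above.

```r
# Hypothetical helper: builds the same shape of test data as above,
# but sets a seed first so that repeated calls give identical values.
make_df <- function(seed) {
  set.seed(seed)
  data.frame(
    area = rep(c("Area1", "Area2", "Area3", "Area4"), 2),
    year = rep(2015:2016, each = 4),
    obs  = sample(100, 2 * 4, replace = TRUE),
    pop  = sample(100:200, 2 * 4, replace = TRUE)
  )
}

df_a <- make_df(2024)  # 2024 is an arbitrary seed
df_b <- make_df(2024)

# TRUE when both calls produced exactly the same data frame
identical(df_a, df_b)
```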
INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.
OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified:
# default proportion
phe_proportion(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 22 118 0.1864407 0.12646991 0.2661835 95%
#> 2 Area2 2015 54 123 0.4390244 0.35448712 0.5272550 95%
#> 3 Area3 2015 66 183 0.3606557 0.29460770 0.4324336 95%
#> 4 Area4 2015 89 117 0.7606838 0.67587418 0.8289195 95%
#> 5 Area1 2016 62 199 0.3115578 0.25129248 0.3789606 95%
#> 6 Area2 2016 69 113 0.6106195 0.51849360 0.6954715 95%
#> 7 Area3 2016 100 145 0.6896552 0.61027614 0.7592446 95%
#> 8 Area4 2016 22 195 0.1128205 0.07569503 0.1649060 95%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 22 118 0.1864407 0.10079593 0.3190373 99.8%
#> 2 Area2 2015 54 123 0.4390244 0.31014063 0.5766941 99.8%
#> 3 Area3 2015 66 183 0.3606557 0.26040507 0.4747280 99.8%
#> 4 Area4 2015 89 117 0.7606838 0.62216724 0.8598574 99.8%
#> 5 Area1 2016 62 199 0.3115578 0.22070803 0.4196652 99.8%
#> 6 Area2 2016 69 113 0.6106195 0.46561136 0.7383878 99.8%
#> 7 Area3 2016 100 145 0.6896552 0.56234155 0.7935314 99.8%
#> 8 Area4 2016 22 195 0.1128205 0.06018842 0.2016041 99.8%
#> statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson
# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 22 118 18.64407 12.646991 26.61835 95% percentage
#> 2 Area2 2015 54 123 43.90244 35.448712 52.72550 95% percentage
#> 3 Area3 2015 66 183 36.06557 29.460770 43.24336 95% percentage
#> 4 Area4 2015 89 117 76.06838 67.587418 82.89195 95% percentage
#> 5 Area1 2016 62 199 31.15578 25.129248 37.89606 95% percentage
#> 6 Area2 2016 69 113 61.06195 51.849360 69.54715 95% percentage
#> 7 Area3 2016 100 145 68.96552 61.027614 75.92446 95% percentage
#> 8 Area4 2016 22 195 11.28205 7.569503 16.49060 95% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify both confidence level and multiplier for proportion
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 22 118 18.64407 10.079593 31.90373 99.8% percentage
#> 2 Area2 2015 54 123 43.90244 31.014063 57.66941 99.8% percentage
#> 3 Area3 2015 66 183 36.06557 26.040507 47.47280 99.8% percentage
#> 4 Area4 2015 89 117 76.06838 62.216724 85.98574 99.8% percentage
#> 5 Area1 2016 62 199 31.15578 22.070803 41.96652 99.8% percentage
#> 6 Area2 2016 69 113 61.06195 46.561136 73.83878 99.8% percentage
#> 7 Area3 2016 100 145 68.96552 56.234155 79.35314 99.8% percentage
#> 8 Area4 2016 22 195 11.28205 6.018842 20.16041 99.8% percentage
#> method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson
# specify confidence level and multiplier for proportion and remove metadata columns from output
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 22 118 18.64407 10.079593 31.90373
#> 2 Area2 2015 54 123 43.90244 31.014063 57.66941
#> 3 Area3 2015 66 183 36.06557 26.040507 47.47280
#> 4 Area4 2015 89 117 76.06838 62.216724 85.98574
#> 5 Area1 2016 62 199 31.15578 22.070803 41.96652
#> 6 Area2 2016 69 113 61.06195 46.561136 73.83878
#> 7 Area3 2016 100 145 68.96552 56.234155 79.35314
#> 8 Area4 2016 22 195 11.28205 6.018842 20.16041
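The method column in the output above names the Wilson method. As a rough cross-check, the limits for a single row can be reproduced with the standard Wilson score formula in base R. This is only a sketch under that assumption, not the package's own code; the function name wilson_ci is made up for this illustration and it takes the confidence level as a proportion only.

```r
# Hypothetical helper: Wilson score interval for x events out of n.
# confidence is a proportion here (e.g. 0.95), an assumption of this sketch.
wilson_ci <- function(x, n, confidence = 0.95) {
  z      <- qnorm(1 - (1 - confidence) / 2)  # two-sided critical value
  p      <- x / n                            # observed proportion
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(value = p, lowercl = centre - half, uppercl = centre + half)
}

# Area1, 2015 in the example data: 22 events in a population of 118
ci <- wilson_ci(22, 118)
ci
```

The limits agree with the lowercl and uppercl values printed for Area1 in 2015 in the default proportion example.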
# default rate
phe_rate(df, obs, pop)
#> area year obs pop value lowercl uppercl confidence
#> 1 Area1 2015 22 118 18644.07 11680.079 28228.63 95%
#> 2 Area2 2015 54 123 43902.44 32978.643 57284.27 95%
#> 3 Area3 2015 66 183 36065.57 27891.791 45884.96 95%
#> 4 Area4 2015 89 117 76068.38 61087.425 93609.70 95%
#> 5 Area1 2016 62 199 31155.78 23885.695 39940.97 95%
#> 6 Area2 2016 69 113 61061.95 47507.757 77278.94 95%
#> 7 Area3 2016 100 145 68965.52 56111.797 83881.34 95%
#> 8 Area4 2016 22 195 11282.05 7067.945 17081.94 95%
#> statistic method
#> 1 rate per 100000 Byars
#> 2 rate per 100000 Byars
#> 3 rate per 100000 Byars
#> 4 rate per 100000 Byars
#> 5 rate per 100000 Byars
#> 6 rate per 100000 Byars
#> 7 rate per 100000 Byars
#> 8 rate per 100000 Byars
# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl confidence statistic
#> 1 Area1 2015 22 118 18.64407 8.689823 34.52658 99.8% rate per 100
#> 2 Area2 2015 54 123 43.90244 27.707417 65.70466 99.8% rate per 100
#> 3 Area3 2015 66 183 36.06557 23.874210 52.01610 99.8% rate per 100
#> 4 Area4 2015 89 117 76.06838 53.546996 104.44933 99.8% rate per 100
#> 5 Area1 2016 62 199 31.15578 20.331935 45.43921 99.8% rate per 100
#> 6 Area2 2016 69 113 61.06195 40.820252 87.38821 99.8% rate per 100
#> 7 Area3 2016 100 145 68.96552 49.588753 93.06457 99.8% rate per 100
#> 8 Area4 2016 22 195 11.28205 5.258457 20.89301 99.8% rate per 100
#> method
#> 1 Byars
#> 2 Byars
#> 3 Byars
#> 4 Byars
#> 5 Byars
#> 6 Byars
#> 7 Byars
#> 8 Byars
# specify rate parameters and remove metadata columns from output
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#> area year obs pop value lowercl uppercl
#> 1 Area1 2015 22 118 18.64407 8.689823 34.52658
#> 2 Area2 2015 54 123 43.90244 27.707417 65.70466
#> 3 Area3 2015 66 183 36.06557 23.874210 52.01610
#> 4 Area4 2015 89 117 76.06838 53.546996 104.44933
#> 5 Area1 2016 62 199 31.15578 20.331935 45.43921
#> 6 Area2 2016 69 113 61.06195 40.820252 87.38821
#> 7 Area3 2016 100 145 68.96552 49.588753 93.06457
#> 8 Area4 2016 22 195 11.28205 5.258457 20.89301
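The rate output above names Byar's method. The limits for one row can be approximated in base R with the usual form of Byar's approximation applied to the observed count, then scaled by the population and the multiplier. This is a sketch under that assumption; the function name byars_ci is hypothetical and the function documentation remains the authority on exactly how and when the package applies the method.

```r
# Hypothetical helper: Byar's approximation to the confidence limits
# of an observed count x (confidence given as a proportion).
byars_ci <- function(x, confidence = 0.95) {
  z     <- qnorm(1 - (1 - confidence) / 2)
  lower <- x * (1 - 1 / (9 * x) - z / (3 * sqrt(x)))^3
  upper <- (x + 1) * (1 - 1 / (9 * (x + 1)) + z / (3 * sqrt(x + 1)))^3
  c(lowercl = lower, uppercl = upper)
}

# Area1, 2015 in the example data: 22 events, population 118,
# expressed per 100000 to match the default multiplier shown above
rate_ci <- byars_ci(22) / 118 * 100000
rate_ci
```

The result is close to the lowercl and uppercl values printed for Area1 in 2015 in the default rate example.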
These functions can also return aggregate data if the input data frames are grouped:
# default proportion - grouped
df %>%
group_by(year) %>%
phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 231 541 0.427 0.386 0.469 95% proportion of 1 Wilson
#> 2 2016 253 652 0.388 0.351 0.426 95% proportion of 1 Wilson
# default rate - grouped
df %>%
group_by(year) %>%
phe_rate(obs, pop)
#> # A tibble: 2 x 9
#> year obs pop value lowercl uppercl confidence statistic method
#> <int> <int> <int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2015 231 541 42699. 37369. 48575. 95% rate per 1000~ Byars
#> 2 2016 253 652 38804. 34169. 43892. 95% rate per 1000~ Byars
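The grouped output above shows the numerators and denominators summed within each group before the statistic is calculated. The totals can be checked in base R as follows. This is a sketch only: df_fixed is a hypothetical copy of the example data with the values typed in from the printed output, so that the result does not depend on the random draws.

```r
# Copy of the example data with the printed values entered by hand
df_fixed <- data.frame(
  area = rep(c("Area1", "Area2", "Area3", "Area4"), 2),
  year = rep(2015:2016, each = 4),
  obs  = c(22, 54, 66, 89, 62, 69, 100, 22),
  pop  = c(118, 123, 183, 117, 199, 113, 145, 195)
)

# Sum numerators and denominators within each year
totals <- aggregate(cbind(obs, pop) ~ year, data = df_fixed, FUN = sum)

# The grouped proportion is the ratio of the summed columns
totals$value <- totals$obs / totals$pop
totals
```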
The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.
The df test data generated earlier can be used to demonstrate phe_mean:
INPUT: The phe_mean function takes a single data frame as input with a column representing the numbers to be averaged.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified:
# default mean
phe_mean(df,obs)
#> value_sum value_count stdev value lowercl uppercl confidence
#> 1 484 8 27.98979 60.5 37.09995 83.90005 95%
#> statistic method
#> 1 mean Student's t-distribution
# multiple means in a single execution with 99.8% confidence (given here as 0.998 rather than 99.8)
df %>%
group_by(year) %>%
phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#> year value_sum value_count stdev value lowercl uppercl confidence
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2015 231 4 27.9 57.8 -84.8 200. 99.8%
#> 2 2016 253 4 32.1 63.2 -101. 227. 99.8%
#> # ... with 2 more variables: statistic <chr>, method <chr>
# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
group_by(year) %>%
phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#> year value_sum value_count stdev value lowercl uppercl
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 231 4 27.9 57.8 -84.8 200.
#> 2 2016 253 4 32.1 63.2 -101. 227.
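The ungrouped mean output above names Student's t-distribution as the method. The same figures can be reproduced in base R with the textbook t interval. This is a sketch assuming that formula; the obs values are typed in from the printed example data.

```r
# obs column of the example data, entered by hand
obs <- c(22, 54, 66, 89, 62, 69, 100, 22)

n <- length(obs)
m <- mean(obs)
s <- sd(obs)  # sample standard deviation

# Two-sided 95% critical value from the t-distribution with n - 1 df
t_crit <- qt(1 - (1 - 0.95) / 2, df = n - 1)

# Lower and upper confidence limits for the mean
ci <- m + c(-1, 1) * t_crit * s / sqrt(n)
c(value = m, stdev = s, lowercl = ci[1], uppercl = ci[2])
```

The same calculation explains the very wide 99.8% intervals in the grouped examples: each year has only four values, so the critical value is large and the limits extend below zero.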
The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:
df_std <- data.frame(
area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
year = rep(2006:2010, each = 19 * 2),
sex = rep(rep(c("Male", "Female"), each = 19), 5),
ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
50,55,60,65,70,75,80,85,90), times = 10),
obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#> area year sex ageband obs pop
#> 1 Area1 2006 Male 0 101 15670
#> 2 Area1 2006 Male 5 165 18259
#> 3 Area1 2006 Male 10 196 18783
#> 4 Area1 2006 Male 15 153 12733
#> 5 Area1 2006 Male 20 96 10731
#> 6 Area1 2006 Male 25 128 16682
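INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is broken down into the same standardisation categories as the default standard population and is sorted, within each grouping set, in the same order as that population.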
The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.
The function can also accept standard populations provided as a column within the input data frame. The requirements for each approach are:
Standard populations provided as a vector: the vector and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
Standard populations provided as a column within the input data frame: the standard populations can be appended to the input data frame by the user prior to execution of the function. If the data is grouped to generate multiple dsrs then the standard populations will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
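To illustrate why the sort order matters when the standard population is joined by position, the point estimate of a directly standardised rate can be sketched in base R. Everything here is an assumption made for illustration: std_pop holds the 2013 European Standard Population for 19 five-year age bands typed in by hand (the package's esp2013 vector is assumed to hold the same values), the toy data is invented, and confidence limits are not calculated.

```r
# 2013 European Standard Population, age bands 0-4 up to 90+,
# typed in so that the sketch runs without the package
std_pop <- c(5000, 5500, 5500, 5500, 6000, 6000, 6500, 7000, 7000, 7000,
             7000, 6500, 6000, 5500, 5000, 4000, 2500, 1500, 1000)

# Invented data for one grouping set, sorted youngest to oldest
toy <- data.frame(
  ageband = seq(0, 90, by = 5),
  obs     = 1:19,
  pop     = rep(1000, 19)
)

# Hypothetical helper: weight each age-specific rate by the standard
# population in the same position, then express per 100000
dsr_point <- function(obs, pop, stdpop, multiplier = 100000) {
  sum(stdpop * obs / pop) / sum(stdpop) * multiplier
}

# Correctly sorted: row i is paired with element i of std_pop
dsr_sorted <- dsr_point(toy$obs, toy$pop, std_pop)

# Same data sorted oldest to youngest: the weights are now misaligned
rev_toy <- toy[order(-toy$ageband), ]
dsr_reversed <- dsr_point(rev_toy$obs, rev_toy$pop, std_pop)

c(sorted = dsr_sorted, reversed = dsr_reversed)
```

The two results differ even though the underlying data is identical, which is why sorting is left as a user responsibility.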
Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified:
# calculate separate dsrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1609 297542 593. 563. 624. 95%
#> 2 Area1 2006 Male 2293 292439 823. 788. 859. 95%
#> 3 Area1 2007 Fema~ 2341 301123 799. 764. 834. 95%
#> 4 Area1 2007 Male 1687 295453 572. 543. 602. 95%
#> 5 Area1 2008 Fema~ 1322 290781 498. 469. 527. 95%
#> 6 Area1 2008 Male 1960 305175 670. 638. 703. 95%
#> 7 Area1 2009 Fema~ 1930 297974 624. 595. 654. 95%
#> 8 Area1 2009 Male 2013 288190 761. 726. 796. 95%
#> 9 Area1 2010 Fema~ 1910 295888 692. 659. 726. 95%
#> 10 Area1 2010 Male 1829 280043 654. 623. 687. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Female 1609 297542 593. 563. 624.
#> 2 Area1 2006 Male 2293 292439 823. 788. 859.
#> 3 Area1 2007 Female 2341 301123 799. 764. 834.
#> 4 Area1 2007 Male 1687 295453 572. 543. 602.
#> 5 Area1 2008 Female 1322 290781 498. 469. 527.
#> 6 Area1 2008 Male 1960 305175 670. 638. 703.
#> 7 Area1 2009 Female 1930 297974 624. 595. 654.
#> 8 Area1 2009 Male 2013 288190 761. 726. 796.
#> 9 Area1 2010 Female 1910 295888 692. 659. 726.
#> 10 Area1 2010 Male 1829 280043 654. 623. 687.
#> # ... with 30 more rows
# calculate same specifying standard population in vector form
df_std %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1609 297542 593. 563. 624. 95%
#> 2 Area1 2006 Male 2293 292439 823. 788. 859. 95%
#> 3 Area1 2007 Fema~ 2341 301123 799. 764. 834. 95%
#> 4 Area1 2007 Male 1687 295453 572. 543. 602. 95%
#> 5 Area1 2008 Fema~ 1322 290781 498. 469. 527. 95%
#> 6 Area1 2008 Male 1960 305175 670. 638. 703. 95%
#> 7 Area1 2009 Fema~ 1930 297974 624. 595. 654. 95%
#> 8 Area1 2009 Male 2013 288190 761. 726. 796. 95%
#> 9 Area1 2010 Fema~ 1910 295888 692. 659. 726. 95%
#> 10 Area1 2010 Male 1829 280043 654. 623. 687. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
mutate(refpop = rep(esp2013,40)) %>%
group_by(area, year, sex) %>%
phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1609 297542 593. 563. 624. 95%
#> 2 Area1 2006 Male 2293 292439 823. 788. 859. 95%
#> 3 Area1 2007 Fema~ 2341 301123 799. 764. 834. 95%
#> 4 Area1 2007 Male 1687 295453 572. 543. 602. 95%
#> 5 Area1 2008 Fema~ 1322 290781 498. 469. 527. 95%
#> 6 Area1 2008 Male 1960 305175 670. 638. 703. 95%
#> 7 Area1 2009 Fema~ 1930 297974 624. 595. 654. 95%
#> 8 Area1 2009 Male 2013 288190 761. 726. 796. 95%
#> 9 Area1 2010 Fema~ 1910 295888 692. 659. 726. 95%
#> 10 Area1 2010 Male 1829 280043 654. 623. 687. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
filter(ageband <= 70) %>%
group_by(area, year, sex) %>%
phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1414 237103 618. 585. 651. 95%
#> 2 Area1 2006 Male 2038 237815 858. 821. 897. 95%
#> 3 Area1 2007 Fema~ 1833 233436 803. 766. 841. 95%
#> 4 Area1 2007 Male 1319 231741 564. 534. 596. 95%
#> 5 Area1 2008 Fema~ 1013 220431 493. 463. 526. 95%
#> 6 Area1 2008 Male 1480 240874 668. 633. 704. 95%
#> 7 Area1 2009 Fema~ 1421 238969 579. 548. 610. 95%
#> 8 Area1 2009 Male 1828 235334 804. 767. 843. 95%
#> 9 Area1 2010 Fema~ 1513 229003 694. 659. 731. 95%
#> 10 Area1 2010 Male 1297 224371 586. 554. 619. 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate dsrs for persons for each area and year
df_std %>%
group_by(area, year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(area, year) %>%
phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups: area [4]
#> area year total_count total_pop value lowercl uppercl confidence
#> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 3902 589981 700. 677. 723. 95%
#> 2 Area1 2007 4028 596576 689. 667. 712. 95%
#> 3 Area1 2008 3282 595956 575. 554. 596. 95%
#> 4 Area1 2009 3943 586164 685. 663. 708. 95%
#> 5 Area1 2010 3739 575931 646. 624. 668. 95%
#> 6 Area2 2006 3753 565383 676. 653. 700. 95%
#> 7 Area2 2007 3901 600519 658. 637. 681. 95%
#> 8 Area2 2008 3424 598133 561. 541. 581. 95%
#> 9 Area2 2009 3993 563161 718. 695. 742. 95%
#> 10 Area2 2010 3944 585114 739. 715. 764. 95%
#> 11 Area3 2006 4178 563455 788. 763. 813. 95%
#> 12 Area3 2007 3484 574973 631. 609. 653. 95%
#> 13 Area3 2008 3682 566841 666. 644. 690. 95%
#> 14 Area3 2009 3558 547466 649. 626. 671. 95%
#> 15 Area3 2010 3601 534559 762. 737. 789. 95%
#> 16 Area4 2006 3525 580083 583. 562. 604. 95%
#> 17 Area4 2007 3813 564195 693. 670. 717. 95%
#> 18 Area4 2008 4039 566938 727. 703. 751. 95%
#> 19 Area4 2009 3943 547596 696. 673. 720. 95%
#> 20 Area4 2010 4234 557976 794. 769. 819. 95%
#> # ... with 2 more variables: statistic <chr>, method <chr>
INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.
The reference data can either be provided in a separate data frame/vectors or as columns within the input data frame:
Reference data provided as a data frame or as vectors: the data frame/vectors and the input data frame must both contain rows for the same standardisation categories, and both must be sorted, within each grouping set, by these standardisation categories in the same order.
Reference data provided as columns within the input data frame: the reference numerators and denominators can be appended to the input data frame prior to execution of the function. If the data is grouped to generate multiple smrs/isrs then the reference data will need to be repeated and appended to the data rows for every grouping set.
OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.
OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this using the refpoptype argument, as the functions expect vectors by default. The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:
df_ref <- df_std %>%
filter(year == 2006) %>%
group_by(ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop))
head(df_ref)
#> # A tibble: 6 x 3
#> ageband obs pop
#> <dbl> <int> <int>
#> 1 0 801 128271
#> 2 5 979 126392
#> 3 10 1014 129367
#> 4 15 685 109258
#> 5 20 734 106391
#> 6 25 584 126046
Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified:
# calculate separate smrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1609 1997. 0.806 0.767 0.846 95%
#> 2 Area1 2006 Male 2293 1962. 1.17 1.12 1.22 95%
#> 3 Area1 2007 Fema~ 2341 2024. 1.16 1.11 1.20 95%
#> 4 Area1 2007 Male 1687 1995. 0.846 0.806 0.887 95%
#> 5 Area1 2008 Fema~ 1322 1932. 0.684 0.648 0.722 95%
#> 6 Area1 2008 Male 1960 2039. 0.961 0.919 1.00 95%
#> 7 Area1 2009 Fema~ 1930 1981. 0.974 0.931 1.02 95%
#> 8 Area1 2009 Male 2013 1923. 1.05 1.00 1.09 95%
#> 9 Area1 2010 Fema~ 1910 1979. 0.965 0.922 1.01 95%
#> 10 Area1 2010 Male 1829 1843. 0.992 0.947 1.04 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate the same smrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups: area, year [20]
#> area year sex observed expected value lowercl uppercl confidence
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Area1 2006 Fema~ 1609 1997. 0.806 0.767 0.846 95%
#> 2 Area1 2006 Male 2293 1962. 1.17 1.12 1.22 95%
#> 3 Area1 2007 Fema~ 2341 2024. 1.16 1.11 1.20 95%
#> 4 Area1 2007 Male 1687 1995. 0.846 0.806 0.887 95%
#> 5 Area1 2008 Fema~ 1322 1932. 0.684 0.648 0.722 95%
#> 6 Area1 2008 Male 1960 2039. 0.961 0.919 1.00 95%
#> 7 Area1 2009 Fema~ 1930 1981. 0.974 0.931 1.02 95%
#> 8 Area1 2009 Male 2013 1923. 1.05 1.00 1.09 95%
#> 9 Area1 2010 Fema~ 1910 1979. 0.965 0.922 1.01 95%
#> 10 Area1 2010 Male 1829 1843. 0.992 0.947 1.04 95%
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> # method <chr>
# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#> year observed expected value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15358 15358 1 0.984 1.02
#> 2 2007 15226 15638. 0.974 0.958 0.989
#> 3 2008 14427 15607. 0.924 0.909 0.940
#> 4 2009 15437 15008. 1.03 1.01 1.04
#> 5 2010 15518 15010. 1.03 1.02 1.05
The phe_isr function works in exactly the same way, but instead of expressing the result as a ratio of the observed to the expected count, the result is expressed as a rate and the reference rate is also provided. Here are some examples:
# calculate separate isrs for each area, year and sex
df_std %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1609 1997. 668. 538. 512. 565.
#> 2 Area1 2006 Male 2293 1962. 668. 781. 749. 813.
#> 3 Area1 2007 Fema~ 2341 2024. 668. 773. 742. 805.
#> 4 Area1 2007 Male 1687 1995. 668. 565. 538. 593.
#> 5 Area1 2008 Fema~ 1322 1932. 668. 457. 433. 482.
#> 6 Area1 2008 Male 1960 2039. 668. 642. 614. 671.
#> 7 Area1 2009 Fema~ 1930 1981. 668. 651. 622. 680.
#> 8 Area1 2009 Male 2013 1923. 668. 699. 669. 730.
#> 9 Area1 2010 Fema~ 1910 1979. 668. 645. 616. 674.
#> 10 Area1 2010 Male 1829 1843. 668. 663. 633. 694.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate the same isrs by appending the reference data to the data frame
df_std %>%
mutate(refobs = rep(df_ref$obs,40),
refpop = rep(df_ref$pop,40)) %>%
group_by(area, year, sex) %>%
phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups: area, year [20]
#> area year sex observed expected ref_rate value lowercl uppercl
#> <fct> <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Area1 2006 Fema~ 1609 1997. 668. 538. 512. 565.
#> 2 Area1 2006 Male 2293 1962. 668. 781. 749. 813.
#> 3 Area1 2007 Fema~ 2341 2024. 668. 773. 742. 805.
#> 4 Area1 2007 Male 1687 1995. 668. 565. 538. 593.
#> 5 Area1 2008 Fema~ 1322 1932. 668. 457. 433. 482.
#> 6 Area1 2008 Male 1960 2039. 668. 642. 614. 671.
#> 7 Area1 2009 Fema~ 1930 1981. 668. 651. 622. 680.
#> 8 Area1 2009 Male 2013 1923. 668. 699. 669. 730.
#> 9 Area1 2010 Fema~ 1910 1979. 668. 645. 616. 674.
#> 10 Area1 2010 Male 1829 1843. 668. 663. 633. 694.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
group_by(year, ageband) %>%
summarise(obs = sum(obs),
pop = sum(pop)) %>%
group_by(year) %>%
phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#> year observed expected ref_rate value lowercl uppercl
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2006 15358 15358 668. 668. 658. 679.
#> 2 2007 15226 15638. 668. 650. 640. 661.
#> 3 2008 14427 15607. 668. 618. 608. 628.
#> 4 2009 15437 15008. 668. 687. 676. 698.
#> 5 2010 15518 15010. 668. 691. 680. 702.
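The relationship between the smr and isr outputs can be sketched in base R using the usual indirect standardisation formulas: the expected count applies the reference rates to the local populations, the smr is observed divided by expected, and the isr is the smr multiplied by the overall reference rate. The data below is invented for illustration, with only three standardisation categories, and confidence limits are not calculated.

```r
# Invented local data for one grouping set (3 age bands, same order as reference)
obs <- c(5, 10, 20)
pop <- c(1000, 1000, 1000)

# Invented reference data for the same 3 age bands
refobs <- c(10, 20, 30)
refpop <- c(2000, 2000, 2000)

# Expected count: reference age-specific rates applied to local populations
expected <- sum(pop * refobs / refpop)
observed <- sum(obs)

# Standardised ratio of observed to expected
smr <- observed / expected

# Overall reference rate and indirectly standardised rate, per 100000
ref_rate <- sum(refobs) / sum(refpop) * 100000
isr <- smr * ref_rate

c(observed = observed, expected = expected,
  smr = smr, ref_rate = ref_rate, isr = isr)
```

The printed outputs above are consistent with this relationship: for example, in the first row of the area, year and sex results, an smr of 0.806 multiplied by a reference rate of about 668 gives approximately the isr value of 538.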