Introduction to PHEindicatormethods

Georgina Anderson

2019-04-18

Introduction

This vignette introduces the functions from the PHEindicatormethods package that are listed below and provides basic sample code to demonstrate their execution. The code included is based on the code provided within the ‘examples’ section of the function documentation. This vignette does not explain the methods applied in detail, but the method names can optionally be output alongside the statistics. For a more detailed explanation, please see the references section of the function documentation.

The following packages must be installed, if not already available, and loaded:

library(PHEindicatormethods)
library(dplyr)

Package functions

This vignette covers the following functions available within the first release of the package (v1.0.8) but has been updated to apply to these functions in their latest release versions. If further functions are added to the package in future releases these will be explained elsewhere.

Function        Type                     Description
phe_proportion  Non-aggregate            Performs a calculation on each row of data (unless data is grouped)
phe_rate        Non-aggregate            Performs a calculation on each row of data (unless data is grouped)
phe_mean        Aggregate                Performs a calculation on each grouping set
phe_dsr         Aggregate, standardised  Performs a calculation on each grouping set and requires additional reference inputs
phe_smr         Aggregate, standardised  Performs a calculation on each grouping set and requires additional reference inputs
phe_isr         Aggregate, standardised  Performs a calculation on each grouping set and requires additional reference inputs

Non-aggregate functions

Create some test data for the non-aggregate functions

The following code chunk creates a data frame containing observed number of events and populations for 4 geographical areas over 2 time periods that is used later to demonstrate the PHEindicatormethods package functions:

df <- data.frame(
        area = rep(c("Area1","Area2","Area3","Area4"), 2),
        year = rep(2015:2016, each = 4),
        obs = sample(100, 2 * 4, replace = TRUE),
        pop = sample(100:200, 2 * 4, replace = TRUE))
df
#>    area year obs pop
#> 1 Area1 2015  22 118
#> 2 Area2 2015  54 123
#> 3 Area3 2015  66 183
#> 4 Area4 2015  89 117
#> 5 Area1 2016  62 199
#> 6 Area2 2016  69 113
#> 7 Area3 2016 100 145
#> 8 Area4 2016  22 195

Execute phe_proportion and phe_rate

INPUT: The phe_proportion and phe_rate functions take a single data frame as input with columns representing the numerators and denominators for the statistic. Any other columns present will be retained in the output.

OUTPUT: The functions output the original data frame with additional columns appended. By default the additional columns are the proportion or rate, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
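
The outputs in this vignette report "Wilson" as the method used for proportions. As a rough illustration of what that interval involves, the following base R sketch applies the standard Wilson score formula to the first row of the test data printed above (22 events in a population of 118). The helper name wilson_interval is hypothetical and is not part of the package; this is not the package's internal code.

```r
# Sketch only: Wilson score interval for a single proportion, base R.
# wilson_interval() is a hypothetical helper, not a PHEindicatormethods function.
# x = number of events, n = population, confidence expressed as a proportion.
wilson_interval <- function(x, n, confidence = 0.95) {
  z <- qnorm(1 - (1 - confidence) / 2)        # two-sided normal quantile
  p <- x / n                                  # observed proportion
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  halfwidth <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(value = p, lowercl = centre - halfwidth, uppercl = centre + halfwidth)
}

# First row of the test data printed above: obs = 22, pop = 118
res <- wilson_interval(22, 118)
print(round(res, 7))
```

The limits returned should agree with the lowercl and uppercl values shown for Area1 in 2015 in the default phe_proportion output in this vignette.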

Here are some example code chunks to demonstrate these two functions and the arguments that can optionally be specified:

# default proportion
phe_proportion(df, obs, pop)
#>    area year obs pop     value    lowercl   uppercl confidence
#> 1 Area1 2015  22 118 0.1864407 0.12646991 0.2661835        95%
#> 2 Area2 2015  54 123 0.4390244 0.35448712 0.5272550        95%
#> 3 Area3 2015  66 183 0.3606557 0.29460770 0.4324336        95%
#> 4 Area4 2015  89 117 0.7606838 0.67587418 0.8289195        95%
#> 5 Area1 2016  62 199 0.3115578 0.25129248 0.3789606        95%
#> 6 Area2 2016  69 113 0.6106195 0.51849360 0.6954715        95%
#> 7 Area3 2016 100 145 0.6896552 0.61027614 0.7592446        95%
#> 8 Area4 2016  22 195 0.1128205 0.07569503 0.1649060        95%
#>         statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson

# specify confidence level for proportion
phe_proportion(df, obs, pop, confidence=99.8)
#>    area year obs pop     value    lowercl   uppercl confidence
#> 1 Area1 2015  22 118 0.1864407 0.10079593 0.3190373      99.8%
#> 2 Area2 2015  54 123 0.4390244 0.31014063 0.5766941      99.8%
#> 3 Area3 2015  66 183 0.3606557 0.26040507 0.4747280      99.8%
#> 4 Area4 2015  89 117 0.7606838 0.62216724 0.8598574      99.8%
#> 5 Area1 2016  62 199 0.3115578 0.22070803 0.4196652      99.8%
#> 6 Area2 2016  69 113 0.6106195 0.46561136 0.7383878      99.8%
#> 7 Area3 2016 100 145 0.6896552 0.56234155 0.7935314      99.8%
#> 8 Area4 2016  22 195 0.1128205 0.06018842 0.2016041      99.8%
#>         statistic method
#> 1 proportion of 1 Wilson
#> 2 proportion of 1 Wilson
#> 3 proportion of 1 Wilson
#> 4 proportion of 1 Wilson
#> 5 proportion of 1 Wilson
#> 6 proportion of 1 Wilson
#> 7 proportion of 1 Wilson
#> 8 proportion of 1 Wilson

# specify to output proportions as percentages
phe_proportion(df, obs, pop, multiplier=100)
#>    area year obs pop    value   lowercl  uppercl confidence  statistic
#> 1 Area1 2015  22 118 18.64407 12.646991 26.61835        95% percentage
#> 2 Area2 2015  54 123 43.90244 35.448712 52.72550        95% percentage
#> 3 Area3 2015  66 183 36.06557 29.460770 43.24336        95% percentage
#> 4 Area4 2015  89 117 76.06838 67.587418 82.89195        95% percentage
#> 5 Area1 2016  62 199 31.15578 25.129248 37.89606        95% percentage
#> 6 Area2 2016  69 113 61.06195 51.849360 69.54715        95% percentage
#> 7 Area3 2016 100 145 68.96552 61.027614 75.92446        95% percentage
#> 8 Area4 2016  22 195 11.28205  7.569503 16.49060        95% percentage
#>   method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson

# specify confidence level and output proportions as percentages
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100)
#>    area year obs pop    value   lowercl  uppercl confidence  statistic
#> 1 Area1 2015  22 118 18.64407 10.079593 31.90373      99.8% percentage
#> 2 Area2 2015  54 123 43.90244 31.014063 57.66941      99.8% percentage
#> 3 Area3 2015  66 183 36.06557 26.040507 47.47280      99.8% percentage
#> 4 Area4 2015  89 117 76.06838 62.216724 85.98574      99.8% percentage
#> 5 Area1 2016  62 199 31.15578 22.070803 41.96652      99.8% percentage
#> 6 Area2 2016  69 113 61.06195 46.561136 73.83878      99.8% percentage
#> 7 Area3 2016 100 145 68.96552 56.234155 79.35314      99.8% percentage
#> 8 Area4 2016  22 195 11.28205  6.018842 20.16041      99.8% percentage
#>   method
#> 1 Wilson
#> 2 Wilson
#> 3 Wilson
#> 4 Wilson
#> 5 Wilson
#> 6 Wilson
#> 7 Wilson
#> 8 Wilson

# specify confidence level and multiplier, and remove metadata columns from output
phe_proportion(df, obs, pop, confidence=99.8, multiplier=100, type="standard")
#>    area year obs pop    value   lowercl  uppercl
#> 1 Area1 2015  22 118 18.64407 10.079593 31.90373
#> 2 Area2 2015  54 123 43.90244 31.014063 57.66941
#> 3 Area3 2015  66 183 36.06557 26.040507 47.47280
#> 4 Area4 2015  89 117 76.06838 62.216724 85.98574
#> 5 Area1 2016  62 199 31.15578 22.070803 41.96652
#> 6 Area2 2016  69 113 61.06195 46.561136 73.83878
#> 7 Area3 2016 100 145 68.96552 56.234155 79.35314
#> 8 Area4 2016  22 195 11.28205  6.018842 20.16041

# default rate
phe_rate(df, obs, pop)
#>    area year obs pop    value   lowercl  uppercl confidence
#> 1 Area1 2015  22 118 18644.07 11680.079 28228.63        95%
#> 2 Area2 2015  54 123 43902.44 32978.643 57284.27        95%
#> 3 Area3 2015  66 183 36065.57 27891.791 45884.96        95%
#> 4 Area4 2015  89 117 76068.38 61087.425 93609.70        95%
#> 5 Area1 2016  62 199 31155.78 23885.695 39940.97        95%
#> 6 Area2 2016  69 113 61061.95 47507.757 77278.94        95%
#> 7 Area3 2016 100 145 68965.52 56111.797 83881.34        95%
#> 8 Area4 2016  22 195 11282.05  7067.945 17081.94        95%
#>         statistic method
#> 1 rate per 100000  Byars
#> 2 rate per 100000  Byars
#> 3 rate per 100000  Byars
#> 4 rate per 100000  Byars
#> 5 rate per 100000  Byars
#> 6 rate per 100000  Byars
#> 7 rate per 100000  Byars
#> 8 rate per 100000  Byars

# specify rate parameters
phe_rate(df, obs, pop, confidence=99.8, multiplier=100)
#>    area year obs pop    value   lowercl   uppercl confidence    statistic
#> 1 Area1 2015  22 118 18.64407  8.689823  34.52658      99.8% rate per 100
#> 2 Area2 2015  54 123 43.90244 27.707417  65.70466      99.8% rate per 100
#> 3 Area3 2015  66 183 36.06557 23.874210  52.01610      99.8% rate per 100
#> 4 Area4 2015  89 117 76.06838 53.546996 104.44933      99.8% rate per 100
#> 5 Area1 2016  62 199 31.15578 20.331935  45.43921      99.8% rate per 100
#> 6 Area2 2016  69 113 61.06195 40.820252  87.38821      99.8% rate per 100
#> 7 Area3 2016 100 145 68.96552 49.588753  93.06457      99.8% rate per 100
#> 8 Area4 2016  22 195 11.28205  5.258457  20.89301      99.8% rate per 100
#>   method
#> 1  Byars
#> 2  Byars
#> 3  Byars
#> 4  Byars
#> 5  Byars
#> 6  Byars
#> 7  Byars
#> 8  Byars

# specify rate parameters and remove metadata columns from output
phe_rate(df, obs, pop, type="standard", confidence=99.8, multiplier=100)
#>    area year obs pop    value   lowercl   uppercl
#> 1 Area1 2015  22 118 18.64407  8.689823  34.52658
#> 2 Area2 2015  54 123 43.90244 27.707417  65.70466
#> 3 Area3 2015  66 183 36.06557 23.874210  52.01610
#> 4 Area4 2015  89 117 76.06838 53.546996 104.44933
#> 5 Area1 2016  62 199 31.15578 20.331935  45.43921
#> 6 Area2 2016  69 113 61.06195 40.820252  87.38821
#> 7 Area3 2016 100 145 68.96552 49.588753  93.06457
#> 8 Area4 2016  22 195 11.28205  5.258457  20.89301

These functions can also return aggregate data if the input data frames are grouped:

# default proportion - grouped
df %>%
  group_by(year) %>%
  phe_proportion(obs, pop)
#> # A tibble: 2 x 9
#>    year   obs   pop value lowercl uppercl confidence statistic       method
#>   <int> <int> <int> <dbl>   <dbl>   <dbl> <chr>      <chr>           <chr> 
#> 1  2015   231   541 0.427   0.386   0.469 95%        proportion of 1 Wilson
#> 2  2016   253   652 0.388   0.351   0.426 95%        proportion of 1 Wilson

# default rate - grouped
df %>%
  group_by(year) %>%
  phe_rate(obs, pop)
#> # A tibble: 2 x 9
#>    year   obs   pop  value lowercl uppercl confidence statistic      method
#>   <int> <int> <int>  <dbl>   <dbl>   <dbl> <chr>      <chr>          <chr> 
#> 1  2015   231   541 42699.  37369.  48575. 95%        rate per 1000~ Byars 
#> 2  2016   253   652 38804.  34169.  43892. 95%        rate per 1000~ Byars
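
The grouped rate output above can be approximated by hand: sum the numerators and denominators within each group, then apply Byar's approximation to the totals. The base R sketch below does this for the test data printed earlier. The helper name byars_limits is hypothetical, and the sketch applies Byar's approximation to every count, which is an assumption; the package may treat small counts differently.

```r
# Sketch only: grouped rates with Byar's approximate confidence limits, base R.
# byars_limits() is a hypothetical helper, not a PHEindicatormethods function.
df_demo <- data.frame(
  year = rep(2015:2016, each = 4),
  obs  = c(22, 54, 66, 89, 62, 69, 100, 22),       # values printed earlier in this vignette
  pop  = c(118, 123, 183, 117, 199, 113, 145, 195)
)

# Sum numerators and denominators within each year (the grouping step)
totals <- aggregate(cbind(obs, pop) ~ year, data = df_demo, FUN = sum)

# Byar's approximation to the Poisson limits for an observed count o
byars_limits <- function(o, confidence = 0.95) {
  z <- qnorm(1 - (1 - confidence) / 2)
  lower <- o * (1 - 1 / (9 * o) - z / (3 * sqrt(o)))^3
  upper <- (o + 1) * (1 - 1 / (9 * (o + 1)) + z / (3 * sqrt(o + 1)))^3
  list(lower = lower, upper = upper)
}

multiplier <- 100000
lims <- byars_limits(totals$obs)
totals$value   <- totals$obs  / totals$pop * multiplier
totals$lowercl <- lims$lower  / totals$pop * multiplier
totals$uppercl <- lims$upper  / totals$pop * multiplier
print(totals)
```

The yearly totals and limits should be close to the grouped phe_rate output shown above, which is printed to fewer significant figures.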



Aggregate functions

The remaining functions aggregate the rows in the input data frame to produce a single statistic. It is also possible to calculate multiple statistics in a single execution of these functions if the input data frame is grouped - for example by indicator ID, geographic area or time period (or all three). The output contains only the grouping variables and the values calculated by the function - any additional unused columns provided in the input data frame will not be retained in the output.

The df test data generated earlier can be used to demonstrate phe_mean:

Execute phe_mean

INPUT: The phe_mean function takes a single data frame as input with a column representing the numbers to be averaged.

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values (if applicable), the mean, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: The function also accepts additional arguments to specify the level of confidence and a reduced level of detail to be output.
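
The outputs in this vignette report "Student's t-distribution" as the method for means. The base R sketch below shows the textbook form of that interval, applied to the obs column of the test data printed earlier. The helper name t_interval is hypothetical and the code is an illustration under that assumption, not the package's implementation.

```r
# Sketch only: mean with a Student's t confidence interval, base R.
# t_interval() is a hypothetical helper, not a PHEindicatormethods function.
obs_demo <- c(22, 54, 66, 89, 62, 69, 100, 22)   # obs values printed earlier in this vignette

t_interval <- function(x, confidence = 0.95) {
  n  <- length(x)
  m  <- mean(x)
  se <- sd(x) / sqrt(n)                               # standard error of the mean
  tcrit <- qt(1 - (1 - confidence) / 2, df = n - 1)   # two-sided t quantile
  c(value_sum = sum(x), value_count = n, stdev = sd(x),
    value = m, lowercl = m - tcrit * se, uppercl = m + tcrit * se)
}

ci <- t_interval(obs_demo)
print(round(ci, 5))
```

With only a handful of values per group and a high confidence level, the t quantile becomes very large, which is why the grouped 99.8% intervals later in this section are so wide.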

Here are some example code chunks to demonstrate the phe_mean function and the arguments that can optionally be specified:

# default mean
phe_mean(df,obs)
#>   value_sum value_count    stdev value  lowercl  uppercl confidence
#> 1       484           8 27.98979  60.5 37.09995 83.90005        95%
#>   statistic                   method
#> 1      mean Student's t-distribution

# multiple means in a single execution with 99.8% confidence
df %>%
    group_by(year) %>%
        phe_mean(obs, confidence=0.998)
#> # A tibble: 2 x 10
#>    year value_sum value_count stdev value lowercl uppercl confidence
#>   <int>     <int>       <int> <dbl> <dbl>   <dbl>   <dbl> <chr>     
#> 1  2015       231           4  27.9  57.8   -84.8    200. 99.8%     
#> 2  2016       253           4  32.1  63.2  -101.     227. 99.8%     
#> # ... with 2 more variables: statistic <chr>, method <chr>

# multiple means in a single execution with 99.8% confidence and data-only output
df %>%
    group_by(year) %>%
        phe_mean(obs, type = "standard", confidence=0.998)
#> # A tibble: 2 x 7
#>    year value_sum value_count stdev value lowercl uppercl
#>   <int>     <int>       <int> <dbl> <dbl>   <dbl>   <dbl>
#> 1  2015       231           4  27.9  57.8   -84.8    200.
#> 2  2016       253           4  32.1  63.2  -101.     227.

Standardised Aggregate functions

Create some test data for the standardised aggregate functions

The following code chunk creates a data frame containing observed number of events and populations by age band for 4 areas, 5 time periods and 2 sexes:

df_std <- data.frame(
            area = rep(c("Area1", "Area2", "Area3", "Area4"), each = 19 * 2 * 5),
            year = rep(2006:2010, each = 19 * 2),
            sex = rep(rep(c("Male", "Female"), each = 19), 5),
            ageband = rep(c(0, 5,10,15,20,25,30,35,40,45,
                           50,55,60,65,70,75,80,85,90), times = 10),
            obs = sample(200, 19 * 2 * 5 * 4, replace = TRUE),
            pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(df_std)
#>    area year  sex ageband obs   pop
#> 1 Area1 2006 Male       0 101 15670
#> 2 Area1 2006 Male       5 165 18259
#> 3 Area1 2006 Male      10 196 18783
#> 4 Area1 2006 Male      15 153 12733
#> 5 Area1 2006 Male      20  96 10731
#> 6 Area1 2006 Male      25 128 16682

Execute phe_dsr

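INPUT: The minimum input requirement for the phe_dsr function is a single data frame with columns representing the numerators and denominators for each standardisation category. This is sufficient if the data is broken down by the same age bands as the default standard population and is sorted so that the age bands appear in the same order within each grouping set.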

The 2013 European Standard Population is provided within the package in vector form (esp2013) and is used by default by this function. Alternative standard populations can be used but must be provided by the user. When the function joins a standard population vector to the input data frame it does this by position so it is important that the data is sorted accordingly. This is a user responsibility.

The function can also accept standard populations provided as a column within the input data frame.

OUTPUT: By default, the function outputs one row per grouping set containing the grouping variable values, the total count, the total population, the dsr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: If standard populations are being provided as a column within the input data frame then the user must specify this using the stdpoptype argument as the function expects a vector by default. The function also accepts additional arguments to specify the standard populations, the level of confidence, the multiplier and a reduced level of detail to be output.
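
Because a standard population vector is matched to the input data by position, the row order within each grouping set changes the result. The base R sketch below illustrates this with a toy example of three age bands and made-up standard population weights. All names and numbers here (toy, toy_stdpop, dsr_point) are hypothetical; the sketch computes only the point estimate of a directly standardised rate, not the confidence limits.

```r
# Sketch only: directly standardised rate (point estimate) with positional matching.
# All data and names are hypothetical; toy_stdpop is NOT the esp2013 vector.
toy <- data.frame(
  ageband = c(0, 5, 10),
  obs     = c(10, 20, 30),
  pop     = c(1000, 2000, 1500)
)
toy_stdpop <- c(5000, 3000, 2000)   # made-up weights, one per age band, in ascending age order

# Weighted average of the age-specific rates, weights taken by position
dsr_point <- function(obs, pop, stdpop, multiplier = 100000) {
  sum(stdpop * obs / pop) / sum(stdpop) * multiplier
}

# Data sorted in the same order as the standard population
dsr_sorted <- dsr_point(toy$obs, toy$pop, toy_stdpop)

# Same data in descending age order: weights now attach to the wrong age bands
shuffled <- toy[order(-toy$ageband), ]
dsr_shuffled <- dsr_point(shuffled$obs, shuffled$pop, toy_stdpop)

# Re-sorting by age band before the calculation restores the correct value
resorted <- shuffled[order(shuffled$ageband), ]
dsr_resorted <- dsr_point(resorted$obs, resorted$pop, toy_stdpop)

print(c(sorted = dsr_sorted, shuffled = dsr_shuffled, resorted = dsr_resorted))
```

No error is raised in the shuffled case, so sorting the data consistently before applying a positional standard population is worth doing explicitly.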

Here are some example code chunks to demonstrate the phe_dsr function and the arguments that can optionally be specified:

# calculate separate dsrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        1609    297542  593.    563.    624. 95%       
#>  2 Area1  2006 Male         2293    292439  823.    788.    859. 95%       
#>  3 Area1  2007 Fema~        2341    301123  799.    764.    834. 95%       
#>  4 Area1  2007 Male         1687    295453  572.    543.    602. 95%       
#>  5 Area1  2008 Fema~        1322    290781  498.    469.    527. 95%       
#>  6 Area1  2008 Male         1960    305175  670.    638.    703. 95%       
#>  7 Area1  2009 Fema~        1930    297974  624.    595.    654. 95%       
#>  8 Area1  2009 Male         2013    288190  761.    726.    796. 95%       
#>  9 Area1  2010 Fema~        1910    295888  692.    659.    726. 95%       
#> 10 Area1  2010 Male         1829    280043  654.    623.    687. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate separate dsrs for each area, year and sex and drop metadata fields from output
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, type="standard")
#> # A tibble: 40 x 8
#> # Groups:   area, year [20]
#>    area   year sex    total_count total_pop value lowercl uppercl
#>    <fct> <int> <fct>        <int>     <int> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Female        1609    297542  593.    563.    624.
#>  2 Area1  2006 Male          2293    292439  823.    788.    859.
#>  3 Area1  2007 Female        2341    301123  799.    764.    834.
#>  4 Area1  2007 Male          1687    295453  572.    543.    602.
#>  5 Area1  2008 Female        1322    290781  498.    469.    527.
#>  6 Area1  2008 Male          1960    305175  670.    638.    703.
#>  7 Area1  2009 Female        1930    297974  624.    595.    654.
#>  8 Area1  2009 Male          2013    288190  761.    726.    796.
#>  9 Area1  2010 Female        1910    295888  692.    659.    726.
#> 10 Area1  2010 Male          1829    280043  654.    623.    687.
#> # ... with 30 more rows

# calculate same specifying standard population in vector form
df_std %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        1609    297542  593.    563.    624. 95%       
#>  2 Area1  2006 Male         2293    292439  823.    788.    859. 95%       
#>  3 Area1  2007 Fema~        2341    301123  799.    764.    834. 95%       
#>  4 Area1  2007 Male         1687    295453  572.    543.    602. 95%       
#>  5 Area1  2008 Fema~        1322    290781  498.    469.    527. 95%       
#>  6 Area1  2008 Male         1960    305175  670.    638.    703. 95%       
#>  7 Area1  2009 Fema~        1930    297974  624.    595.    654. 95%       
#>  8 Area1  2009 Male         2013    288190  761.    726.    796. 95%       
#>  9 Area1  2010 Fema~        1910    295888  692.    659.    726. 95%       
#> 10 Area1  2010 Male         1829    280043  654.    623.    687. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate the same dsrs by appending the standard populations to the data frame
df_std %>%
    mutate(refpop = rep(esp2013,40)) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs,pop, stdpop=refpop, stdpoptype="field")
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        1609    297542  593.    563.    624. 95%       
#>  2 Area1  2006 Male         2293    292439  823.    788.    859. 95%       
#>  3 Area1  2007 Fema~        2341    301123  799.    764.    834. 95%       
#>  4 Area1  2007 Male         1687    295453  572.    543.    602. 95%       
#>  5 Area1  2008 Fema~        1322    290781  498.    469.    527. 95%       
#>  6 Area1  2008 Male         1960    305175  670.    638.    703. 95%       
#>  7 Area1  2009 Fema~        1930    297974  624.    595.    654. 95%       
#>  8 Area1  2009 Male         2013    288190  761.    726.    796. 95%       
#>  9 Area1  2010 Fema~        1910    295888  692.    659.    726. 95%       
#> 10 Area1  2010 Male         1829    280043  654.    623.    687. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate for under 75s by filtering out records for 75+ from input data frame and standard population
df_std %>%
    filter(ageband <= 70) %>%
    group_by(area, year, sex) %>%
    phe_dsr(obs, pop, stdpop = esp2013[1:15])
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   total_count total_pop value lowercl uppercl confidence
#>    <fct> <int> <fct>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~        1414    237103  618.    585.    651. 95%       
#>  2 Area1  2006 Male         2038    237815  858.    821.    897. 95%       
#>  3 Area1  2007 Fema~        1833    233436  803.    766.    841. 95%       
#>  4 Area1  2007 Male         1319    231741  564.    534.    596. 95%       
#>  5 Area1  2008 Fema~        1013    220431  493.    463.    526. 95%       
#>  6 Area1  2008 Male         1480    240874  668.    633.    704. 95%       
#>  7 Area1  2009 Fema~        1421    238969  579.    548.    610. 95%       
#>  8 Area1  2009 Male         1828    235334  804.    767.    843. 95%       
#>  9 Area1  2010 Fema~        1513    229003  694.    659.    731. 95%       
#> 10 Area1  2010 Male         1297    224371  586.    554.    619. 95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>
    
# calculate separate dsrs for persons for each area and year
df_std %>%
    group_by(area, year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(area, year) %>%
    phe_dsr(obs,pop)
#> # A tibble: 20 x 10
#> # Groups:   area [4]
#>    area   year total_count total_pop value lowercl uppercl confidence
#>    <fct> <int>       <int>     <int> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006        3902    589981  700.    677.    723. 95%       
#>  2 Area1  2007        4028    596576  689.    667.    712. 95%       
#>  3 Area1  2008        3282    595956  575.    554.    596. 95%       
#>  4 Area1  2009        3943    586164  685.    663.    708. 95%       
#>  5 Area1  2010        3739    575931  646.    624.    668. 95%       
#>  6 Area2  2006        3753    565383  676.    653.    700. 95%       
#>  7 Area2  2007        3901    600519  658.    637.    681. 95%       
#>  8 Area2  2008        3424    598133  561.    541.    581. 95%       
#>  9 Area2  2009        3993    563161  718.    695.    742. 95%       
#> 10 Area2  2010        3944    585114  739.    715.    764. 95%       
#> 11 Area3  2006        4178    563455  788.    763.    813. 95%       
#> 12 Area3  2007        3484    574973  631.    609.    653. 95%       
#> 13 Area3  2008        3682    566841  666.    644.    690. 95%       
#> 14 Area3  2009        3558    547466  649.    626.    671. 95%       
#> 15 Area3  2010        3601    534559  762.    737.    789. 95%       
#> 16 Area4  2006        3525    580083  583.    562.    604. 95%       
#> 17 Area4  2007        3813    564195  693.    670.    717. 95%       
#> 18 Area4  2008        4039    566938  727.    703.    751. 95%       
#> 19 Area4  2009        3943    547596  696.    673.    720. 95%       
#> 20 Area4  2010        4234    557976  794.    769.    819. 95%       
#> # ... with 2 more variables: statistic <chr>, method <chr>

Execute phe_smr and phe_isr

INPUT: Unlike the phe_dsr function, there is no default standard or reference data for the phe_smr and phe_isr functions. These functions take a single data frame as input, with columns representing the numerators and denominators for each standardisation category, plus reference numerators and denominators for each standardisation category.

The reference data can be provided either as vectors (for example, columns of a separate data frame) or as columns within the input data frame.

OUTPUT: By default, the functions output one row per grouping set containing the grouping variable values, the observed and expected counts, the reference rate (isr only), the smr or isr, the lower 95% confidence limit, the upper 95% confidence limit, the confidence level, the statistic name and the method.

OPTIONS: If reference data are being provided as columns within the input data frame then the user must specify this as the functions expect vectors by default. The functions also accept additional arguments to specify the level of confidence, the multiplier and a reduced level of detail to be output.
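
To make the roles of the observed, expected and reference rate columns concrete, the base R sketch below works through the point estimates for a toy example with three standardisation categories. All numbers and variable names are hypothetical, and only the point estimates are shown; confidence limits are omitted.

```r
# Sketch only: point estimates behind indirect standardisation, base R.
# All data below are hypothetical and chosen for easy arithmetic.
obs_toy    <- c(12, 18, 33)            # observed events per category in the area
pop_toy    <- c(1000, 2000, 1500)      # area population per category
refobs_toy <- c(100, 150, 400)         # reference events per category
refpop_toy <- c(10000, 15000, 20000)   # reference population per category
multiplier <- 100000

# Expected events: reference category rates applied to the area population
observed <- sum(obs_toy)
expected <- sum(pop_toy * refobs_toy / refpop_toy)

# Standardised ratio: observed relative to expected
smr <- observed / expected

# Indirectly standardised rate: the ratio scaled by the crude reference rate
ref_rate <- sum(refobs_toy) / sum(refpop_toy) * multiplier
isr <- smr * ref_rate

print(c(observed = observed, expected = expected, smr = smr,
        ref_rate = ref_rate, isr = isr))
```

As with a positional standard population, the reference vectors line up with the area data by position, so both need to be in the same category order.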

The following code chunk creates a data frame containing the reference data - this example uses the all area data for persons in the baseline year:

df_ref <- df_std %>%
    filter(year == 2006) %>%
    group_by(ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop))
    
head(df_ref)
#> # A tibble: 6 x 3
#>   ageband   obs    pop
#>     <dbl> <int>  <int>
#> 1       0   801 128271
#> 2       5   979 126392
#> 3      10  1014 129367
#> 4      15   685 109258
#> 5      20   734 106391
#> 6      25   584 126046

Here are some example code chunks to demonstrate the phe_smr function and the arguments that can optionally be specified:

# calculate separate smrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   observed expected value lowercl uppercl confidence
#>    <fct> <int> <fct>    <int>    <dbl> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~     1609    1997. 0.806   0.767   0.846 95%       
#>  2 Area1  2006 Male      2293    1962. 1.17    1.12    1.22  95%       
#>  3 Area1  2007 Fema~     2341    2024. 1.16    1.11    1.20  95%       
#>  4 Area1  2007 Male      1687    1995. 0.846   0.806   0.887 95%       
#>  5 Area1  2008 Fema~     1322    1932. 0.684   0.648   0.722 95%       
#>  6 Area1  2008 Male      1960    2039. 0.961   0.919   1.00  95%       
#>  7 Area1  2009 Fema~     1930    1981. 0.974   0.931   1.02  95%       
#>  8 Area1  2009 Male      2013    1923. 1.05    1.00    1.09  95%       
#>  9 Area1  2010 Fema~     1910    1979. 0.965   0.922   1.01  95%       
#> 10 Area1  2010 Male      1829    1843. 0.992   0.947   1.04  95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate the same smrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_smr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 11
#> # Groups:   area, year [20]
#>    area   year sex   observed expected value lowercl uppercl confidence
#>    <fct> <int> <fct>    <int>    <dbl> <dbl>   <dbl>   <dbl> <chr>     
#>  1 Area1  2006 Fema~     1609    1997. 0.806   0.767   0.846 95%       
#>  2 Area1  2006 Male      2293    1962. 1.17    1.12    1.22  95%       
#>  3 Area1  2007 Fema~     2341    2024. 1.16    1.11    1.20  95%       
#>  4 Area1  2007 Male      1687    1995. 0.846   0.806   0.887 95%       
#>  5 Area1  2008 Fema~     1322    1932. 0.684   0.648   0.722 95%       
#>  6 Area1  2008 Male      1960    2039. 0.961   0.919   1.00  95%       
#>  7 Area1  2009 Fema~     1930    1981. 0.974   0.931   1.02  95%       
#>  8 Area1  2009 Male      2013    1923. 1.05    1.00    1.09  95%       
#>  9 Area1  2010 Fema~     1910    1979. 0.965   0.922   1.01  95%       
#> 10 Area1  2010 Male      1829    1843. 0.992   0.947   1.04  95%       
#> # ... with 30 more rows, and 2 more variables: statistic <chr>,
#> #   method <chr>

# calculate separate smrs for each year and drop metadata columns from output
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_smr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 6
#>    year observed expected value lowercl uppercl
#>   <int>    <int>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2006    15358   15358  1       0.984   1.02 
#> 2  2007    15226   15638. 0.974   0.958   0.989
#> 3  2008    14427   15607. 0.924   0.909   0.940
#> 4  2009    15437   15008. 1.03    1.01    1.04 
#> 5  2010    15518   15010. 1.03    1.02    1.05

The phe_isr function works in exactly the same way, but instead of expressing the result as a ratio of the observed to expected counts, it expresses the result as a rate and also outputs the reference rate. Here are some examples:

# calculate separate isrs for each area, year and sex
df_std %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop)
#> # A tibble: 40 x 12
#> # Groups:   area, year [20]
#>    area   year sex   observed expected ref_rate value lowercl uppercl
#>    <fct> <int> <fct>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Fema~     1609    1997.     668.  538.    512.    565.
#>  2 Area1  2006 Male      2293    1962.     668.  781.    749.    813.
#>  3 Area1  2007 Fema~     2341    2024.     668.  773.    742.    805.
#>  4 Area1  2007 Male      1687    1995.     668.  565.    538.    593.
#>  5 Area1  2008 Fema~     1322    1932.     668.  457.    433.    482.
#>  6 Area1  2008 Male      1960    2039.     668.  642.    614.    671.
#>  7 Area1  2009 Fema~     1930    1981.     668.  651.    622.    680.
#>  8 Area1  2009 Male      2013    1923.     668.  699.    669.    730.
#>  9 Area1  2010 Fema~     1910    1979.     668.  645.    616.    674.
#> 10 Area1  2010 Male      1829    1843.     668.  663.    633.    694.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> #   statistic <chr>, method <chr>

# calculate the same isrs by appending the reference data to the data frame
df_std %>%
    mutate(refobs = rep(df_ref$obs,40),
           refpop = rep(df_ref$pop,40)) %>%
    group_by(area, year, sex) %>%
    phe_isr(obs, pop, refobs, refpop, refpoptype="field")
#> # A tibble: 40 x 12
#> # Groups:   area, year [20]
#>    area   year sex   observed expected ref_rate value lowercl uppercl
#>    <fct> <int> <fct>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#>  1 Area1  2006 Fema~     1609    1997.     668.  538.    512.    565.
#>  2 Area1  2006 Male      2293    1962.     668.  781.    749.    813.
#>  3 Area1  2007 Fema~     2341    2024.     668.  773.    742.    805.
#>  4 Area1  2007 Male      1687    1995.     668.  565.    538.    593.
#>  5 Area1  2008 Fema~     1322    1932.     668.  457.    433.    482.
#>  6 Area1  2008 Male      1960    2039.     668.  642.    614.    671.
#>  7 Area1  2009 Fema~     1930    1981.     668.  651.    622.    680.
#>  8 Area1  2009 Male      2013    1923.     668.  699.    669.    730.
#>  9 Area1  2010 Fema~     1910    1979.     668.  645.    616.    674.
#> 10 Area1  2010 Male      1829    1843.     668.  663.    633.    694.
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> #   statistic <chr>, method <chr>

# calculate separate isrs for each year and drop metadata columns from output
df_std %>%
    group_by(year, ageband) %>%
    summarise(obs = sum(obs),
              pop = sum(pop)) %>%
    group_by(year) %>%
    phe_isr(obs, pop, df_ref$obs, df_ref$pop, type="standard")
#> # A tibble: 5 x 7
#>    year observed expected ref_rate value lowercl uppercl
#>   <int>    <int>    <dbl>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2006    15358   15358      668.  668.    658.    679.
#> 2  2007    15226   15638.     668.  650.    640.    661.
#> 3  2008    14427   15607.     668.  618.    608.    628.
#> 4  2009    15437   15008.     668.  687.    676.    698.
#> 5  2010    15518   15010.     668.  691.    680.    702.