A tutorial for the GeoDetector R package

Chengdong Xu, Jinfeng Wang, Yue Hou (IGSNRR, CAS)

2018-07-31

GeoDetector method

Spatial stratified heterogeneity (SSH) refers to the phenomenon that observations within strata are more similar than observations between strata, as with land-use types and climate zones; it is ubiquitous in spatial data. Because SSH is structure rather than randomness, it carries information, and it has served as a window through which humans have understood nature since Aristotle's time. In addition, a model with global parameters is confounded if the input data exhibit SSH; the problem dissolves once SSH is identified, because simple models can then be applied to each stratum separately. Note that “spatial” here can refer either to geographic space or to space in the mathematical sense.

GeoDetector is a novel tool for investigating SSH. It can (1) measure and detect the SSH of a variable \(Y\); (2) test the determinant power of an explanatory variable \(X\) on a dependent variable \(Y\), according to the consistency between their spatial distributions; and (3) investigate the interaction of two explanatory variables \(X1\) and \(X2\) on a dependent variable \(Y\). All of these tasks are carried out with the geographical detector q-statistic:

\[ q = 1 - \frac{1}{N\sigma^2}\sum_{h=1}^{L}N_h\sigma_h^2 \]

where N and \(\sigma^2\) are the number of units and the variance of Y in the study area, respectively. The population Y is composed of L strata (h = 1, 2, …, L), and \(N_h\) and \(\sigma_h^2\) are the number of units and the variance of Y in stratum h, respectively. The strata of Y (red polygons in Figure 1) are a partition of Y, formed either by Y itself (h(Y) in Figure 1) or by a categorical explanatory variable X (h(X) in Figure 1). If X is a numerical variable, it must be stratified first; the number of strata L might be 2-10 or more, chosen according to prior knowledge or a classification algorithm.
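
The formula can be tried out directly in base R. The following is a minimal sketch, not the package's implementation: the function name q_statistic and the toy data are hypothetical, and the variances are assumed to use divisors \(N_h\) and \(N\), so that \(N_h\sigma_h^2\) and \(N\sigma^2\) are plain sums of squared deviations.

```r
# Sketch of the q-statistic: q = 1 - sum_h(N_h * sigma_h^2) / (N * sigma^2).
# Assumption: variances use divisors N_h and N, so each product below
# is a plain sum of squared deviations from the group mean.
q_statistic <- function(y, strata) {
  stopifnot(length(y) == length(strata))
  sum_sq <- function(v) sum((v - mean(v))^2)   # N_h * sigma_h^2 for one group
  within <- sum(tapply(y, strata, sum_sq))     # summed over all strata
  total <- sum_sq(y)                           # N * sigma^2 for the study area
  1 - within / total
}

# Hypothetical toy data: two strata with well-separated values of Y
y <- c(1, 2, 3, 7, 8, 9)
strata <- c("A", "A", "A", "B", "B", "B")
q_example <- q_statistic(y, strata)
print(round(q_example, 3))  # 1 - 4/58, about 0.931
```

In this toy case the within-strata sum of squares is 4 and the total is 58, so almost all of the variation in Y lies between the two strata.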

Figure 1. Principle of GeoDetector

(Notation: Yi stands for the value of a variable Y at a sample unit i; h(Y) represents a partition of Y; h(X) represents a partition of an explanatory variable X. In GeoDetector, the terms “stratification”, “classification” and “partition” are equivalent.)

Interpretation of the q value (see Figure 1). The value of q \(\in\) [0, 1].

If Y is stratified by itself, h(Y), then q = 0 indicates that Y shows no SSH, q = 1 indicates that Y is perfectly stratified, and in general the value of q gives the degree of SSH of Y.

If Y is stratified by an explanatory variable, h(X), then q = 0 indicates that there is no association between Y and X, q = 1 indicates that Y is completely determined by X, and in general the value of the q-statistic indicates that X explains 100q% of Y. Note that the q-statistic measures the association between X and Y, whether linear or nonlinear.
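
The two limiting values can be checked against the definition of q. As a sketch, assume that the variances are computed with divisors \(N\) and \(N_h\), and write \(\bar{Y}\) and \(\bar{Y}_h\) (symbols introduced here, not used elsewhere in this tutorial) for the mean of Y over the study area and over stratum h. The total sum of squares then splits into a within-strata part and a between-strata part:

\[ N\sigma^2 = \sum_{h=1}^{L} N_h\sigma_h^2 + \sum_{h=1}^{L} N_h(\bar{Y}_h - \bar{Y})^2 \]

Substituting this into the definition of q gives

\[ q = \frac{\sum_{h=1}^{L} N_h(\bar{Y}_h - \bar{Y})^2}{N\sigma^2} \]

Under these assumptions, q = 0 when every stratum mean equals the overall mean, and q = 1 when every \(\sigma_h^2 = 0\), that is, when Y is constant within each stratum. Both sums are non-negative, which is why q cannot leave [0, 1].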

R package for GeoDetector

The GeoDetector package includes five functions: factor_detector, interaction_detector, risk_detector, ecological_detector and GeoDetector. The first four functions implement the calculation of the factor detector, interaction detector, risk detector and ecological detector, and take table data as input, e.g. in CSV format (Table 1). The last function, GeoDetector, is an auxiliary function that can be used to carry out the calculation for map data in shapefile format (Figure 2).

Table 1. Demo data in table format
incidence  type  region  level
     5.94     7       5      5
     5.87     5       5      5
     5.92     5       5      5
     6.32     1       7      1
     6.49     3       2      4
     6.46     3       2      4
     6.51     3       2      4
     6.70     3       2      4
     6.68     3       2      4
     6.65     3       2      4

The GeoDetector package depends on the following packages, which should be installed in advance: rgeos, sp, maptools and rgdal.

As a demo, data on neural-tube birth defects (NTD), Y, and on suspected risk factors or their proxies, Xs, in villages are provided. The data include the health effect layer and the environmental factor layers “elevation”, “soil type”, and “watershed”.

Figure 2. Demo data in GIS format (a)NTDs prevalence Y, (b)Elevation X1, (c)Soil types X2, (d)Watersheds X3

After downloading the GeoDetector package, install it with the install.packages function:

install.packages("./Geodetector/geodetector_1.0-1.tar.gz",
                 repos=NULL, type="source")

where the example file path “./Geodetector/geodetector_1.0-1.tar.gz” should be changed to the location of the package file on the user’s computer.

Load package:

library("geodetector")

Read data in table format:

data(CollectData)

Data class:

class(CollectData)
## [1] "data.frame"

Field names:

names(CollectData)
## [1] "incidence" "type"      "region"    "level"

Factor detector

The factor detector q-statistic measures the SSH of a variable Y, or the determinant power of a covariate X of Y.

The function factor_detector implements the factor detector. In the following demo, the first parameter, “incidence”, is the explained variable; the second parameter, “type”, is the explanatory variable; and the third parameter, “CollectData”, is the dataset.

The output of the function includes the q-statistic and the corresponding p-value.
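
The demo call itself does not appear in this copy of the tutorial. Going by the parameter description above, it would presumably look like the following sketch. The requireNamespace guard is an addition, so that the snippet is simply skipped when the geodetector package is not installed, and the variable name factor_result is illustrative.

```r
# Assumed reconstruction of the demo call from the parameter description.
# The guard skips the call when the geodetector package is not installed.
factor_result <- NULL
if (requireNamespace("geodetector", quietly = TRUE)) {
  library("geodetector")
  data(CollectData)
  # Arguments: explained variable, explanatory variable, dataset
  factor_result <- factor_detector("incidence", "type", CollectData)
  print(factor_result)
}
```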

## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363

The function can also be called with the column index of each field as the input parameters. For example, in the following demo, the first parameter, “1”, refers to the explained variable in the first column of the dataset, and the second parameter, “2”, refers to the explanatory variable in the second column of the dataset.

## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363

If there is more than one explanatory variable, the function can be used as follows, where c(“type”,“region”,“level”) and c(2,3,4) are the field names and the column indices of the explanatory variables, respectively; either form can be used.
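
The two calls are not reproduced in this copy. Based on the description, they would presumably take the following form; this is an assumed reconstruction, guarded so that it is skipped when the geodetector package is not installed, and the variable names by_name and by_index are illustrative.

```r
# Assumed reconstruction: several explanatory variables passed as a vector,
# either of field names or of column indices.
by_name <- NULL
by_index <- NULL
if (requireNamespace("geodetector", quietly = TRUE)) {
  library("geodetector")
  data(CollectData)
  # Explanatory variables given by field name
  by_name <- factor_detector("incidence", c("type", "region", "level"), CollectData)
  # The same variables given by column index (column 1 is the explained variable)
  by_index <- factor_detector(1, c(2, 3, 4), CollectData)
  print(by_name)
}
```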

## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363
## 
## [[2]]
##        q-statistic      p-value
## region   0.6377737 0.0001169914
## 
## [[3]]
##       q-statistic    p-value
## level   0.6067087 0.04080407

Interaction detector

The interaction detector reveals whether the risk factors X1 and X2 (and further X) have an interactive influence on a disease Y.
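
A minimal base-R sketch of one common reading of this detector: the q-statistic is computed for each factor alone and for the strata formed by overlaying the two factors, and the values are then compared. This reading is an assumption; it is consistent with the matrix printed by interaction_detector in this section, whose diagonal repeats the single-factor q values, but the package internals are not shown here. The helper q_statistic and the toy data are hypothetical.

```r
# q-statistic with sums of squares (variances assumed to use divisors N_h and N)
q_statistic <- function(y, strata) {
  sum_sq <- function(v) sum((v - mean(v))^2)
  1 - sum(tapply(y, strata, sum_sq)) / sum_sq(y)
}

# Hypothetical toy data: y depends on the combination of x1 and x2,
# but on neither factor alone
toy <- data.frame(
  y  = c(1, 2, 9, 10, 9, 10, 1, 2),
  x1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  x2 = c("u", "u", "v", "v", "u", "u", "v", "v")
)

q_x1 <- q_statistic(toy$y, toy$x1)                 # factor x1 alone
q_x2 <- q_statistic(toy$y, toy$x2)                 # factor x2 alone
overlay <- interaction(toy$x1, toy$x2, drop = TRUE) # strata from both factors
q_x1x2 <- q_statistic(toy$y, overlay)

print(round(c(x1 = q_x1, x2 = q_x2, x1_x2 = q_x1x2), 3))
```

In this constructed case each factor alone has q = 0, while the overlay has q = 1 - 2/130, so the two factors matter only in combination.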

The function interaction_detector implements the interaction detector. In the following demo, the first parameter, “incidence”, is the explained variable; the second parameter, c(“type”,“region”,“level”), gives the explanatory variables; and the third parameter, “CollectData”, is the dataset.
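
The demo call is missing from this copy. From the parameter description it would presumably read as in the sketch below; the guard and the variable name interaction_result are additions for illustration.

```r
# Assumed reconstruction of the demo call from the parameter description.
# The guard skips the call when the geodetector package is not installed.
interaction_result <- NULL
if (requireNamespace("geodetector", quietly = TRUE)) {
  library("geodetector")
  data(CollectData)
  # Arguments: explained variable, explanatory variables, dataset
  interaction_result <- interaction_detector(
    "incidence", c("type", "region", "level"), CollectData
  )
  print(interaction_result)
}
```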

##                     type            region             level
## type   0.385716842809428 0.735680548139531 0.663523698335635
## region 0.735680548139531 0.637773670070423  0.71359677853471
## level  0.663523698335635  0.71359677853471 0.606708709727727

Risk detector

The risk detector calculates the average value of Y in each stratum of the explanatory variable X, and tests whether the averages of two strata differ significantly.
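
The idea can be sketched in base R as stratum means followed by pairwise t tests. This is an illustration only: the toy data are hypothetical, and the use of t.test with its default Welch correction is an assumption, since the exact test variant used by the package is not stated here.

```r
# Hypothetical toy data: strata s1 and s3 have similar values, s2 is higher
toy <- data.frame(
  y = c(5.1, 5.3, 5.2, 5.0, 6.9, 7.1, 7.0, 7.2, 5.2, 5.0, 5.3, 5.1),
  x = rep(c("s1", "s2", "s3"), each = 4)
)

# Part 1: mean of the explained variable in each stratum
stratum_means <- tapply(toy$y, toy$x, mean)

# Part 2: TRUE where two strata differ at the 0.05 level
# (assumption: Welch two-sample t test, the default of t.test)
levels_x <- names(stratum_means)
signif_diff <- matrix(FALSE, length(levels_x), length(levels_x),
                      dimnames = list(levels_x, levels_x))
for (i in levels_x) {
  for (j in levels_x) {
    if (i != j) {
      p <- t.test(toy$y[toy$x == i], toy$y[toy$x == j])$p.value
      signif_diff[i, j] <- p < 0.05
    }
  }
}

print(stratum_means)
print(signif_diff)
```

As in the package output, the diagonal is FALSE by construction, because a stratum is never compared with itself.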

The function risk_detector implements the risk detector. In the following demo, the first parameter, “incidence”, is the explained variable; the second parameter, “type”, is the explanatory variable; and the third parameter, “CollectData”, is the dataset.

The result for each explanatory variable is presented in two parts.

The first part gives the average value of the explained variable in each stratum of the explanatory variable.

The second part indicates whether the average values of two strata differ significantly; if the difference is significant (t test at a significance level of 0.05), the corresponding value is “TRUE”, otherwise it is “FALSE”.
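
The demo call is not included in this copy. Judging from the parameter description, it would presumably be the following; the guard and the variable name risk_result are additions for illustration.

```r
# Assumed reconstruction of the demo call from the parameter description.
# The guard skips the call when the geodetector package is not installed.
risk_result <- NULL
if (requireNamespace("geodetector", quietly = TRUE)) {
  library("geodetector")
  data(CollectData)
  # Arguments: explained variable, explanatory variable, dataset
  risk_result <- risk_detector("incidence", "type", CollectData)
  print(risk_result)
}
```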

## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    5                   5.843810
## 5    7                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     5     7
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE FALSE  TRUE
## 7 FALSE  TRUE  TRUE  TRUE FALSE

The function can also be called with the column index of each field as the input parameters. For example, in the following demo, the first parameter, “1”, refers to the explained variable in the first column of the dataset, and the second parameter, “2”, refers to the explanatory variable in the second column of the dataset.

## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    5                   5.843810
## 5    7                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     5     7
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE FALSE  TRUE
## 7 FALSE  TRUE  TRUE  TRUE FALSE

If there is more than one explanatory variable, the function can be used as follows, where c(“type”,“region”,“level”) and c(2,3,4) are the field names and the column indices of the explanatory variables, respectively; either form can be used.

## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    5                   5.843810
## 5    7                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     5     7
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE FALSE  TRUE
## 7 FALSE  TRUE  TRUE  TRUE FALSE
## 
## 
## [[2]]
## [[2]]$`Risk Detector`
##   region Mean of explained variable
## 1      1                   6.167813
## 2      2                   6.813103
## 3      3                   6.474231
## 4      4                   6.728000
## 5      5                   5.910000
## 6      6                   5.845714
## 7      7                   6.494167
## 8      8                   6.360769
## 9      9                   6.579231
## 
## [[2]]$`Significance.t-test:0.05`
##       1     2     3     4     5     6     7     8     9
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 4  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 7  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 8  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## 9  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## 
## 
## [[3]]
## [[3]]$`Risk Detector`
##   level Mean of explained variable
## 1     1                   6.455882
## 2     2                   6.171111
## 3     3                   6.258108
## 4     4                   6.621364
## 5     5                   5.908889
## 6     6                   6.888636
## 7     7                   5.790000
## 
## [[3]]$`Significance.t-test:0.05`
##       1     2     3     4     5     6     7
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 7  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Ecological detector

The ecological detector identifies whether the impacts of two risk factors, X1 and X2, differ significantly.

The function ecological_detector implements the ecological detector. In the following demo, the first parameter, “incidence”, is the explained variable; the second parameter, c(“type”,“region”), gives the explanatory variables; and the third parameter, “CollectData”, is the dataset. The function uses the F statistic to test the difference at a significance level of 0.05.
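
The demo call is absent from this copy. From the parameter description it would presumably look like this sketch; the guard and the variable name eco_result are additions for illustration.

```r
# Assumed reconstruction of the demo call from the parameter description.
# The guard skips the call when the geodetector package is not installed.
eco_result <- NULL
if (requireNamespace("geodetector", quietly = TRUE)) {
  library("geodetector")
  data(CollectData)
  # Arguments: explained variable, explanatory variables, dataset
  eco_result <- ecological_detector("incidence", c("type", "region"), CollectData)
  print(eco_result)
}
```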

## $`Significance.F-test:0.05`
##         type region
## type   FALSE   TRUE
## region  TRUE  FALSE

If there are more than two explanatory variables, the function can be used as follows.

## $`Significance.F-test:0.05`
##         type region level
## type   FALSE   TRUE  TRUE
## region  TRUE  FALSE FALSE
## level   TRUE  FALSE FALSE

where c(“type”,“region”,“level”) are the field names of the explanatory variables.

Transform data from map to table format

If the input data are in table format, they can be used directly as input to the functions above. However, if the input data are maps in shapefile format, the function named geodetector can be used to transform the shapefile maps to table format, after which the functions above can be applied. Please note that these shapefile layers should share the same projected coordinate system.

Load maptools package:

Read data:

In the following demo, the first parameter, “DiseaseData”, is the shapefile data storing the explained variable; the second parameter, c(SoilType, Watershed, Elevation), is the shapefile data storing the explanatory variables; and the third parameter, c(‘incidence’, ‘type’, ‘region’, ‘level’), gives the field names used in the calculation for the explained variable and the explanatory variables, respectively.

##   incidence type region level
## 1      5.94    7      5     5
## 2      5.87    5      5     5
## 3      5.92    5      5     5
## 4      6.32    1      7     1
## 5      6.49    3      2     4
## 6      6.46    3      2     4

Using the dataset CollectData produced by the maps2dataframe function, the following detectors can be calculated.

Risk detector:

## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    4                   5.843810
## 5    5                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    4                   5.843810
## 5    5                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    4                   5.843810
## 5    5                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
##   type Mean of explained variable
## 1    1                   6.340000
## 2    2                   6.687500
## 3    3                   6.583279
## 4    4                   5.843810
## 5    5                   6.347073
## 
## [[1]]$`Significance.t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
## 
## 
## [[2]]
## [[2]]$`Risk Detector`
##   region Mean of explained variable
## 1      1                   6.167813
## 2      2                   6.813103
## 3      3                   6.474231
## 4      4                   6.728000
## 5      5                   5.910000
## 6      6                   5.845714
## 7      7                   6.494167
## 8      8                   6.360769
## 9      9                   6.579231
## 
## [[2]]$`Significance.t-test:0.05`
##       1     2     3     4     5     6     7     8     9
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 4  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 7  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 8  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## 9  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## 
## 
## [[3]]
## [[3]]$`Risk Detector`
##   level Mean of explained variable
## 1     1                   6.455882
## 2     2                   6.171111
## 3     3                   6.258108
## 4     4                   6.621364
## 5     5                   5.908889
## 6     6                   6.888636
## 7     7                   5.790000
## 
## [[3]]$`Significance.t-test:0.05`
##       1     2     3     4     5     6     7
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 7  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Factor detector:

## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363
## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363
## 
## [[2]]
##        q-statistic      p-value
## region   0.6377737 0.0001169914
## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363
## [[1]]
##      q-statistic   p-value
## type   0.3857168 0.3632363
## 
## [[2]]
##        q-statistic      p-value
## region   0.6377737 0.0001169914
## 
## [[3]]
##       q-statistic    p-value
## level   0.6067087 0.04080407

Ecological detector:

## $`Significance.F-test:0.05`
##         type region
## type   FALSE   TRUE
## region  TRUE  FALSE
## $`Significance.F-test:0.05`
##         type region level
## type   FALSE   TRUE  TRUE
## region  TRUE  FALSE FALSE
## level   TRUE  FALSE FALSE

Interaction detector:

##                     type            region
## type   0.385716842809428 0.735680548139531
## region 0.735680548139531 0.637773670070423
##                     type            region             level
## type   0.385716842809428 0.735680548139531 0.663523698335635
## region 0.735680548139531 0.637773670070423  0.71359677853471
## level  0.663523698335635  0.71359677853471 0.606708709727727

Results output

Results can be saved as a CSV file, for example:
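
The example itself is missing from this copy. A minimal sketch with base R's write.csv follows. It assumes that a detector result is a list whose elements can be written like data frames, as the printed factor detector output suggests; the stand-in object result, which copies the values printed for “type”, and the file name are hypothetical.

```r
# Stand-in for a factor_detector result: a list with one table per
# explanatory variable (assumed structure, values copied from the output for "type")
result <- list(
  data.frame(`q-statistic` = 0.3857168, `p-value` = 0.3632363,
             row.names = "type", check.names = FALSE)
)

# Write the first table to a CSV file; row names hold the variable name
out_file <- file.path(tempdir(), "factor_detector_type.csv")
write.csv(result[[1]], file = out_file)

# Read the file back to confirm what was saved
saved <- read.csv(out_file, row.names = 1, check.names = FALSE)
print(saved)
```

For a result with several explanatory variables, each list element would need its own file, or the elements could be bound together with rbind before writing.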