Spatial stratified heterogeneity (SSH) refers to the phenomenon that observations within strata are more similar than observations between strata, as with land-use types and climate zones. It is ubiquitous in spatial data. Because SSH is a non-random pattern, it carries information, and it has served as a window for understanding nature since Aristotle's time. In addition, a model with global parameters is confounded if the input data exhibit SSH; the problem dissolves once SSH is identified, because simple models can then be applied to each stratum separately. Note that “spatial” here can mean either geographic space or space in the mathematical sense.
GeoDetector is a tool for investigating SSH. It can (1) measure and detect SSH in a variable \(Y\); (2) test the power of a determinant \(X\) of a dependent variable \(Y\) according to the consistency between their spatial distributions; and (3) investigate the interaction of two explanatory variables \(X1\) and \(X2\) on a dependent variable \(Y\). All three tasks are carried out with the geographical detector q-statistic: \[q = 1 - \frac{1}{N\sigma^2}\sum_{h=1}^{L}N_h\sigma_h^2\]
where \(N\) and \(\sigma^2\) are the number of units and the variance of \(Y\) in the study area, respectively; the population \(Y\) is composed of \(L\) strata (\(h = 1, 2, \ldots, L\)); and \(N_h\) and \(\sigma_h^2\) are the number of units and the variance of \(Y\) in stratum \(h\), respectively. The strata of Y (red polygons in Figure 1) are a partition of Y, formed either by Y itself (h(Y) in Figure 1) or by a categorical explanatory variable X (h(X) in Figure 1). If X is a numerical variable, it must be stratified first; the number of strata L might be 2-10 or more, chosen according to prior knowledge or a classification algorithm.
Figure 1. Principle of GeoDetector
(Notation: Yi stands for the value of a variable Y at a sample unit i; h(Y) represents a partition of Y; h(X) represents a partition of an explanatory variable X. In GeoDetector, the terms “stratification”, “classification” and “partition” are equivalent.)
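A numerical explanatory variable has to be turned into strata before q can be computed. The following is a minimal sketch of one possible classification, quantile breaks in base R. The helper name `stratify_quantile` and the elevation values are made up for illustration and are not part of the GeoDetector package.

```r
# Sketch: discretise a numerical explanatory variable into L strata.
# `stratify_quantile` is a hypothetical helper, not a package function.
stratify_quantile <- function(x, L = 5) {
  # Quantile break points give strata with roughly equal counts.
  breaks <- quantile(x, probs = seq(0, 1, length.out = L + 1), names = FALSE)
  # unique() guards against duplicated breaks when x has many tied values.
  cut(x, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

elevation <- c(12, 35, 48, 51, 67, 80, 95, 120, 150, 210)  # made-up values
strata <- stratify_quantile(elevation, L = 5)
table(strata)  # number of units per stratum
```

Other classification schemes (equal intervals, natural breaks, prior knowledge) are equally valid choices; the resulting stratum codes are what the detectors would take as the explanatory variable.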
Interpretation of the q value (see Figure 1): \(q \in [0, 1]\).
If Y is stratified by itself, h(Y), then q = 0 indicates that Y shows no SSH and q = 1 indicates that Y shows perfect SSH; in general, q is the degree of SSH of Y.
If Y is stratified by an explanatory variable, h(X), then q = 0 indicates that there is no association between Y and X and q = 1 indicates that Y is completely determined by X; in general, X explains 100q% of the variance of Y. Note that the q-statistic measures both linear and nonlinear association between X and Y.
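The definition and its interpretation can be checked with a few lines of base R. This is a sketch written directly from the formula above, not the package's own code; `q_statistic` is a hypothetical name, and the variances are taken as population variances so that \(N\sigma^2\) and \(N_h\sigma_h^2\) are plain sums of squares.

```r
# Sketch of the q-statistic; `q_statistic` is a hypothetical name.
q_statistic <- function(y, strata) {
  # N * sigma^2: total sum of squares of Y over the study area
  sst <- sum((y - mean(y))^2)
  # sum over h of N_h * sigma_h^2: sum of squares within each stratum
  ssw <- sum(tapply(y, strata, function(v) sum((v - mean(v))^2)))
  1 - ssw / sst
}

y <- c(1, 2, 3, 11, 12, 13)
# Strata that separate low from high values: q close to 1
q_aligned <- q_statistic(y, c("a", "a", "a", "b", "b", "b"))
# Strata that cut across the pattern: q close to 0
q_mixed <- q_statistic(y, c("a", "b", "a", "b", "a", "b"))
# No variation inside any stratum: q equals 1
q_perfect <- q_statistic(c(5, 5, 9, 9), c("a", "a", "b", "b"))
c(q_aligned, q_mixed, q_perfect)
```

Because only stratum membership enters the calculation, the strata labels can be any categorical coding; no linear relationship between X and Y is assumed.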
The GeoDetector package includes five functions: factor_detector, interaction_detector, risk_detector, ecological_detector and maps2dataframe. The first four implement the calculation of the factor detector, interaction detector, risk detector and ecological detector, and operate on table data, e.g. in CSV format (Table 1). The last, maps2dataframe, is an auxiliary function that converts map data in shapefile format (Figure 2) to table format so that the detectors can be applied.
incidence | type | region | level |
---|---|---|---|
5.94 | 7 | 5 | 5 |
5.87 | 5 | 5 | 5 |
5.92 | 5 | 5 | 5 |
6.32 | 1 | 7 | 1 |
6.49 | 3 | 2 | 4 |
6.46 | 3 | 2 | 4 |
6.51 | 3 | 2 | 4 |
6.70 | 3 | 2 | 4 |
6.68 | 3 | 2 | 4 |
6.65 | 3 | 2 | 4 |
The GeoDetector package depends on the following packages, which should be installed in advance: rgeos, sp, maptools and rgdal.
As a demo, data on neural-tube birth defects (NTD) Y and suspected risk factors or their proxies Xs in villages are provided, comprising a health-effect layer and three environmental factor layers: “elevation”, “soil type”, and “watershed”.
Figure 2. Demo data in GIS format: (a) NTD prevalence Y, (b) elevation X1, (c) soil types X2, (d) watersheds X3
After downloading the GeoDetector package, install it with the install.packages function.
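A local installation of this kind might look like the following sketch. The path is the one quoted in this text; the `file.exists` guard and the variable name `pkg_file` are additions for illustration, and the dependencies listed above are assumed to be installed already.

```r
# Path quoted in this text; adjust it to where the downloaded archive is stored.
pkg_file <- "./Geodector/GeoDetector_1.0-1.tar.gz"

# repos = NULL makes install.packages() install from a local file instead of
# a repository; type = "source" marks the file as a source archive.
if (file.exists(pkg_file)) {
  install.packages(pkg_file, repos = NULL, type = "source")
} else {
  message("Archive not found at: ", pkg_file)
}
```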
In the example, the file path “./Geodector/GeoDetector_1.0-1.tar.gz” should be changed to the location of the package file on the user’s computer.
Load package:
Read data in table format:
Data class:
## [1] "data.frame"
Field names:
## [1] "incidence" "type" "region" "level"
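For readers who want to try the same checks on their own table, the sketch below reads CSV text with base R and inspects the class and field names. The rows are the first lines of Table 1; the object names `csv_text` and `demo_data` are placeholders, and in practice the data would come from a file passed to `read.csv`.

```r
# Sketch: read table-format data; the rows are the first lines of Table 1.
csv_text <- "incidence,type,region,level
5.94,7,5,5
5.87,5,5,5
5.92,5,5,5
6.32,1,7,1
6.49,3,2,4"

# read.csv(text = ...) parses a string; pass a file path instead for a real file.
demo_data <- read.csv(text = csv_text)

class(demo_data)  # "data.frame"
names(demo_data)  # "incidence" "type" "region" "level"
```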
The factor detector q-statistic measures the SSH of a variable Y, or the determinant power of a covariate X of Y.
The function factor_detector implements the factor detector. In the following demo, the first parameter “incidence” is the explained variable, the second parameter “type” is the explanatory variable, and the third parameter “CollectData” is the dataset.
The output of the function includes the q-statistic and the corresponding p-value.
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
The function can also be called with field indices as input parameters. For example, in the following demo, the first parameter “1” refers to the explained variable in the first column of the dataset, and the second parameter “2” refers to the explanatory variable in the second column.
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
If there is more than one explanatory variable, the function can be used as follows, where c("type","region","level") and c(2,3,4) are the field names and the field indices of the explanatory variables, respectively.
or
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
##
## [[2]]
## q-statistic p-value
## region 0.6377737 0.0001169914
##
## [[3]]
## q-statistic p-value
## level 0.6067087 0.04080407
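To make the name-based and index-based selection concrete, here is a base R sketch that computes q from scratch for several explanatory variables. It does not call factor_detector and does not compute p-values; `q_statistic` and `demo_data` are illustrative names, and the data are only the ten rows shown in Table 1, so the values differ from the package output above.

```r
# Sketch only: q computed from the formula, not by the package.
q_statistic <- function(y, strata) {
  sst <- sum((y - mean(y))^2)                                      # N * sigma^2
  ssw <- sum(tapply(y, strata, function(v) sum((v - mean(v))^2)))  # sum of N_h * sigma_h^2
  1 - ssw / sst
}

# The ten rows shown in Table 1.
demo_data <- data.frame(
  incidence = c(5.94, 5.87, 5.92, 6.32, 6.49, 6.46, 6.51, 6.70, 6.68, 6.65),
  type      = c(7, 5, 5, 1, 3, 3, 3, 3, 3, 3),
  region    = c(5, 5, 5, 7, 2, 2, 2, 2, 2, 2),
  level     = c(5, 5, 5, 1, 4, 4, 4, 4, 4, 4)
)

# Explanatory variables selected by field name ...
q_by_name <- sapply(c("type", "region", "level"),
                    function(x) q_statistic(demo_data$incidence, demo_data[[x]]))
# ... or by column index; both give the same values.
q_by_index <- sapply(2:4,
                     function(j) q_statistic(demo_data[[1]], demo_data[[j]]))
q_by_name
```

In these ten rows, “region” and “level” induce the same partition of the units, so their q values coincide; this illustrates that q depends only on how the units are grouped, not on the labels of the strata.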
The interaction detector reveals whether the risk factors X1 and X2 (and more X) have an interactive influence on a disease Y.
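The function interaction_detector implements the interaction detector. In the following demo, the first parameter “incidence” is the explained variable, the second parameter c("type","region","level") gives the explanatory variables, and the third parameter “CollectData” is the dataset. In the output matrix, each diagonal entry is the q-statistic of a single variable, and each off-diagonal entry is the q-statistic of the two variables taken together.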
## type region level
## type 0.385716842809428 0.735680548139531 0.663523698335635
## region 0.735680548139531 0.637773670070423 0.71359677853471
## level 0.663523698335635 0.71359677853471 0.606708709727727
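The idea behind the matrix above can be sketched in base R by computing q for each factor alone and for the overlay of both. This is an illustration under assumptions, not the package's implementation: the overlay is built with `interaction()`, `q_statistic` is a hypothetical helper, and the data are made up so that neither factor alone explains Y but their combination does.

```r
# Sketch: q for two factors and for their overlay (made-up data).
q_statistic <- function(y, strata) {
  sst <- sum((y - mean(y))^2)
  ssw <- sum(tapply(y, strata, function(v) sum((v - mean(v))^2)))
  1 - ssw / sst
}

d <- data.frame(
  y  = c(1, 1.2, 5, 5.1, 5.2, 4.9, 1.1, 0.9),
  x1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  x2 = c("u", "u", "v", "v", "u", "u", "v", "v")
)

# Each combination of an x1 stratum and an x2 stratum becomes one new stratum.
overlay <- interaction(d$x1, d$x2, drop = TRUE)

q_x1   <- q_statistic(d$y, d$x1)
q_x2   <- q_statistic(d$y, d$x2)
q_x1x2 <- q_statistic(d$y, overlay)

# Same layout as the output above: single-factor q on the diagonal,
# overlay q off the diagonal.
q_matrix <- matrix(c(q_x1, q_x1x2, q_x1x2, q_x2), nrow = 2,
                   dimnames = list(c("x1", "x2"), c("x1", "x2")))
round(q_matrix, 3)
```

Comparing the off-diagonal value with the two diagonal values shows whether the factors reinforce each other; here the overlay explains almost all of the variance although each factor alone explains almost none.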
The risk detector calculates the average value of Y in each stratum of an explanatory variable X and tests whether the averages of two strata differ significantly.
The function risk_detector implements the risk detector. In the following demo, the first parameter “incidence” is the explained variable, the second parameter “type” is the explanatory variable, and the third parameter “CollectData” is the dataset.
The result for each variable is presented in two parts.
The first part gives the average value of the explained variable in each stratum of the explanatory variable.
The second part indicates whether the averages of two strata differ significantly: if the difference is significant (t test at the 0.05 significance level), the corresponding value is “TRUE”; otherwise it is “FALSE”.
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 5 5.843810
## 5 7 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 5 7
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE TRUE FALSE TRUE
## 7 FALSE TRUE TRUE TRUE FALSE
The function can also be called with field indices as input parameters. For example, in the following demo, the first parameter “1” refers to the explained variable in the first column of the dataset, and the second parameter “2” refers to the explanatory variable in the second column.
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 5 5.843810
## 5 7 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 5 7
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE TRUE FALSE TRUE
## 7 FALSE TRUE TRUE TRUE FALSE
If there is more than one explanatory variable, the function can be used as follows, where c("type","region","level") and c(2,3,4) are the field names and the field indices of the explanatory variables, respectively.
or
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 5 5.843810
## 5 7 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 5 7
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 5 TRUE TRUE TRUE FALSE TRUE
## 7 FALSE TRUE TRUE TRUE FALSE
##
##
## [[2]]
## [[2]]$`Risk Detector`
## region Mean of explained variable
## 1 1 6.167813
## 2 2 6.813103
## 3 3 6.474231
## 4 4 6.728000
## 5 5 5.910000
## 6 6 5.845714
## 7 7 6.494167
## 8 8 6.360769
## 9 9 6.579231
##
## [[2]]$`Significance.t-test:0.05`
## 1 2 3 4 5 6 7 8 9
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## 8 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## 9 TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
##
##
## [[3]]
## [[3]]$`Risk Detector`
## level Mean of explained variable
## 1 1 6.455882
## 2 2 6.171111
## 3 3 6.258108
## 4 4 6.621364
## 5 5 5.908889
## 6 6 6.888636
## 7 7 5.790000
##
## [[3]]$`Significance.t-test:0.05`
## 1 2 3 4 5 6 7
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE FALSE
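The two parts of the risk detector output can be reproduced in outline with base R: stratum means from `tapply()` and a matrix of pairwise t tests. The sketch below is not the package's code. It assumes Welch's two-sample t test, which is the default of `t.test()`; the variant used inside risk_detector is not stated here. The data and object names are made up.

```r
# Sketch: stratum means and pairwise t tests at the 0.05 level (made-up data).
d <- data.frame(
  y = c(6.30, 6.40, 6.20, 6.35,   # stratum "1"
        5.80, 5.90, 5.85, 5.75,   # stratum "2"
        6.32, 6.38, 6.25, 6.41),  # stratum "3"
  x = rep(c("1", "2", "3"), each = 4)
)

# Part 1: mean of the explained variable in each stratum.
stratum_means <- tapply(d$y, d$x, mean)

# Part 2: TRUE where the means of two strata differ significantly.
lev <- names(stratum_means)
signif_diff <- matrix(FALSE, length(lev), length(lev), dimnames = list(lev, lev))
for (i in lev) {
  for (j in lev) {
    if (i != j) {
      # Assumption: Welch's t test, the default of t.test()
      p <- t.test(d$y[d$x == i], d$y[d$x == j])$p.value
      signif_diff[i, j] <- p < 0.05
    }
  }
}

stratum_means
signif_diff
```

As in the output above, the diagonal is always FALSE and the matrix is symmetric, because each cell compares a pair of strata regardless of order.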
The ecological detector identifies whether the impacts of two risk factors X1 and X2 differ significantly.
The function ecological_detector implements the ecological detector. In the following demo, the first parameter “incidence” is the explained variable, the second parameter c("type","region") gives the explanatory variables, and the third parameter “CollectData” is the dataset. The function uses the F statistic to test the difference at the 0.05 significance level.
## $`Significance.F-test:0.05`
## type region
## type FALSE TRUE
## region TRUE FALSE
If there are more than two variables, the function can be used as follows.
## $`Significance.F-test:0.05`
## type region level
## type FALSE TRUE TRUE
## region TRUE FALSE FALSE
## level TRUE FALSE FALSE
where c("type","region","level") are the field names of the explanatory variables.
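The kind of comparison the ecological detector makes can be sketched as a variance-ratio test on the within-strata sums of squares of two factors. This text does not show the exact statistic used by ecological_detector, so the form below is an assumption: the ratio of the two within-strata sums of squares, referred to an F distribution with (N - 1, N - 1) degrees of freedom and tested one-sided. The data and names are made up.

```r
# Sketch under assumptions; not the package's implementation.
within_ss <- function(y, strata) {
  # Sum over strata of the squared deviations from the stratum mean
  sum(tapply(y, strata, function(v) sum((v - mean(v))^2)))
}

d <- data.frame(
  y  = c(1, 1.2, 0.9, 1.1, 5, 5.1, 4.9, 5.2),
  x1 = c("a", "a", "a", "a", "b", "b", "b", "b"),  # separates low from high y
  x2 = c("u", "v", "u", "v", "u", "v", "u", "v")   # unrelated to y
)

n <- nrow(d)
# Assumed statistic: ratio of within-strata sums of squares (x2 over x1).
f_stat <- within_ss(d$y, d$x2) / within_ss(d$y, d$x1)
# Assumed reference distribution: F with (n - 1, n - 1) degrees of freedom.
p_value <- pf(f_stat, df1 = n - 1, df2 = n - 1, lower.tail = FALSE)

significant <- p_value < 0.05
c(F = f_stat, p = p_value)
```

A large ratio means that x2 leaves much more unexplained variation inside its strata than x1 does, which is the sense in which the two factors differ in impact.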
If the input data are in table format, they can be passed directly to the functions above. If the input data are maps in shapefile format, the function maps2dataframe can be used to convert the shapefile maps to table format, after which the functions above can be applied. Note that all shapefile layers must use the same projected coordinate system.
Load maptools package:
Read data:
In the following demo, the first parameter “DiseaseData” is the shapefile data storing the explained variable, the second parameter c(SoilType, Watershed, Elevation) gives the shapefile data storing the explanatory variables, and the third parameter c('incidence', 'type', 'region', 'level') gives the field names used in the calculation for the explained variable and the explanatory variables, respectively.
# Convert the shapefile layers to a single table with the named fields
CollectData2 <- maps2dataframe(DiseaseData, c(SoilType, Watershed, Elevation),
                               namescolomn = c('incidence', 'type', 'region', 'level'))
head(CollectData)
## incidence type region level
## 1 5.94 7 5 5
## 2 5.87 5 5 5
## 3 5.92 5 5 5
## 4 6.32 1 7 1
## 5 6.49 3 2 4
## 6 6.46 3 2 4
Using the dataset produced by the maps2dataframe function, the detectors can be calculated as follows.
Risk detector:
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 4 5.843810
## 5 5 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 4 5
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE
## 5 FALSE TRUE TRUE TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 4 5.843810
## 5 5 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 4 5
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE
## 5 FALSE TRUE TRUE TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 4 5.843810
## 5 5 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 4 5
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE
## 5 FALSE TRUE TRUE TRUE FALSE
## [[1]]
## [[1]]$`Risk Detector`
## type Mean of explained variable
## 1 1 6.340000
## 2 2 6.687500
## 3 3 6.583279
## 4 4 5.843810
## 5 5 6.347073
##
## [[1]]$`Significance.t-test:0.05`
## 1 2 3 4 5
## 1 FALSE TRUE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE
## 5 FALSE TRUE TRUE TRUE FALSE
##
##
## [[2]]
## [[2]]$`Risk Detector`
## region Mean of explained variable
## 1 1 6.167813
## 2 2 6.813103
## 3 3 6.474231
## 4 4 6.728000
## 5 5 5.910000
## 6 6 5.845714
## 7 7 6.494167
## 8 8 6.360769
## 9 9 6.579231
##
## [[2]]$`Significance.t-test:0.05`
## 1 2 3 4 5 6 7 8 9
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## 8 TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## 9 TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
##
##
## [[3]]
## [[3]]$`Risk Detector`
## level Mean of explained variable
## 1 1 6.455882
## 2 2 6.171111
## 3 3 6.258108
## 4 4 6.621364
## 5 5 5.908889
## 6 6 6.888636
## 7 7 5.790000
##
## [[3]]$`Significance.t-test:0.05`
## 1 2 3 4 5 6 7
## 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## 2 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Factor detector:
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
##
## [[2]]
## q-statistic p-value
## region 0.6377737 0.0001169914
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
## [[1]]
## q-statistic p-value
## type 0.3857168 0.3632363
##
## [[2]]
## q-statistic p-value
## region 0.6377737 0.0001169914
##
## [[3]]
## q-statistic p-value
## level 0.6067087 0.04080407
Ecological detector:
## $`Significance.F-test:0.05`
## type region
## type FALSE TRUE
## region TRUE FALSE
## $`Significance.F-test:0.05`
## type region level
## type FALSE TRUE TRUE
## region TRUE FALSE FALSE
## level TRUE FALSE FALSE
Interaction detector:
## type region
## type 0.385716842809428 0.735680548139531
## region 0.735680548139531 0.637773670070423
## type region level
## type 0.385716842809428 0.735680548139531 0.663523698335635
## region 0.735680548139531 0.637773670070423 0.71359677853471
## level 0.663523698335635 0.71359677853471 0.606708709727727
Results can be saved as a CSV file, for example:
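A minimal sketch of saving a result with base R follows. The data frame `result` is built by hand to mimic the shape of the factor detector output for “type” shown earlier; since the detector outputs above are printed as lists, a real call would presumably write one list element, such as the first. The object names and the temporary file location are placeholders.

```r
# Sketch: write one result table to CSV and read it back.
# `result` mimics the factor detector output for "type" shown earlier.
result <- data.frame(
  `q-statistic` = 0.3857168,
  `p-value`     = 0.3632363,
  row.names     = "type",
  check.names   = FALSE   # keep the hyphenated column names as they are
)

# Placeholder location; replace with the desired output path.
out_file <- file.path(tempdir(), "factor_detector_result.csv")
write.csv(result, out_file)

# Read the file back to confirm the round trip.
saved <- read.csv(out_file, row.names = 1, check.names = FALSE)
saved
```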