An Introduction to SOHPIE

Seungjun Ahn

June 16th, 2023

We introduce a Statistical Approach via Pseudo-value Information and Estimation for Differential Network Analysis (SOHPIE; pronounced as “Sofie”) [1]. SOHPIE is a regression modeling method for differential network (DN) analysis that can incorporate covariate information when analyzing microbiome data.

Requirements

Please install these R packages before using SOHPIE-DNA.

# library(robustbase) # To fit a robust regression.
# library(parallel) # To use mclapply() when reestimating the association matrix.
# library(dplyr)  # For the convenience of tabulating p-values, coefficients, and q-values.
# library(fdrtool) # For false discovery rate control.
# library(gtools) # To estimate an association matrix via SparCC.
library(SOHPIE)
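
Before loading anything, it can help to check which of the packages named above are already installed. The following is a minimal sketch in base R, assuming the packages are available from your configured repository; the object names `pkgs`, `installed`, and `missing_pkgs` are illustrative and not part of SOHPIE.

```r
# Packages named in the Requirements section of this vignette.
pkgs <- c("robustbase", "parallel", "dplyr", "fdrtool", "gtools", "SOHPIE")

# requireNamespace() returns FALSE (instead of raising an error) when a
# package is not installed, so it is safe to use for a quick check.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install (assumes repository access)
}
```

The `install.packages()` call is left commented out so that running the check never changes your library by accident.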

Example: Load the American Gut Project data from the SOHPIE R package:

Two sample datasets are available in this package. The first (combinedamgut) is from the American Gut Project [2] and contains 138 taxa and 268 subjects. In this user manual, only the first 30 of the 138 taxa are used, to keep the demonstration simple. The second (combineddietswap) is from the geographical epidemiology study of a diet swap intervention [3] and includes 112 taxa and 37 subjects (20 African Americans from Pittsburgh and 17 rural South Africans). The full data of each study are available in the SpiecEasi and microbiome R packages, respectively.

set.seed(20050505)
data(combinedamgut) # The complete dataset, with columns for taxa and clinical covariates.

Data processing for the toy example using the sample dataset from the American Gut Project:

The main grouping variable is an indicator of whether the subject lives with a dog. After data processing, the subject indices are available for each group: ‘Not living with a dog’ (Group A) and ‘Living with a dog’ (Group B). These indices are needed to estimate the group-specific \(p \times p\) association matrices (and, later, to re-estimate the association matrices for the pseudo-value calculations).

# Note: Again, we will use a toy example with the first 30 out of 138 taxa.
OTUtab = combinedamgut[ , 8:37]

# Clinical/demographic covariates (phenotypic data):
# Note: Every covariate in phenodat below will be included in the regression
#       when you call the SOHPIE_DNA function later. Please make sure that
#       phenodat includes only the variables you want to analyze.
phenodat = combinedamgut[, 1:7] # columns 1-7 hold the covariates; the taxa start at column 8.
# Obtain indices of each grouping factor.
# In this example, a variable indicating the status of living with a dog was chosen (i.e. bin_dog).
# Accordingly, Groups A and B imply living without and with a dog, respectively.
newindex_grpA = which(combinedamgut$bin_dog == 0)
newindex_grpB = which(combinedamgut$bin_dog == 1)
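
To illustrate why the group indices matter, here is a minimal sketch of jackknife pseudo-values computed separately within each group on simulated data. Several parts are assumptions made for illustration only: the toy data, the use of a Pearson correlation of log counts in place of SparCC, and the connectivity measure (sum of absolute associations with the other taxa). The helper names `assoc`, `connectivity`, and `jackknife_pseudo` are hypothetical and are not SOHPIE functions.

```r
set.seed(1)

# Hypothetical toy data: 12 subjects, 4 taxa, and a binary grouping variable.
n_subj <- 12
toy_otu <- matrix(rpois(n_subj * 4, lambda = 20) + 1, nrow = n_subj,
                  dimnames = list(NULL, paste0("taxon", 1:4)))
toy_group <- rep(c(0, 1), each = n_subj / 2)

# Same indexing pattern as newindex_grpA / newindex_grpB above.
idx_A <- which(toy_group == 0)
idx_B <- which(toy_group == 1)

# Stand-in association estimator (Pearson correlation of log counts), NOT SparCC.
assoc <- function(x) cor(log(x))

# Assumed connectivity measure: sum of absolute associations with the other taxa.
connectivity <- function(x) {
  s <- abs(assoc(x))
  diag(s) <- 0
  rowSums(s)
}

# Jackknife pseudo-values within one group:
#   n * theta_hat - (n - 1) * theta_hat(-i), for each subject i and each taxon.
jackknife_pseudo <- function(x) {
  n <- nrow(x)
  full <- connectivity(x)
  # Re-estimate the association matrix with subject i left out, for every i.
  loo <- t(vapply(seq_len(n),
                  function(i) connectivity(x[-i, , drop = FALSE]),
                  numeric(ncol(x))))
  sweep(-(n - 1) * loo, 2, n * full, "+")
}

pseudo_A <- jackknife_pseudo(toy_otu[idx_A, ])
pseudo_B <- jackknife_pseudo(toy_otu[idx_B, ])
dim(pseudo_A)  # one row per subject in group A, one column per taxon
```

Each leave-one-out step re-estimates the full association matrix, which is why the re-estimation is the costly part of the procedure and why parallel evaluation is useful on real data.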

Fit a pseudo-value regression via SOHPIE_DNA() function:

Once the data processing step above is complete, we can fit a pseudo-value regression using the SOHPIE_DNA function. Important: provide the OTU table and the clinical/demographic data (i.e. metadata) to the function as separate objects. In addition, you must supply the index objects for each group of the binary indicator variable that serves as the main predictor (e.g. living with a dog vs. without a dog).

SOHPIEres <- SOHPIE_DNA(OTUdat = OTUtab, clindat = phenodat, 
                        groupA = newindex_grpA, groupB = newindex_grpB)
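
The idea of regressing pseudo-values on covariates, one taxon at a time, can be sketched in base R as follows. This is not the internal code of SOHPIE_DNA: the data are simulated, `lm()` is used as a least-squares stand-in for a robust regression, and the Benjamini-Hochberg adjustment from `p.adjust()` stands in for q-values. All object names here are hypothetical.

```r
set.seed(2)

# Hypothetical inputs: pseudo-values for 20 subjects x 3 taxa, plus two covariates.
n <- 20
toy_clin <- data.frame(bin_group = rep(c(0, 1), each = n / 2),
                       age = round(runif(n, 20, 70)))
toy_pseudo <- matrix(rnorm(n * 3), nrow = n,
                     dimnames = list(NULL, c("taxonA", "taxonB", "taxonC")))
# Give taxonA a built-in group effect so that one regression has a clear signal.
toy_pseudo[, "taxonA"] <- toy_pseudo[, "taxonA"] + 3 * toy_clin$bin_group

# One regression per taxon; lm() is a least-squares stand-in for a robust fit.
fits <- lapply(colnames(toy_pseudo), function(k) {
  lm(toy_pseudo[, k] ~ bin_group + age, data = toy_clin)
})
names(fits) <- colnames(toy_pseudo)

# Taxa-by-covariate tables of estimates and p-values (intercept dropped).
coef_tab <- t(sapply(fits, function(f) coef(summary(f))[-1, "Estimate"]))
pval_tab <- t(sapply(fits, function(f) coef(summary(f))[-1, "Pr(>|t|)"]))

# Benjamini-Hochberg adjustment per covariate, as a stand-in for q-values.
adjp_tab <- apply(pval_tab, 2, p.adjust, method = "BH")
adjp_tab
```

The resulting tables have one row per taxon and one column per covariate, which is the same layout as the q-value table printed later in this vignette.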

Additional features available in SOHPIE package:

The SOHPIE package also provides convenience functions for use after fitting a pseudo-value regression. These functions quickly extract the names of taxa that are significantly differentially connected (DC; DCtaxa_tab), as well as the adjusted p-values (q-values; qval and qval_specific_var) and coefficient estimates (coeff and coeff_specific_var), either for all variables in the regression or for a specific variable.

# The qval() function returns a table of q-values (one row per taxon, one column per covariate).
qval(SOHPIEres)
#>             bin_dog        age        sex bin_floss bin_exercise cat_alcohol1
#> 326792 0.6718599537 0.03101449 0.06266118 0.9162595            1   0.52326081
#> 348374 0.5748463782 0.74520158 0.19217181 0.9600343            1   0.38502163
#> 181016 0.6307894498 0.73537257 0.31097113 0.9552732            1   0.69537716
#> 191687 0.6176794513 0.75533084 0.19065216 0.9475789            1   0.14040928
#> 305760 0.3901074464 0.30332590 0.05846960 0.8306534            1   0.14218172
#> 326977 0.3865110096 0.72056596 0.06122356 0.9307835            1   0.46242556
#> 194648 0.7171013989 0.66708462 0.21097982 0.6282862            1   0.63387885
#> 28186  0.6598291209 0.35869206 0.17999026 0.9616961            1   0.70635985
#> 541301 0.5466453444 0.70558475 0.32229419 0.6282862            1   0.11265657
#> 198941 0.6674430798 0.03955728 0.18871585 0.9654564            1   0.58934495
#> 353985 0.6246109890 0.27581278 0.28870141 0.9529683            1   0.59538815
#> 187524 0.6184338262 0.01049243 0.37866069 0.9398391            1   0.36751633
#> 182054 0.7192386938 0.14048573 0.17444083 0.9421658            1   0.72538489
#> 175537 0.5925609895 0.56989007 0.09704646 0.8213131            1   0.08461242
#> 9753   0.0007377574 0.28063614 0.17340556 0.8853829            1   0.14024956
#> 194211 0.3940861909 0.64146842 0.18019947 0.7018547            1   0.08905296
#> 188518 0.6178576071 0.70399261 0.06210556 0.8894388            1   0.66679831
#> 189396 0.5593156353 0.67969064 0.28480546 0.6282862            1   0.38300187
#> 90487  0.5818455566 0.01376920 0.06194247 0.9502697            1   0.37477822
#> 203708 0.6031857425 0.18927892 0.15567288 0.9068511            1   0.32739979
#> 173965 0.6776070018 0.71064343 0.05174975 0.9370214            1   0.59055892
#> 194661 0.5880918436 0.29693394 0.06244570 0.9548121            1   0.49558941
#> 512309 0.3999094450 0.67878505 0.13539747 0.8109234            1   0.74848265
#> 170124 0.7334064348 0.68746552 0.29771847 0.9366570            1   0.75842568
#> 216862 0.6282296460 0.71829359 0.11440726 0.9608466            1   0.71383068
#> 352304 0.3996296770 0.67747514 0.19140209 0.9033311            1   0.73118550
#> 191306 0.6777469157 0.62493368 0.19400474 0.9618242            1   0.72866880
#> 191541 0.5992836390 0.41855510 0.21526287 0.9500501            1   0.61848756
#> 191547 0.3349451520 0.56546219 0.17931233 0.9656134            1   0.62444641
#> 195493 0.6313144806 0.35204345 0.18474762 0.9563969            1   0.14443690
#>        cat_alcohol2 bin_migraine
#> 326792    0.9431084    0.7577956
#> 348374    0.9392681    0.8684028
#> 181016    0.9553457    0.8297496
#> 191687    0.9596005    0.8801983
#> 305760    0.8233684    0.7194888
#> 326977    0.9174052    0.8803676
#> 194648    0.9152999    0.7289431
#> 28186     0.6542158    0.5837704
#> 541301    0.9428093    0.5831110
#> 198941    0.6542158    0.7357105
#> 353985    0.9545933    0.1835796
#> 187524    0.9673489    0.3214464
#> 182054    0.9343531    0.1835796
#> 175537    0.9470631    0.8343346
#> 9753      0.9629066    0.6973559
#> 194211    0.9401837    0.6534395
#> 188518    0.9429131    0.8767364
#> 189396    0.9413717    0.8080775
#> 90487     0.9622269    0.1835796
#> 203708    0.9430064    0.7378143
#> 173965    0.9628792    0.8567233
#> 194661    0.9103574    0.6680625
#> 512309    0.9576967    0.6177373
#> 170124    0.9499444    0.8679927
#> 216862    0.9513115    0.8424656
#> 352304    0.8919443    0.8659532
#> 191306    0.9386446    0.4095100
#> 191541    0.9501993    0.1835796
#> 191547    0.9302687    0.8628035
#> 195493    0.9581127    0.8865934

The qval_specific_var function is useful for retrieving the q-values of a single variable (bin_dog in this example).

# Create an object to keep the table with q-values.
qvaltab <- qval(SOHPIEres)
# Retrieve a vector of q-values for a single variable of interest.
qval_specific_var(qvaltab = qvaltab, varname = "bin_dog")
#>             bin_dog
#> 326792 0.6718599537
#> 348374 0.5748463782
#> 181016 0.6307894498
#> 191687 0.6176794513
#> 305760 0.3901074464
#> 326977 0.3865110096
#> 194648 0.7171013989
#> 28186  0.6598291209
#> 541301 0.5466453444
#> 198941 0.6674430798
#> 353985 0.6246109890
#> 187524 0.6184338262
#> 182054 0.7192386938
#> 175537 0.5925609895
#> 9753   0.0007377574
#> 194211 0.3940861909
#> 188518 0.6178576071
#> 189396 0.5593156353
#> 90487  0.5818455566
#> 203708 0.6031857425
#> 173965 0.6776070018
#> 194661 0.5880918436
#> 512309 0.3999094450
#> 170124 0.7334064348
#> 216862 0.6282296460
#> 352304 0.3996296770
#> 191306 0.6777469157
#> 191541 0.5992836390
#> 191547 0.3349451520
#> 195493 0.6313144806

DCtaxa_tab returns a list containing (1) the names and q-values of taxa that are significantly DC between two biological conditions and (2) the names of the DC taxa only.

# Remember to supply the name of the grouping variable via groupvar
# and the level of significance via alpha (0.3 in this example).
# The result is stored under a different name so that it does not mask the function.
DCtaxa_res <- DCtaxa_tab(qvaltab = qvaltab, groupvar = "bin_dog", alpha = 0.3)
DCtaxa_res
#> $DCtaxa_complete_tab
#>           bin_dog
#> 9753 0.0007377574
#> 
#> $DCtaxa_names_only
#> [1] "9753"
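
The selection step can be pictured as thresholding one column of the q-value table. The sketch below uses a few bin_dog q-values copied (and rounded) from the output above; it assumes a strict `<` comparison against alpha, which may differ from what DCtaxa_tab does internally, and the object names are illustrative.

```r
# A few q-values for bin_dog copied from the output above (rounded).
toy_qvaltab <- matrix(c(0.6719, 0.3901, 0.0007, 0.3349),
                      ncol = 1,
                      dimnames = list(c("326792", "305760", "9753", "191547"),
                                      "bin_dog"))

alpha <- 0.3
keep <- toy_qvaltab[, "bin_dog"] < alpha

# drop = FALSE keeps the one-column table shape even when a single taxon remains.
dc_tab <- toy_qvaltab[keep, "bin_dog", drop = FALSE]
dc_names <- rownames(dc_tab)
dc_names
#> [1] "9753"
```

With alpha = 0.3, only taxon 9753 passes the threshold among these four, which agrees with the DCtaxa_tab output shown above.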

References

[1] Ahn S, Datta S. (2023). Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values. Under review at BMC Bioinformatics.

[2] McDonald D. et al. (2018). American Gut: an Open Platform for Citizen Science Microbiome Research. mSystems. 3(3), e00031–18.

[3] O’Keefe SJ. et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. Nat Commun. 6, 6342.