Three parts need to be prepared:
library(data.table)
library(xQTLbiolinks)
library(stringr)
Prostate cancer is one of the most common cancers in men. Its pathogenesis involves both heritable and environmental factors, and the molecular events involved in its development or progression are still unclear. In this example, we aim to identify the causal variants and genes associated with prostate cancer, and to uncover potential molecular mechanisms of regulation.
For data preparation, we download the summary statistics of a prostate cancer GWAS (GCST006085) from the GWAS Catalog and load the dataset in R with the data.table package. Correspondingly, we choose Prostate as the tissue of interest. We retain the variants with a dbSNP ID (starting with rs), and a data.table object named gwasDF of 13,498,990 (rows) x 5 (cols) is loaded.
gwasDF <- fread("29892016-GCST006085-EFO_0001663-build37.f.tsv.gz")
# extract columns.
gwasDF <- gwasDF[str_detect(variant_id, "^rs"),.(rsid=variant_id, chrom=chromosome, position= base_pair_location, pValue=p_value, AF=effect_allele_frequency)]
# tissue:
tissueSiteDetail="Prostate"
head(gwasDF)
#> chr position chr_position rsid beta se N p-value maf
#> 1: chr1 10177 chr1:10177 rs201752861 -0.0211 0.0136 140254 0.1217 0.6104
#> 2: chr1 11008 chr1:11008 rs575272151 -0.0204 0.0196 140254 0.2979 0.9058
#> 3: chr1 11012 chr1:11012 rs544419019 -0.0204 0.0196 140254 0.2979 0.9058
#> 4: chr1 13110 chr1:13110 rs540538026 -0.0377 0.0332 140254 0.2565 0.0589
#> 5: chr1 13116 chr1:13116 rs62635286 -0.0035 0.0178 140254 0.8447 0.8255
#> 6: chr1 13118 chr1:13118 rs200579949 -0.0035 0.0178 140254 0.8447 0.8255
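The variant filtering step above can be illustrated on a toy table. This is a minimal base-R sketch, assuming invented values and using grepl in place of str_detect; only the column names are taken from the example above.

```r
# Toy summary statistics; column names mirror those used above,
# but all values are invented for illustration.
toy <- data.frame(
  variant_id = c("rs123", "1:10177_A_AC", "rs456", "chr2:5000"),
  chromosome = c(1, 1, 2, 2),
  base_pair_location = c(1000, 10177, 2000, 5000),
  p_value = c(0.01, 0.5, 3e-9, 0.2),
  stringsAsFactors = FALSE
)

# Keep only rows whose identifier starts with "rs" (dbSNP-style IDs).
keep <- grepl("^rs", toy$variant_id)
toy_rs <- toy[keep, ]

# Rename to the column names expected downstream.
names(toy_rs) <- c("rsid", "chrom", "position", "pValue")
print(toy_rs)
```

Variants without an rs identifier are dropped because the later merge with the eQTL associations is done by rsid.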
A sentinel SNP is the most prominent signal within a given genomic range, and is usually in high LD with causal variants. By default, xQTLbiolinks detects sentinel SNPs with a p-value < 5e-8 and a SNP-to-SNP distance > 1e6 bp. Note: in this example, the genome version of the GWAS dataset (GRCh37) differs from that of the eQTL associations (GRCh38) from eQTL category, so the genome version must be converted. This can be done in xQTLanalyze_getSentinelSnp with genomeVersion="grch37" and grch37To38=TRUE (package rtracklayer is required):
sentinelSnpDF <- xQTLanalyze_getSentinelSnp(gwasDF, centerRange=1e6,
                                            genomeVersion="grch37", grch37To38=TRUE)
A total of 94 sentinel SNPs are detected.
head(sentinelSnpDF)
#> rsid chr position pValue maf
#> 1: rs55664108 chr1 204587862 8.572e-25 0.7157
#> 2: rs35296356 chr1 150601662 3.355e-14 0.3384
#> 3: rs34579442 chr1 153927424 4.478e-14 0.6637
#> 4: rs146564277 chr1 155051619 1.577e-12 0.0301
#> 5: rs34848415 chr1 205762038 2.893e-09 0.5231
#> 6: rs56391074 chr1 87745032 1.659e-08 0.6298
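The idea behind sentinel SNP detection can be sketched as a greedy selection: keep genome-wide significant variants, repeatedly take the most significant one, and discard its neighbours within the chosen distance. The function pick_sentinels and the toy data below are hypothetical and only illustrate the principle; this is not the implementation used by xQTLanalyze_getSentinelSnp.

```r
# Greedy sentinel selection (illustrative sketch, not the package's code).
pick_sentinels <- function(df, p_cut = 5e-8, min_dist = 1e6) {
  sig <- df[df$pValue < p_cut, ]      # genome-wide significant variants only
  sig <- sig[order(sig$pValue), ]     # most significant first
  chosen <- sig[0, ]
  while (nrow(sig) > 0) {
    top <- sig[1, ]
    chosen <- rbind(chosen, top)
    # Drop the chosen SNP and every SNP within min_dist on the same chromosome.
    near <- sig$chrom == top$chrom &
      abs(sig$position - top$position) <= min_dist
    sig <- sig[!near, ]
  }
  rownames(chosen) <- NULL
  chosen
}

# Invented toy data.
toy <- data.frame(
  rsid = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  chrom = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  position = c(1000000, 1200000, 5000000, 300000, 400000),
  pValue = c(1e-12, 1e-20, 1e-9, 1e-3, 2e-8),
  stringsAsFactors = FALSE
)

sentinels <- pick_sentinels(toy)
print(sentinels)
```

In the toy data, rs1 is dropped because it lies within 1 Mb of the stronger signal rs2, and rs4 is dropped because it does not reach the significance threshold.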
Trait genes are genes located within 1 Mb (the default, which can be changed with the parameter detectRange) of sentinel SNPs. To reduce the number of trait genes and thus the running time, the function xQTLanalyze_getTraits returns the overlap of eGenes and trait genes as its final output:
traitsAll <- xQTLanalyze_getTraits(sentinelSnpDF, detectRange=1e6, tissueSiteDetail=tissueSiteDetail)
In total, 898 associations between 835 trait genes and 92 sentinel SNPs are detected:
head(traitsAll)
#> chromosome geneStart geneEnd geneStrand geneSymbol gencodeId
#> 1: chr1 205086142 205122015 - RBBP5 ENSG00000117222.13
#> 2: chr1 205142505 205211566 - DSTYK ENSG00000133059.16
#> 3: chr1 205336065 205357090 - KLHDC8A ENSG00000162873.14
#> 4: chr1 204828651 205022822 + NFASC ENSG00000163531.15
#> 5: chr1 204190341 204196486 - KISS1 ENSG00000170498.8
#> 6: chr1 204198160 204214092 - GOLT1A ENSG00000174567.7
#> rsid position pValue maf
#> 1: rs55664108 204587862 8.572e-25 0.7157
#> 2: rs55664108 204587862 8.572e-25 0.7157
#> 3: rs55664108 204587862 8.572e-25 0.7157
#> 4: rs55664108 204587862 8.572e-25 0.7157
#> 5: rs55664108 204587862 8.572e-25 0.7157
#> 6: rs55664108 204587862 8.572e-25 0.7157
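The selection of trait genes can be sketched as a window overlap followed by an intersection with the eGene list. The tables and the eGene vector below are invented, and measuring distance against the gene body (rather than, say, the transcription start site) is an assumption of this sketch, not a statement about how xQTLanalyze_getTraits works internally.

```r
# Invented sentinel SNPs and gene annotations.
sentinel <- data.frame(
  rsid = c("rsA", "rsB"),
  chrom = c("chr1", "chr2"),
  position = c(5e6, 2e6),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  geneSymbol = c("G1", "G2", "G3", "G4"),
  chrom = c("chr1", "chr1", "chr2", "chr2"),
  geneStart = c(4.2e6, 7.5e6, 2.5e6, 1e5),
  geneEnd = c(4.3e6, 7.6e6, 2.6e6, 2e5),
  stringsAsFactors = FALSE
)
egenes <- c("G1", "G3", "G4")  # hypothetical eGenes in the tissue of interest
detectRange <- 1e6

# Pair every sentinel SNP with every gene on the same chromosome.
pairs <- merge(sentinel, genes, by = "chrom")

# A gene is in range if its body overlaps [position - range, position + range].
in_range <- pairs$geneEnd >= pairs$position - detectRange &
  pairs$geneStart <= pairs$position + detectRange

# Final trait genes: in range AND an eGene.
traits <- pairs[in_range & pairs$geneSymbol %in% egenes, ]
print(traits[, c("rsid", "geneSymbol", "geneStart", "geneEnd")])
```

Here G2 is excluded because it lies outside the window, and G4 is excluded for the same reason even though it is an eGene.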
Three steps of colocalization analysis are encapsulated in one function, xQTLanalyze_coloc.
For the 835 trait genes above, a for loop can be used to obtain the colocalization results of each gene (this may take several hours):
genesAll <- unique(traitsAll$gencodeId)
colocResultAll <- data.table()
# Run the colocalization analysis gene by gene and collect the summaries:
for(i in seq_along(genesAll)){
  colocResult <- xQTLanalyze_coloc(gwasDF, genesAll[i],
                                   genomeVersion = "grch37",
                                   tissueSiteDetail=tissueSiteDetail)$coloc_Out_summary
  if(!is.null(colocResult)){ colocResultAll <- rbind(colocResultAll, colocResult)}
  message(format(Sys.time(), "== %Y-%b-%d %H:%M:%S ")," == Id:",i,"/",length(genesAll)," == Gene:",genesAll[i])
}
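For a loop that runs for hours, it can help to make each iteration fault tolerant so that one failed gene does not abort the whole run. The sketch below shows one possible pattern with tryCatch; run_one is a hypothetical stand-in for the per-gene analysis (it does not call xQTLbiolinks), and the gene names are invented.

```r
# Hypothetical stand-in for a per-gene analysis:
# returns NULL when there is no result and fails for one gene.
run_one <- function(gene) {
  if (gene == "geneB") return(NULL)
  if (gene == "geneC") stop("simulated download failure")
  data.frame(traitGene = gene, PP.H4.abf = 0.5, stringsAsFactors = FALSE)
}

genes <- c("geneA", "geneB", "geneC", "geneD")

# Collect one result per gene; errors are reported and turned into NULL.
results <- lapply(genes, function(g) {
  tryCatch(
    run_one(g),
    error = function(e) {
      message("Skipping ", g, ": ", conditionMessage(e))
      NULL
    }
  )
})

# Drop empty results and bind the rest into one table.
results <- Filter(Negate(is.null), results)
combined <- do.call(rbind, results)
print(combined)
```

Collecting results in a list and binding once at the end also avoids repeatedly copying a growing table.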
In this case, we invoke the function coloc.abf from the package coloc to estimate the posterior support for each of the five hypotheses (H0, H1, H2, H3 and H4).
The output is a data.table that combines the coloc_Out_summary results of all genes. The posterior probabilities of each hypothesis (H0-H4) are listed in colocResultAll:
head(colocResultAll)
#> nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
#> 1: 5666 3.339109e-19 0.333928806 6.435451e-19 0.6435572 2.251401e-02
#> 2: 5509 6.902252e-20 0.069026225 9.265196e-19 0.9265649 4.408886e-03
#> 3: 5072 6.447026e-19 0.644737236 3.172030e-19 0.3171819 3.808084e-02
#> 4: 5863 4.318128e-22 0.000431836 9.994821e-19 0.9995357 3.245264e-05
#> 5: 5688 5.603139e-19 0.560343948 3.653347e-19 0.3652799 7.437616e-02
#> 6: 5661 3.421306e-19 0.342148992 6.147008e-19 0.6146906 4.316040e-02
#> traitGene
#> 1: ENSG00000117222.13
#> 2: ENSG00000133059.16
#> 3: ENSG00000162873.14
#> 4: ENSG00000163531.15
#> 5: ENSG00000170498.8
#> 6: ENSG00000174567.7
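To make the five posterior probabilities more concrete: in the coloc framework, H0 means no association with either trait, H1 and H2 mean association with only one of the two traits, H3 means both traits are associated but with different causal variants, and H4 means both traits share a single causal variant. The sketch below shows how per-SNP approximate Bayes factors can be combined into these probabilities. It is a simplified base-R illustration on invented data, not the coloc source code; the prior probabilities (1e-4, 1e-4, 1e-5) and prior standard deviations (0.2 and 0.15) follow the commonly documented coloc.abf defaults and should be treated as assumptions here.

```r
# Numerically stable log(sum(exp(x))).
logsum <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Log approximate Bayes factor per SNP from an effect estimate and its SE.
# W is the prior variance of the effect size (an assumption of this sketch).
log_abf <- function(beta, se, W) {
  z <- beta / se
  r <- W / (W + se^2)
  0.5 * (log(1 - r) + r * z^2)
}

# Combine the per-SNP log ABFs of two traits into PP.H0 ... PP.H4.
combine_abf <- function(l1, l2, p1 = 1e-4, p2 = 1e-4, p12 = 1e-5) {
  pairSum <- outer(l1, l2, "+")              # all (SNP i, SNP j) pairs
  diffSnp <- row(pairSum) != col(pairSum)    # pairs with two different SNPs
  lH <- c(
    0,                                              # H0: no association
    log(p1) + logsum(l1),                           # H1: trait 1 only
    log(p2) + logsum(l2),                           # H2: trait 2 only
    log(p1) + log(p2) + logsum(pairSum[diffSnp]),   # H3: distinct variants
    log(p12) + logsum(l1 + l2)                      # H4: shared variant
  )
  pp <- exp(lH - logsum(lH))
  names(pp) <- paste0("PP.H", 0:4, ".abf")
  pp
}

# Invented data: five SNPs, with SNP 3 strongly associated with both traits.
gwasBeta <- c(0.01, 0.02, 0.18, 0.01, -0.02); gwasSe <- rep(0.03, 5)
eqtlBeta <- c(0.02, -0.01, 0.25, 0.03, 0.01); eqtlSe <- rep(0.05, 5)

pp <- combine_abf(
  log_abf(gwasBeta, gwasSe, W = 0.2^2),
  log_abf(eqtlBeta, eqtlSe, W = 0.15^2)
)
print(round(pp, 4))
```

Because the same SNP carries the signal in both toy datasets, almost all of the posterior mass falls on H4.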
To save time and go through this case as soon as possible, you can get the above result directly with:
colocResultAll <- fread("https://raw.githubusercontent.com/dingruofan/exampleData/master/colocResultAll.txt")
We considered colocalization tests with a posterior probability of hypothesis 4 (PPH4.ABF) ≥ 0.75 as having strong or moderate evidence for colocalization.
colocResultsig <- colocResultAll[PP.H4.abf>0.75][order(-PP.H4.abf)]
There are 27 trait genes that are associated and share a single causal variant:
head(colocResultsig)
#> nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
#> 1: 7108 3.722120e-16 7.637674e-04 1.159909e-15 0.001382244 0.9978540
#> 2: 5034 2.405778e-32 1.349362e-05 8.271165e-30 0.003642818 0.9963437
#> 3: 6045 1.551367e-07 1.220850e-05 1.078421e-04 0.007494261 0.9923855
#> 4: 5389 1.594413e-24 5.835862e-07 2.875803e-20 0.009535535 0.9904639
#> 5: 6154 4.387457e-18 6.631626e-03 6.233624e-18 0.008437170 0.9849312
#> 6: 4616 1.043225e-19 1.214959e-02 6.298351e-20 0.006353677 0.9814967
#> traitGene
#> 1: ENSG00000137673.8
#> 2: ENSG00000167641.10
#> 3: ENSG00000184058.12
#> 4: ENSG00000115486.11
#> 5: ENSG00000179409.10
#> 6: ENSG00000277744.1
All these genes’ details can be fetched with xQTLquery_gene:
outGenes <- xQTLquery_gene(colocResultsig$traitGene)
Add the value of PPH4 for each gene, and remove non-protein-coding genes:
outGenes <- merge(colocResultsig[,.(gencodeId= traitGene, PP.H4.abf)],
                  outGenes[,.(geneSymbol, gencodeId, entrezGeneId, geneType)],
                  by="gencodeId", sort=FALSE)
outGenes <- outGenes[geneType =="protein coding"]
outGenes
#> gencodeId PP.H4.abf geneSymbol entrezGeneId geneType
#> 1: ENSG00000137673.8 0.9978540 MMP7 4316 protein coding
#> 2: ENSG00000167641.10 0.9963437 PPP1R14A 94274 protein coding
#> 3: ENSG00000184058.12 0.9923855 TBX1 6899 protein coding
#> 4: ENSG00000115486.11 0.9904639 GGCX 2677 protein coding
#> 5: ENSG00000179409.10 0.9849312 GEMIN4 50628 protein coding
#> 6: ENSG00000117280.12 0.9701717 RAB29 8934 protein coding
#> 7: ENSG00000184012.11 0.9652282 TMPRSS2 7113 protein coding
#> 8: ENSG00000069275.12 0.9625282 NUCKS1 64710 protein coding
#> 9: ENSG00000099331.13 0.9607259 MYO9B 4650 protein coding
#> 10: ENSG00000155749.12 0.9538274 ALS2CR12 130540 protein coding
#> 11: ENSG00000118961.14 0.9294589 LDAH 60526 protein coding
#> 12: ENSG00000101751.10 0.9271138 POLI 11201 protein coding
#> 13: ENSG00000167695.14 0.9203321 FAM57A 79850 protein coding
#> 14: ENSG00000172613.7 0.9099755 RAD9A 5883 protein coding
#> 15: ENSG00000180535.3 0.8941134 BHLHA15 168620 protein coding
#> 16: ENSG00000003400.14 0.8932643 CASP10 843 protein coding
#> 17: ENSG00000115648.13 0.8889021 MLPH 79083 protein coding
#> 18: ENSG00000162877.12 0.8849219 PM20D1 148811 protein coding
#> 19: ENSG00000204536.13 0.8733027 CCHCR1 54535 protein coding
#> 20: ENSG00000091844.7 0.8629044 RGS17 26575 protein coding
#> 21: ENSG00000083937.8 0.8388545 CHMP2B 25978 protein coding
#> 22: ENSG00000136819.15 0.8365468 C9orf78 51759 protein coding
#> 23: ENSG00000065060.16 0.7937810 UHRF1BP1 54887 protein coding
#> 24: ENSG00000198625.12 0.7695907 MDM4 4194 protein coding
#> gencodeId PP.H4.abf geneSymbol entrezGeneId geneType
A ridgeline plot can be used to compare the expression of these genes:
xQTLvisual_genesExp(outGenes$geneSymbol, tissueSiteDetail=tissueSiteDetail)
The trait gene MMP7, which has the highest PPH4.ABF (0.9978), encodes a member of the peptidase M10 family of matrix metalloproteinases, which is involved in the breakdown of extracellular matrix in normal physiological processes, such as embryonic development, reproduction, and tissue remodeling, as well as in disease processes, such as arthritis and metastasis (“RefSeq,” n.d.). Prostate cancer can be promoted by interleukin-17 via MMP7-induced epithelial-to-mesenchymal transition (Zhang et al. 2017). Recent literature has shown that serum MMP7 levels could guide metastatic therapy for prostate cancer (Tregunna 2020).
Expression of MMP7 in multiple tissues can be plotted with xQTLvisual_geneExpTissues:
geneExpTissues <- xQTLvisual_geneExpTissues("MMP7", log10y = TRUE)
The number and significance of eQTLs in different tissues can reveal whether an effect is tissue-specific or ubiquitous. The function xQTLvisual_eqtl can be used to indicate whether the gene is widely regulated across tissues.
xQTLvisual_eqtl("MMP7")
In addition, we provide the functions xQTLvisual_locusCompare and xQTLvisual_locusZoom to visualize the colocalization between the GWAS and the eQTL dataset for a specified gene. We take the gene MMP7 as an example:
# Download all eQTL associations of gene MMP7 in prostate:
eqtlAsso <- xQTLdownload_eqtlAllAsso(gene="MMP7",tissueLabel = tissueSiteDetail)
# Merge the variants of GWAS and eQTL dataset by rsid:
gwasEqtldata <- merge(gwasDF[,-c("AF")], eqtlAsso[,.(rsid=snpId, pValue)],
                      by=c("rsid"), suffixes = c(".gwas",".eqtl"))
Five retained fields are required:
gwasEqtldata
#> rsid chrom position pValue.gwas pValue.eqtl
#> 1: rs10000104 chr4 10190142 0.2830 0.4986640
#> 2: rs10000248 chr4 10720872 0.0359 0.6089460
#> 3: rs10000318 chr4 11613397 0.5810 0.3734920
#> 4: rs10000369 chr4 11234631 0.0453 0.6642180
#> 5: rs10000399 chr4 11595842 0.2370 0.0948472
#> ---
#> 8007: rs9999345 chr4 10895379 0.0861 0.0100352
#> 8008: rs9999470 chr4 10315399 0.2300 0.0700527
#> 8009: rs9999523 chr4 11410463 0.9930 0.2258890
#> 8010: rs9999669 chr4 10852509 0.2470 0.0441530
#> 8011: rs9999767 chr4 10127270 0.8730 0.1902930
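The merge that produces these five fields is an inner join on rsid, so only variants present in both datasets are kept, and the suffixes distinguish the two p-value columns. A minimal base-R sketch on invented tables (the real objects above are data.tables, but merge behaves analogously here):

```r
# Invented GWAS and eQTL tables sharing an rsid column and a pValue column.
gwas <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  chrom = "chr11",
  position = c(100, 200, 300),
  pValue = c(1e-8, 0.2, 0.03),
  stringsAsFactors = FALSE
)
eqtl <- data.frame(
  rsid = c("rs2", "rs3", "rs4"),
  pValue = c(0.5, 1e-6, 0.9),
  stringsAsFactors = FALSE
)

# Inner join by rsid; clashing column names receive the given suffixes.
both <- merge(gwas, eqtl, by = "rsid", suffixes = c(".gwas", ".eqtl"))

# Locus plots are usually drawn on the -log10(p) scale.
both$logP.gwas <- -log10(both$pValue.gwas)
both$logP.eqtl <- -log10(both$pValue.eqtl)
print(both)
```

Variants rs1 and rs4 disappear from the result because each occurs in only one of the two tables.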
Visualization of p-value distribution and comparison of the signals of GWAS and eQTL:
xQTLvisual_locusCompare(gwasEqtldata[,.(rsid, pValue.eqtl)],
                        gwasEqtldata[,.(rsid, pValue.gwas)],
                        legend_position = "bottomright")
Locuszoom plot of GWAS signals:
xQTLvisual_locusZoom(gwasEqtldata[,.(rsid, chrom, position, pValue.gwas)], legend=FALSE)
Locuszoom plot of eQTL signals:
xQTLvisual_locusZoom(gwasEqtldata[,.(rsid, chrom, position, pValue.eqtl)], legend=FALSE)
We can also combine the locuscompare and locuszoom plots using the function xQTLvisual_locusCombine:
xQTLvisual_locusCombine(gwasEqtldata[,c("rsid","chrom", "position", "pValue.gwas", "pValue.eqtl")])
From the above figures, we can see that the SNP rs11568818 is a potential causal variant, and we can use a violin plot to show its normalized effect size:
xQTLvisual_eqtlExp("rs11568818", "MMP7", tissueSiteDetail = tissueSiteDetail)
To gain insight into the function of trait genes with high PPH4 and to explore the potential regulatory mechanism of prostate cancer, we can conduct exploratory analyses such as co-expression analysis and gene ontology enrichment analysis.
First, we download the expression profiles in prostate of the genes with a high PPH4 (>0.75).
expMat <- xQTLdownload_exp(outGenes$gencodeId, tissueSiteDetail=tissueSiteDetail, toSummarizedExperiment =FALSE)
Pearson correlation coefficients between each pair of genes can be calculated from the expression matrix:
corDT <- cor(t(expMat[,-1:-6]))
colnames(corDT) <- outGenes$geneSymbol
rownames(corDT) <- outGenes$geneSymbol
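The transpose in the call above matters because cor computes correlations between columns, whereas the expression matrix has one row per gene. A minimal sketch with simulated values (the gene names and data are invented):

```r
set.seed(1)
n_samples <- 30
shared <- rnorm(n_samples)

# One row per gene, one column per sample; geneA and geneB share a signal.
expr <- rbind(
  geneA = shared + rnorm(n_samples, sd = 0.1),
  geneB = shared + rnorm(n_samples, sd = 0.1),
  geneC = rnorm(n_samples)
)

# Transpose so that genes become columns, then correlate gene against gene.
corMat <- cor(t(expr), method = "pearson")
print(round(corMat, 2))
```

The result is a symmetric gene-by-gene matrix with ones on the diagonal, which is the form corrplot expects.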
The R package corrplot is used to display this correlation matrix:
library(corrplot)
corrplot(corDT, method="color",
type="upper",
order = "hclust",
addCoef.col = "#ff0099",
number.cex = 0.7)
The R package clusterProfiler is used for gene functional annotation:
library(clusterProfiler)
# The human annotation database supplies the OrgDb object used below:
library(org.Hs.eg.db)
ego <- enrichGO(gene = as.character(outGenes$entrezGeneId),
                OrgDb = org.Hs.eg.db,
                ont= "BP",
                pAdjustMethod ="none",
                readable = TRUE)
dotplot(ego, showCategory=15)
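Over-representation analyses of this kind are typically based on a hypergeometric test: given a background of N genes, of which M are annotated to a GO term, how surprising is it to see k annotated genes among n query genes? The sketch below computes such a p-value in base R with invented counts, and shows what a multiple-testing correction would do to a set of raw p-values (the analysis above uses pAdjustMethod ="none", so its p-values are unadjusted).

```r
# Invented counts for a single GO term.
N <- 20000  # background genes (assumed)
M <- 200    # background genes annotated to the term
n <- 24     # query genes
k <- 4      # query genes annotated to the term

# P(X >= k) under the hypergeometric distribution.
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
print(p_value)

# Benjamini-Hochberg adjustment over several (invented) term-level p-values.
raw_p <- c(p_value, 0.04, 0.5)
adj_p <- p.adjust(raw_p, method = "BH")
print(adj_p)
```

With only a few dozen query genes, unadjusted p-values should be read as exploratory, since many GO terms are tested at once.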
Viral-related GO terms are enriched in the above analysis, including "viral life cycle", "positive regulation of viral life cycle" and "positive regulation of viral process". Previous studies have highlighted the role of viral infections in the initiation or progression of prostate cancer. Viruses such as human papillomavirus (HPV) and herpesviruses, including cytomegalovirus (CMV), human herpes simplex virus type 2 (HSV2), human herpesvirus type 8 (HHV8) and Epstein-Barr virus (EBV), can infect the prostate (Abidi et al. 2018). However, the genetic effects of the causal variants on the phenotype, and whether the trait genes have a direct association with prostate carcinogenesis, have not yet been established.