--- title: "gwascat: structuring and querying the NHGRI GWAS catalog" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{gwascat: structuring and querying the NHGRI GWAS catalog} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes bibliography: gwascat.bib --- ```{r setup,echo=FALSE,results="hide"} suppressMessages({ suppressPackageStartupMessages({ library(gwascat) library(dplyr) library(GenomeInfoDb) library(rtracklayer) library(ggbio) }) }) ``` # Introduction NHGRI maintains and routinely updates a database of selected genome-wide association studies. This document describes R/Bioconductor facilities for working with contents of this database. ## Installation The package can be installed using Bioconductor's \textit{BiocManager} package, with the sequence \begin{verbatim} library(BiocManager) BiocManager::install("gwascat") \end{verbatim} ## Attachment and access to documentation Once the package has been installed, use \verb+library(gwascat)+ to obtain interactive access to all the facilities. After executing this command, use \verb+help(package="gwascat")+ to obtain an overview. The current version of this vignette can always be accessed at www.bioconductor.org, or by suitably navigating the web pages generated with \texttt{help.start()}. ```{r getlib,echo=FALSE,results='hide'} library(gwascat) ``` ## Using tidy methods -- added August 2022 ```{r dotidy} gwcat = get_cached_gwascat() library(dplyr) gwcat |> arrange(DATE) ``` We can produce a GRanges in two forms. By default we get an mcols that has a small set of columns. Note that records that lack a `CHR_POS` value are omitted. Records that have complicated `CHR_POS` values, including semicolons or " x " notation are kept, but only the first position is retained. The `CHR_ID` field may have complicated character values, these are not normalized, and are simply used as `seqnames` "as is". ```{r getgr} gg = gwcat |> as_GRanges() gg ``` We can set the seqinfo as follows, retaining only records that employ standard chromosomes. ```{r dosei} library(GenomeInfoDb) data(si.hs.38) gg = keepStandardChromosomes(gg, pruning="coarse") seqlevels(gg) = seqlevels(si.hs.38) seqinfo(gg) = si.hs.38 gg ``` ## Getting a recent version of the GWAS catalog We use BiocFileCache to manage downloaded TSV from EBI's download site. The file is provided without compression, so prepare for 200+MB download if you are not working from a cache. There is no etag set, so you have to check for updates on your own. ```{r lkadasd} args(get_cached_gwascat) ``` This is converted to a manageable extension of GRanges using `process_gwas_dataframe`. ```{r lkasdasdasd} args(process_gwas_dataframe) ``` # Illustrations: computing ```{r lk1,echo=FALSE,results='hide'} if (length(grep("gwascat", search()))>0) detach("package:gwascat") ``` Available functions are: ```{r lk2} library(gwascat) objects("package:gwascat") ``` An extended GRanges instance with a sample of 50000 SNP-disease associations reported on 30 April 2020 is obtained as follows, with addresses based on the GRCh38 genome build. We use `gwtrunc` to refer to it in the sequel. ```{r lkgr} data(ebicat_2020_04_30) gwtrunc = ebicat_2020_04_30 ``` To determine the most frequently occurring traits in this sample: ```{r lktr} topTraits(gwtrunc) ``` For a given trait, obtain a GRanges with all recorded associations; here only three associations are shown: ```{r lklocs} subsetByTraits(gwtrunc, tr="LDL cholesterol")[1:3] ``` # Some visualizations ## Basic Manhattan plot A basic Manhattan plot is easily constructed with the ggbio package facilities. Here we confine attention to chromosomes 4:6. First, we create a version of the catalog with $-log_{10} p$ truncated at a maximum value of 25. ```{r lkm,results='hide'} requireNamespace("S4Vectors") mcols = S4Vectors::mcols mlpv = mcols(gwtrunc)$PVALUE_MLOG mlpv = ifelse(mlpv > 25, 25, mlpv) S4Vectors::mcols(gwtrunc)$PVALUE_MLOG = mlpv library(GenomeInfoDb) # seqlevelsStyle(gwtrunc) = "UCSC" # no more! seqlevels(gwtrunc) = paste0("chr", seqlevels(gwtrunc)) gwlit = gwtrunc[ which(as.character(seqnames(gwtrunc)) %in% c("chr4", "chr5", "chr6")) ] library(ggbio) mlpv = mcols(gwlit)$PVALUE_MLOG mlpv = ifelse(mlpv > 25, 25, mlpv) S4Vectors::mcols(gwlit)$PVALUE_MLOG = mlpv ```{r dorunap} autoplot(gwlit, geom="point", aes(y=PVALUE_MLOG), xlab="chr4-chr6") ``` \setkeys{Gin}{width=0.95\textwidth} \includegraphics{litman} ## Annotated Manhattan plot A simple call permits visualization of GWAS results for a small number of traits. Note the defaults in this call. ```{r runit,eval=TRUE} traitsManh(gwtrunc) ``` %\setkeys{Gin}{width=0.95\textwidth} %\includegraphics{annman} ## Integrative view of potential genetic determinants The following chunk uses GFF3 data on eQTL and related phenomena distributed at the GBrowse instance at eqtl.uchicago.edu. A request for all information at 43-45 Mb was made on 2 June 2012, yielding the GFF3 referenced below. Of interest are locations and scores of genetic associations with DNaseI hypersensitivity (scores identifying dsQTL, see Degner et al 2012). ```{r getgd} gffpath = system.file("gff3/chr17_43000000_45000000.gff3", package="gwascat") library(rtracklayer) c17tg = import(gffpath) ``` We make a Gviz DataTrack of the dsQTL scores. ```{r lkgvt} c17td = c17tg[ which(S4Vectors::mcols(c17tg)$type == "Degner_dsQTL") ] library(Gviz) dsqs = DataTrack( c17td, chrom="chr17", genome="hg19", data="score", name="dsQTL") ``` We start the construction of the graph here. ```{r dost} g2 = GRanges(seqnames="chr17", IRanges(start=4.3e7, width=2e6)) basic = gwcex2gviz(basegr = gwtrunc, contextGR=g2, plot.it=FALSE) ``` We also collect locations of eQTL in the Stranger 2007 multipopulation eQTL study. ```{r getstr} c17ts = c17tg[ which(S4Vectors::mcols(c17tg)$type == "Stranger_eqtl") ] eqloc = AnnotationTrack(c17ts, chrom="chr17", genome="hg19", name="Str eQTL") displayPars(eqloc)$col = "black" displayPars(dsqs)$col = "red" integ = list(basic[[1]], eqloc, dsqs, basic[[2]], basic[[3]]) ``` Now use Gviz. ```{r doplot} plotTracks(integ) ``` # SNP sets and trait sets ## SNPs by name We can regard the content of a SNP chip as a set of SNP, referenced by name. The pd.genomewidesnp.6 package describes the Affymetrix SNP 6.0 chip. We can determine which traits are associated with loci interrogated by the chip as follows. We work with a subset of the 1 million loci for illustration. The \texttt{locon6} data frame has information on 10000 probes, acquired through the following code (not executed here to reduce dependence on the pd.genomewidesnp.6 package, which is very large. ```{r dobig,eval=FALSE} library(pd.genomewidesnp.6) con = pd.genomewidesnp.6@getdb() locon6 = dbGetQuery(con, "select dbsnp_rs_id, chrom, physical_pos from featureSet limit 10000") ``` Instead use the serialized information: ```{r doloc} data(locon6) rson6 = as.character(locon6[[1]]) rson6[1:5] ``` We subset the GWAS ranges structure with rsids that are common to both the chip and the GWAS catalog. We then tabulate the diseases associated with the common loci. ```{r lkdtab} intr = gwtrunc[ intersect(getRsids(gwtrunc), rson6) ] sort(table(getTraits(intr)), decreasing=TRUE)[1:10] ``` ## Traits by genomic location We will assemble genomic coordinates for SNP on the Affymetrix 6.0 chip and show the effects of identifying the trait-associated loci with regions of width 1000bp instead of 1bp. The following code retrieves coordinates for SNP interrogated on 10000 probes (to save time) on the 6.0 chip, in a GRanges instance that was lifted over to GRCh38. The data statement is preceded by legacy code that produced an instance with hg19 coordinates. ```{r lkexp,keep.source=TRUE} #gr6.0 = GRanges(seqnames=ifelse(is.na(locon6$chrom),0,locon6$chrom), # IRanges(ifelse(is.na(locon6$phys),1,locon6$phys), width=1)) #S4Vectors::mcols(gr6.0)$rsid = as.character(locon6$dbsnp_rs_id) #seqlevels(gr6.0) = paste("chr", seqlevels(gr6.0), sep="") data(gr6.0_hg38) ``` Here we compute overlaps with both the raw disease-associated locus addresses, and with the locus address $\pm$ 500bp. ```{r dosub} ag = function(x) as(x, "GRanges") ovraw = suppressWarnings(subsetByOverlaps(ag(gwtrunc), gr6.0_hg38)) length(ovraw) ovaug = suppressWarnings(subsetByOverlaps(ag(gwtrunc+500), gr6.0_hg38)) length(ovaug) ``` To acquire the subset of the catalog to which 6.0 probes are within 500bp, use: ```{r dosub2} rawrs = mcols(ovraw)$SNPS augrs = mcols(ovaug)$SNPS gwtrunc[augrs] ``` Relaxing the intersection criterion in this limited case leads to a larger set of traits. ```{r lkrelax} nn = setdiff( getTraits(gwtrunc[augrs]), getTraits(gwtrunc[rawrs]) ) length(nn) head(nn) tail(nn) ``` # Counting alleles associated with traits We can use \texttt{riskyAlleleCount} to count risky alleles enumerated in the GWAS catalog. This particular function assumes that we have genotyped at the catalogued loci. Below we will discuss how to impute from non-catalogued loci to those enumerated in the catalog. ```{r lkcout} data(gg17N) # translated from GGdata chr 17 calls using ABmat2nuc gg17N[1:5,1:5] ``` This function can use genotype information in the A/B format, assuming that B denotes the alphabetically later nucleotide. Because we have direct nucleotide coding in our matrix, we set the \texttt{matIsAB} parameter to false in this call. ```{r dorun} h17 = riskyAlleleCount(gg17N, matIsAB=FALSE, chr="ch17", gwwl = gwtrunc) h17[1:5,1:5] table(as.numeric(h17)) ``` It is of interest to bind the counts back to the catalog data. ```{r domo} gwr = gwtrunc gwr = gwr[colnames(h17),] S4Vectors::mcols(gwr) = cbind(mcols(gwr), DataFrame(t(h17))) sn = rownames(h17) gwr[,c("DISEASE/TRAIT", sn[1:4])] ``` Now by programming on the metadata columns, we can identify individuals with particular risk profiles. # Formal management of trait vocabularies ## Diseases: Disease Ontology The Disease Ontology project @Osborne:2009p4571 formalizes a vocabulary for human diseases. Bioconductor's DO.db package is a curated representation. ```{r getdo} library(DO.db) DO() ``` All tokens of the ontology are acquired via: ```{r getallt} alltob = unlist(mget(mappedkeys(DOTERM), DOTERM)) allt = sapply(alltob, Term) allt[1:5] ``` Direct mapping from disease trait tokens in the catalog to this vocabulary succeeds for a modest proportion of records. ```{r dohit} cattra = mcols(gwtrunc)$`DISEASE/TRAIT` mat = match(tolower(cattra), tolower(allt)) catDO = names(allt)[mat] na.omit(catDO)[1:50] mean(is.na(catDO)) ``` Approximate matching of unmatched tokens can proceed by various routes. Some traits are not diseases, and will not be mappable using Disease Ontology. However, consider ```{r dogr} unique(cattra[is.na(catDO)])[1:20] nomatch = cattra[is.na(catDO)] unique(nomatch)[1:5] ``` Manual searching shows that a number of these have very close matches. ## Other phenotypic traits: Human Phenotype Ontology Bioconductor does not possess an annotation package for phenotype ontology, but the standardized OBO format can be parsed and modeled into a graph. ```{r dopar,cache=FALSE} hpobo = gzfile(dir(system.file("obo", package="gwascat"), pattern="hpo", full=TRUE)) HPOgraph = obo2graphNEL(hpobo) close(hpobo) ``` The phenotypic terms are obtained via: ```{r dohpot} requireNamespace("graph") hpoterms = unlist(graph::nodeData(HPOgraph, graph::nodes(HPOgraph), "name")) hpoterms[1:10] ``` Exact hits to unmatched GWAS catalog traits exist: ```{r lkint} intersect(tolower(nomatch), tolower(hpoterms)) ``` More work on formalization of trait terms is underway. %## {Curation of approximate matches} % NB the graph stuff can be used for other OBO without .db packages.. %The \textit{gwascat} package includes an OBO-formatted %image of the ontology and a model for it as a \texttt{graphNEL} instance %as defined in the Bioconductor \textit{graph} package \cite{carey2005}. %The graph model is constructed as follows. %< | [1] 3 [109789570, 109789570] * | A G [2] 3 [ 25922285, 25922285] * | G A [3] 3 [109529550, 109529550] * | T C [4] 3 [175055759, 175055759] * | T G [5] 3 [191912870, 191912870] * | C T ... ... ... ... ... ... ... [16] 3 [187716886, 187716886] * | A G [17] 3 [160820524, 160820524] * | G C [18] 3 [169518455, 169518455] * | T C [19] 3 [179172979, 179172979] * | G T [20] 3 [171785168, 171785168] * | G C CScore PHRED [1] -0.182763 3.110 [2] -0.289708 2.616 [3] 0.225373 5.216 [4] -0.205689 3.003 [5] -0.172189 3.161 ... ... ... [16] -0.019710 3.913 [17] -0.375183 2.235 [18] -0.695270 0.987 [19] -0.441673 1.949 [20] 0.231972 5.252 --- seqlengths: 1 2 3 4 ... 21 22 X 249250621 243199373 198022430 191154276 ... 48129895 51304566 155270560 ``` # Appendix: Adequacy of location annotation A basic question concerning the use of archived SNP identifiers is durability of the association between asserted location and SNP identifier. The \texttt{chklocs} function uses a current Bioconductor SNPlocs package to check this. For example, to verify that locations asserted on chromosome 20 agree between the Bioconductor dbSNP image and the gwas catalog, ```{r getrs,keep.source=TRUE,cache=FALSE,eval=FALSE} if ("SNPlocs.Hsapiens.dbSNP144.GRCh37" %in% installed.packages()[,1]) { library(SNPlocs.Hsapiens.dbSNP144.GRCh37) chklocs("20", gwtrunc) } ``` This is not a fast procedure. # Acknowledgment The development of this software was supported in part by Robert Gentleman and the Computational Biology Group of Genentech, Inc. # References \bibliography{gwascat}