---
title: "Using the NHLBI GRASP repository of GWAS test results with Bioconductor"
author: "Vincent J. Carey"
date: "October 9, 2014"
output:
  BiocStyle::pdf_document:
    toc: yes
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{grasp2db: interface to a high density GWAS catalogue}
-->

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
suppressPackageStartupMessages(library(grasp2db))
```

# Introduction

GRASP (Genome-Wide Repository of Associations Between SNPs and
Phenotypes) v2.0 was released in September 2014.  The [primary GRASP
web resource](http://apps.nhlbi.nih.gov/Grasp/Overview.aspx) includes
links to a web-based query interface.  This document describes a
Bioconductor package that replicates information in the v2.0 [textual
release](https://s3.amazonaws.com/NHLBI_public/GRASP/GraspFullDataset2.zip).

The primary reference for version 2 is: Eicher JD, Landowski C,
Stackhouse B, Sloan A, Chen W, Jensen N, Lien J-P, Leslie R, Johnson
AD (2014) GRASP v 2.0: an update to the genome-wide repository of
associations between SNPs and phenotypes. Nucl Acids Res, published
online Nov 26, 2014 PMID 25428361

From the main web page: 

GRASP includes all available genetic association results from papers, their supplements and web-based content meeting the following guidelines:

* All associations with P<0.05 from GWAS defined as >= 25,000 markers
  tested for 1 or more traits.

* Study exclusion criteria: CNV-only studies, replication/follow-up
  studies testing <25K markers, non-human only studies, article not in
  English, gene-environment or gene-gene GWAS where single SNP main
  effects are not given, linkage only studies, aCGH/LOH only studies,
  heterozygosity/homozygosity (genome-wide or long run) studies,
  studies only presenting gene-based or pathway-based results,
  simulation-only studies, studies which we judge as redundant with
  prior studies since they do not provide significant inclusion of new
  samples or exposure of new results (e.g., many methodological papers
  on the WTCCC and FHS GWAS).

* More detailed methods and resources used in constructing the catalog
  are described at [Methods &
  Resources](http://apps.nhlbi.nih.gov/Grasp/Resources.aspx).

* Terms of Use for GRASP data:
  [http://apps.nhlbi.nih.gov/Grasp/Terms.aspx](http://apps.nhlbi.nih.gov/Grasp/Terms.aspx)
 
* Medical disclaimer: [http://apps.nhlbi.nih.gov/Grasp/Overview.aspx]
(http://apps.nhlbi.nih.gov/Grasp/Overview.aspx)

# Installation

Install the package in Bioconductor version 3.1 or later using
`BiocInstaller::biocLite("grasp2db")`.

The first time the grasp2 data base is referenced, via the `GRASP2()`
function described below, a large (5.3Gb) file is downloaded to a
local cache using `r Biocpkg("AnnotationHub")`. This may take a
considerable length of time (10's of minutes, perhaps an hour or more
depending on internet connections). Subsequent uses refer to the
locally cached file, and are fast.

# Demonstration

## Attachment and messaging

Attach the package.

```{r loadup2}
library(grasp2db)
```

Workflows start with a reference to the data base.

```{r grasp2}
grasp2 <- GRASP2()
grasp2
```

There are three tables. The 'variant' table summarizes `r tbl(grasp2,
"variant") %>% nrow` variants. The 'study' table contains
`r tbl(grasp2, "study") %>% nrow` citations from which the data are
extracted; the 'variant' and 'study' tables are related by PMID
identifier. The 'count' table contains `r tbl(grasp2, "count") %>%
nrow` records summarizing SNPs observed in Discovery and Replication
samples from 12 distinct Populations; theh 'variant' and 'count'
tables are related by NHLBIkey.

## Indexed p-value bins

The database has been indexed on a number of fields, and
an integer rounding of $-\log_{10}$ of the quantity recorded
as the `Pvalue` of the association is available.

```{r lkp}
variant <- tbl(grasp2, "variant")
q1 = (variant %>% select(Pvalue, NegativeLog10PBin) %>%
       filter(NegativeLog10PBin > 8) %>% 
       summarize(maxp = max(Pvalue), n=n()))
q1
```

This shows that the quantities in NegativeLog10PBin are upper bounds
on the exponents of the p-values in the integer-labeled bins defined
by this quantity.

A useful utility from dplyr is the query explain method:

```{r lkex}
explain(q1)
```

## Tabulations

This query select variants with large effect from the 'variant' table,
and joins them to their published phenotypic effect in the 'study' table.

```{r lkp2}
study <- tbl(grasp2, "study")
large_effect <- 
    variant %>% select(PMID, SNPid_dbSNP134, NegativeLog10PBin) %>% 
        filter(NegativeLog10PBin > 5)
phenotype <-
    left_join(large_effect,
              study %>% select(PMID, PaperPhenotypeDescription))
phenotype
```

## Inspecting some relatively weak associations in asthma

```{r doasth}
lkaw <- semi_join(
    variant %>%
        filter(NegativeLog10PBin <= 4) %>% 
        select(PMID, chr_hg19, SNPid_dbSNP134, PolyPhen2),
    study %>% filter(PaperPhenotypeDescription == "Asthma"))
```

We materialize the filtered table into
a data.frame and check how many PolyPhen2 notations including
a substring of 'Damaging':

```{r dogre}
lkaw %>% filter(PolyPhen2 %like% "%amaging%")
```

# Quick view of the basic interfaces

## dplyr-oriented

```{r lkbasic}
grasp2
```

## RSQLite-oriented

```{r lkcon}
gcon = grasp2$con
library(RSQLite)
gcon
dbListTables(gcon)
```

Note that the package opens the SQLite data base in 'read-only' mode,
but updates are possible (e.g., directly opening a connection to the
data base without restricting access).  There may be implicit control
if the user does not have write access to the file.

# Some QC (Consistency check): Are NHGRI GWAS catalog loci included?

We have an image of the NHGRI GWAS catalog inheriting
from GRanges.
```{r getgw}
library(gwascat)
data(gwrngs19)  # hg19 addresses; NHGRI ships hg38
gwrngs19
```

We would like to verify that the majority of the variants enumerated
in the NHGRI catalog are also present in GRASP 2.0.  We supply a
function called checkAnti which obtains the anti-join between a
chromosome-specific slice of the NHGRI catalogue and the slice of
GRASP2 for the same chromosome.  We compute for chromosome 22 the
fraction of NHGRI records present in GRASP2.

```{r lkanti}
gr22 = variant %>% filter(chr_hg19 == "22")
abs22 = checkAnti("22")
1 - (abs22 %>% nrow()) / 
    (gr22 %>% count %>% collect %>% `[[`("n"))
```

The absent records can be seen to be relatively recent additions
to the NHGRI catalog.
```{r lkabs}
abs22
```