%\VignetteIndexEntry{dsQTL, data excerpt from Degner et al. 2012 Nature letter}
%\VignetteDepends{dsQTL, Biobase, SummarizedExperiment, GGBase, GGtools, rtracklayer}
%\VignetteKeywords{genetics, Expression Analysis}
%\VignettePackage{dsQTL}


%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
\usepackage{auto-pst-pdf}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{dsQTL: exploring DNA-variants associated with DNaseI hypersensitivity}
\author{VJ Carey}
\maketitle

\section{Introduction}

Degner et al. (2012) publish information on associations
between DNA variants (SNP, SNV, and indels) and DNaseI hypersensitivity
measures acquired via DNase-Seq.

This package includes information from the Chicago group on
normalized DNase-seq data, and genotype data
from chromosome 2 only.

NOTES ADDED JANUARY 2016.  The DHStop5\_hg19 SummarizedExperiment
instance is the primary data resource of use at this date, although
most of the legacy computations described but not run throughout
the later parts of this vignette can be made to work.

With the SummarizedExperiment, analysis tools in gQTLstats can be
used to perform dsQTL identification, with genotype information
housed in tabix-indexed VCF.  This is a highly scalable approach.

Here is how we can search for dsQTL by probing into the 1000 genomes
VCF in an Amazon S3 bucket
for the YRI subject assayed for DNaseI hypersensitivity.
We use the first 5 assay results recorded for chromosome 17.
This approach assumes internet connectivity and hence is
not evaluated on the build system.
<<doone,eval=FALSE>>=
library(geuvPack)
library(dsQTL)
tf17 = gtpath(17, useS3=TRUE)
data(DHStop5_hg19)
dhs17 = DHStop5_hg19[ seqnames(DHStop5_hg19) == "chr17", ]
seqlevelsStyle(dhs17) = "NCBI"
library(gQTLstats)
c17_1 = cisAssoc(dhs17[1:5,], tf17, cisradius=5000)
c17_1
@

Remaining material is not executed.

\section{The basic data structure}



<<lkd,eval=FALSE>>=
library(dsQTL)
data(DSQ_17)
metadata(DSQ_17)
metadata(DSQ_17)[[1]]
@

We use summarized experiment structure for the assay data, but the imputed
genotype data are kept separate, in the package, in the inst/parts folder.

The data structure on chr2, which will be used to reproduce some findings,
is more mature
<<lkd2,eval=FALSE>>=
data(DSQ_2)
names(assays(DSQ_2))
assays(DSQ_2)[[1]][1:5,1:5]
rowRanges(DSQ_2)
@

To implement the GGBase protocol for on-the-fly generation of smlSet instances
from getSS queries, we have an ExpressionSet instance with specific names.
<<lkd3,eval=FALSE>>=
data(eset, package="dsQTL")
ex
@ 


The genotype data supplied by Degner et al are imputed to 1000 genomes
haplotypes, and are reals
in [0,2].   For simplicity the current image of the data uses the
rounding of the fractional genotypes x with round(x,0).

The feature data refer to the retained 100bp segments that
were summarized for DNaseI hypersensitivity and found to lie in the
uppermost 5\% of the distribution.
<<lkf,eval=FALSE>>=
library(Biobase)
fData(ex)[1:5,,drop=FALSE]
@


We can get the integrated container as
<<domo,eval=FALSE>>=
library(GGBase)
ds2 = getSS("dsQTL", "roundGT_2")
@
the name indicates that we simply rounded the imputed fractional genotypes
to nearest integer.

A very restricted search is:
<<dose,cache=TRUE,eval=FALSE>>=
# need to get rid of SNPlocs package getSNPlocs
getSNPlocs = dsQTL::getSNPlocs  # force
library(GGtools)
#library(parallel)
#options(mc.cores=12)
n1 = best.cis.eQTLs(smpack="dsQTL", radius=2000, geneannopk="dsQTL",
  snpannopk="dsQTL", chrnames="2", smchrpref="roundGT_",
  smFilter = function(x) GTFfilter(x, lower=0.05)[23810:23830,], 
#  geneApply=mclapply)
  geneApply=lapply)
<<lkpr,eval=FALSE>>=
n1
@

<<lkhi,fig=TRUE,eval=FALSE>>=
plot_EvG(probeId("dhs_2_45370802"), rsid("chr2.45370846"), getSS("dsQTL", "roundGT_2",
  wrapperEndo=function(x){annotation(x)="dsQTL"; x}))
@

\section{Provenance}

\subsection{Normed DNaseI hypersensitivity scores}
The dsQTL package data structures \verb+DSQ_2+ and \verb+DSQ_17+
are generated from GEO GSE31388, from which a collection of 70 compressed
BED format files were acquired Aug 9 2011.  These are imported
using the rtracklayer package to obtain location and score information
for all the recorded DNaseI hypersensitivity assay results.  For
example, after import for NA19257, we have

\begin{verbatim}
Browse[1]> x
[1] "NA19257"
Browse[1]> tmp
RangedData with 1465907 rows and 3 value columns across 22 spaces
           space               ranges   |        name       score   strand
        <factor>            <IRanges>   | <character>   <numeric> <factor>
1           chr1       [  402,   501]   |         NOT -0.67088720        +
2           chr1       [  502,   601]   |         NOT -1.69969288        +
3           chr1       [  602,   701]   |         NOT  0.13520754        +
...          ...                  ... ...         ...         ...      ...
1465905    chr22 [49571602, 49571701]   |         NOT  0.62742318        +
1465906    chr22 [49575102, 49575201]   |         NOT -0.09417379        +
1465907    chr22 [49581602, 49581701]   |         NOT -0.29496269        +
\end{verbatim}

The scores for locations on chromosome 2 were collected using
<<showp1,eval=FALSE>>=
proc1 = function(x) {
 library(rtracklayer)
 tmp = import(paste(x, ".qnorm.bed.gz", sep=""))
 stt = split(tmp, space(tmp))
 obn = paste(x, "_dsq_chr2", sep="")
 assign(obn, stt[["chr2"]])
 save(list=obn, file=paste(obn, ".rda", sep=""))
 NULL
}
@
The regions and scores reported are described in the GEO metadata as
\begin{quotation}
We also provide BED file format data for each individual for the top 5\% of the genome in terms of total sensitivity. This data was mapped to hg18 using a custom read-mapping algorithm which we describe in detail in the associated publication. Measures of DNase sensitivity were quantile normalized within each individual to a standard normal distribution. Each individual was corrected for GC bias and the top 4 principle (sic) components were removed from the data (See manuscript).
\end{quotation}

Score data were structured as a matrix with columns corresponding to
Yoruba HapMap subject, and rows corresponding to reported
hypersensitivity regions.  

The \texttt{RangedSummarizedExperiment} container is used to unite
range and score data in the assays component, and allied metadata
are available in metadata and colData components.

\subsection{Genotype data}

Textual representation of the allelic doses is provided at
\url{http://eqtl.uchicago.edu/dsQTL_data/GENOTYPES/}.
As of Oct 2012, these were rounded to allele counts to
allow use of snpMatrix representation for chromosome 2 genotypes; 
propagation of
dosage fractions will be undertaken in late 2012.



\section{Session information}
<<lks,eval=TRUE>>=
sessionInfo()
@


\end{document}