%\VignetteIndexEntry{Notes for eSet developers}
%\VignetteDepends{chopsticks}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{Biobase}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
% With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running
% pdflatex will fail. Note that using auto-pst-pdf requires to set environment
% variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in
% texmf.cnf for Tex Live on Unix/Linux/Mac.
\usepackage{auto-pst-pdf}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Lab exercises.  Rare variant concepts and tools with Bioconductor.}
\author{VJ Carey}
\maketitle

\section{Acknowledgment}

The material covered in this lab depends heavily on software designs and
creations of Martin Morgan and Sean Davis; the \Rpackage{IRanges},
\Rpackage{GenomicRanges}, \Rpackage{BSgenome} and \Rpackage{AnnotationDbi}
infrastructures are also critical for allowing these problems to be stated
and solved concisely, so great thanks are due to Patrick Aboyoun,
Marc Carlson, Mike Lawrence and Herv\'e Pages.

\section{Resources}

Using the 0.1.6 version of samtools, we created a pileup of 1000 genomes
reads from NA19240's solexa image:
\begin{verbatim}
samtools pileup -cvf \ 
  human_b36_female.fa NA19240.chrom17.SLX.maq.SRP000032.2009_07.bam > \
  n240_17.pup
\end{verbatim}

This text file is available to you for parsing as follows:
<<lkd1-1>>=
library(ind1KG)
library(Rsamtools)
@
<<workaround,echo=FALSE>>=
readPileup <- selectMethod("readPileup", "connection")
@
<<lkd1-2>>=
pup17 <- gzfile(system.file("pileups/n240_17.pup.gz", package="ind1KG"))
c17p.i <- readPileup(pup17, variant="indel")
levels(seqnames(c17p.i))
seqlevels(c17p.i) = gsub("17", "chr17", seqlevels(c17p.i))
c17p.i
@

Information on dbSNP SNP locations is available in 
<<li>>=
library(SNPlocs.Hsapiens.dbSNP.20090506)
c6 <- getSNPlocs("chr6")
head(c6, 5)
@
Resource-oriented exercise: After reviewing the `Setup' material in the next
section, estimate the frequencies of dbSNP-catalogued SNP per base pair in
intronic vs. exonic DNA on chromosome 17. Estimate frequencies stratified by
GC content (i.e., tabulate by 0, 1, 2 bases G or C in SNP).

Special design exercise (attempt only after all other exercises below have
been solved correctly): Consider alternative representations of the SNP
location/value data. The allele data could be represented as a single
\Rclass{DNAString}, and the location information as an \Rclass{IRanges}
instance. Assess the resource consumption and query resolution performance
of these representations in comparison to the existing data.frame. Consider
also a representation rooted in an RDBMS such as SQLite.

\section{Exercises}

\subsection{Check for coding indels and SNP}

\begin{itemize}
\item Setup:  According to data distributed by Shendure, NA19240 has two
copies of a triple insert in the coding region of CDRT4. Verify.
<<getg>>=
library(org.Hs.eg.db)
egid <- get("CDRT4", revmap(org.Hs.egSYMBOL))
kgid <- get(egid, org.Hs.egUCSCKG)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg18.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg18.knownGene
txloc <- transcripts(txdb)
cdrt4txloc <- txloc[elementMetadata(txloc)$tx_name %in% kgid]
subsetByOverlaps(c17p.i, cdrt4txloc)
@
On the basis of current annotation, none of these indels are in exons:
<<lkm>>=
cdrt4txid <- as.character(elementMetadata(cdrt4txloc)$tx_id)
cdrt4exloc <- exonsBy(txdb)[cdrt4txid]
subsetByOverlaps(c17p.i, cdrt4exloc)
@

\item Exercise: Write a function with parameters identifying a
\Rclass{GRanges} instance generated from a pileup, a gene symbol, a
variant type, and a specification of feature scope, that reports on the
variants present in the gene. Discuss infelicities of data structure in
the code segment above that should be ameliorated to simplify solution
of this exercise.
\item Exercise: Whether or not you solve the previous exercise,
characterize the variants in gene MYH3 for NA19240 in some concise way.
It is advisable to focus on SNP; show that there are coding SNP present
for this individual that are not identified in dbSNP.
\item Exercise: Introduce and justify a mechanism for filtering variant
reporting using quality information.

\end{itemize}

\subsection{Compare Solexa calls with Sanger sequencing}

\begin{itemize}
\item Setup: The 4 million phase II HapMap genotype calls for NA19240 are
available to you in package hmyriB36. A selection confined to chromosome 6
is available in the \Rpackage{ind1KG} package.
<<lkd>>=
library(ind1KG)
library(chopsticks)
data(yri240_6)
yri240_6$hm
head(yri240_6$supp, 10)
@
\item Exercise: Assess how many of the MAQ-based SNP calls using the
chromosome 6 pileup data are found at dbSNP locations. Is the distribution
of quality scores for variants identified at  dbSNP locations similar to
that of putatively de novo variants?
\end{itemize}

\subsection{\textit{de novo} SNPs in probes: effects on expression microarrays}

Exercise: Acquire the probe sequences for the Illumina Human v1 expression
array, perhaps by inverting the nuids found in the lumiHumanIDMapping
metadata package.
<<lklu>>=
library(lumiHumanIDMapping)
con <- lumiHumanIDMapping_dbconn()
dbListTables(con)
dbGetQuery(con, "select * from HumanWG6_V1 limit 5")
@
Determine the genomic positions of all probes interrogating genes on
chromosome 17 using \Rpackage{Biostrings} \Rfunction{matchPDict} against
the consensus genomic sequence for chromosome 17. Find all probes (on chr17)
corresponding to sequence for which NA19240 is found by MAQ to harbor a
variant (use the pileup noted previously). We will call these probes
``associated with sequence variants''. Compute expression Z-scores for
expression levels obtained for NA19240 using mean and standard deviation
based on log expression for the 89 individuals in hmyriB36 excluding NA19240.
Can the distribution of expression Z-scores for probes associated with
sequence variants be distinguished from the distribution of expression
Z-scores for probes not associated with sequence variants.

Extra credit extension: Some probes define sequence associated with splice
junctions. These 50mers will not align to consensus genomic sequence, but
will align once introns are removed. Can you identify probes associated with
splice junctions that are also associated with sequence variants? Does the
expression Z-score for splice-junction-associated probes differ in
distribution from the general distribution of expression Z-scores?

Additional exercises. Retrieve the SOLiD or 454-based short read archives
for NA19240 and check the consistency of conclusions obtained in prior
Solexa-based exercises with results based on these platforms.

\section{Session information}
<<ll>>=
sessionInfo()
@

\end{document}