\documentclass[10pt]{article}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{gcat Package}

\usepackage{fullpage}
\usepackage{hyperref}

\title{Genotype Conditional Association Test Vignette}
\author{Wei Hao, Minsun Song, John D. Storey}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}

The \texttt{gcatest} package provides an implementation of the Genotype 
Conditional Association Test (GCAT)~\cite{song_testing_2015}.
GCAT is a test for genetic association that is powered by Logistic Factor 
Analysis (LFA)~\cite{hao_probabilistic_2016}. 
LFA is a method of modeling population structure in a genome wide association 
study. GCAT performs a test for association between each SNP and a trait (either
quantitative or binary).
We have shown that GCAT is robust to confounding from population structure.

\section{Sample usage}

We include a sample dataset with the package. \texttt{sim\_geno} is a simulated
genotype matrix. \texttt{sim\_trait} is a simulated trait. 
There are 10,000 SNPs and 1,000 individuals. The first five SNPs are associated
with the trait. 
This simulations were done under the Pritchard-Stephens-Donnelly model with
$K=3$, with Dirichlet parameter $\alpha=0.1$ and variance allotment in the trait
corresponding to $5\%$ genetic, $5\%$ environmental, and $90\%$ noise. 
This dataset is simulated under identical parameters as the PSD simulation in
Figure 2 of the paper~\cite{song_testing_2015}, except that we have adjusted the
size of the simulation to be appropriate for a small demo.

<<preamble,warning=F>>=
library(lfa)
library(gcatest)
dim(sim_geno)
length(sim_trait)
@

\subsection{\texttt{gcat}}

The first step of \texttt{gcat} is to estimate the logistic factors:

<<lfa>>=
LF <- lfa(sim_geno, 3)
dim(LF)
@

Then, we call the \texttt{gcat} function, which returns a vector of p-values:

<<gcat>>=
gcat_p <- gcat(sim_geno, LF, sim_trait)
@

We can look at the p-values for the associated SNPs:

<<gcat2>>=
gcat_p[1:5]
@

And also plot the histogram of the unassociated SNPs:

<<fig1, fig.height=3>>=
library(ggplot2)
dat <- data.frame(p = gcat_p[6:10000])
ggplot(dat, aes(p, after_stat(density))) + geom_histogram(binwidth=1/20) + theme_bw()
@

\section{Data Input}

The \texttt{genio} package provides the function \texttt{read\_plink} for 
parsing PLINK binary genotypes (extension: \texttt{.bed}) into an R object of
the format needed for the \texttt{gcat} function.
A \texttt{BEDMatrix} object (from the eponymous function and package) is also 
supported, and can result in reduced memory usage (at a small runtime penalty).

\bibliographystyle{plain}
\bibliography{gcatest}

\end{document}