\documentclass[10pt]{article}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{lfa Package}

\usepackage{fullpage}
\usepackage{hyperref}

\title{Logistic Factor Analysis Vignette}
\author{Wei Hao, Minsun Song, John D. Storey}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}

Logistic Factor Analysis (LFA)~\cite{hao_probabilistic_2016}. Briefly, LFA fits
a latent variable model on categorical (i.e. SNP genotypes coded as 0, 1, and 2)
data by modeling the logit transformed binomial parameters in terms of latent
variables. The resulting ``logistic factors'' are analagous to principal
components, but fit into a convenient likelihood based model. As a result, the
logistic factors can power a number of other analyses.

\section{Sample usage}

We include a sample real dataset with the package as the variable 
\texttt{hgdp\_subset}---a small subset of the HGDP genotypes. The row 
names are the rsids for the SNPs and the column names are coarse 
geographical labels for the individuals.

<<preamble>>=
library(lfa)
dim(hgdp_subset)
@

\subsection{\texttt{lfa}}

The \texttt{lfa} function has two required arguments. The first is the 
genotype matrix, and the second is the number of logistic factors 
including the intercept.

<<lfa>>=
LF <- lfa(hgdp_subset, 4)
dim(LF)
head(LF)
@

We can plot the first two logistic factors and color by geographical 
information:

<<lfa_plot, fig.width=5, fig.height=4.5>>=
dat <- data.frame(LF[,1], LF[,2], colnames(hgdp_subset))
colnames(dat) = c("LF1", "LF2", "geo")
library(ggplot2)
ggplot(dat, aes(LF1, LF2, color=geo)) + geom_point() + theme_bw() +
    coord_fixed(ratio=(max(dat[,1])-min(dat[,1]))/(max(dat[,2])-min(dat[,2])))
@

One aspect of \texttt{lfa} is that the return value is a matrix of 
logistic factors, thus, an important part of subsequent analysis is to 
keep your matrix of logistic factors to pass as an argument.

\subsection{\texttt{af}}

Given a genotype matrix and logistic factors, the \texttt{af} function 
computes the individual-specific allele frequencies

<<af1>>=
allele_freqs <- af(hgdp_subset, LF)
allele_freqs[1:5, 1:5]
@

Since the calculation is independent at each locus, you can pass a 
subset of the genotype matrix as an argument if you aren't interested 
in all the SNPs. 

<<af2>>=
subset <- af(hgdp_subset[15:25,], LF)
subset[1:5,1:5]
@

Given the allele frequencies, you can do some other interesting 
calculations---for example, compute the log-likelihood for each SNP.

<<af3>>=
ll <- function(snp, af){
    -sum(snp*log(af) + (2-snp)*log(1-af))
}
log_lik <- sapply(1:nrow(hgdp_subset), function(i) {ll(hgdp_subset[i,], 
    allele_freqs[i,])})
which(max(log_lik) == log_lik)
@

\section{Data Input}

The best way to load genotypes is by using the function \texttt{read\_plink}, 
from the \texttt{genio} package, which assumes that you have binary PLINK 
formatted genotypes. The binary PLINK format uses files: a \texttt{.bed} for 
the genotypes, a \texttt{.bim} for the genotype information, and a 
\texttt{.fam} for the individuals information.
\texttt{read\_plink} takes as an argument the prefix for your three files.
A \texttt{BEDMatrix} object (from the eponymous function and package) is also 
supported, and can result in reduced memory usage (at a small runtime penalty).

\bibliographystyle{plain}
\bibliography{lfa}

\end{document}