---
title: "ASAFE (Ancestry Specific Allele Frequency Estimation)"
author: "Qian Zhang"
date: "`r Sys.Date()`"
output: 
  pdf_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{ASAFE (Ancestry Specific Allele Frequency Estimation)}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction: What ASAFE does

The ASAFE (Ancestry Specific Allele Frequency Estimation) package contains a collection of functions that can be used to carry out an EM algorithm to estimate ancestry-specific allele frequencies for a bi-allelic genetic marker (e.g. a SNP) from genotypes and ancestry pairs, when each diploid individual's genotype phase relative to ancestry pair is unknown. If there are three ancestries, a = 0, 1, or 2 (e.g. African, European, or Native American), ASAFE functions can be used to estimate three probabilities, ${\textrm{P(Allele 1} | \textrm{Ancestry a), a} \in\{0,1,2\}}$, at each marker. ASAFE function algorithm_1snp_wrapper() can be applied to a matrix of ancestries and a matrix of genotypes for 3-way admixed diploid individuals at bi-allelic markers. Ancestries at different markers need not be phased with respect to each other, and genotypes at different markers need not be phased with respect to each other. Denoting each marker's alleles 0 and 1, algorithm_1snp_wrapper() outputs estimates of three ancestry-specific allele 1 frequencies for each marker.

## ASAFE in the Context of a Larger Genetic Analysis Workflow

The following file in the ASAFE R package gives a diagram illustrating a genetic analysis workflow involving ASAFE: inst/ASAFE_Visual.pdf. An example script that performs two steps in the workflow, involving phasing of admixed genotypes with BEAGLE and then obtainment of local ancestry estimates and re-phased genotypes via RFMIX, is here: inst/scripts/bgl_then_rfmix.sh.

# Input Files

## Ancestry File

Your ancestry file should give admixed individuals' phased ancestries
as a rectangular matrix with the following rows, columns, and entries:

### Rows

* 1st row: Header line with column names 
* Subsequent rows: 1 row corresponds to 1 marker

### Columns

* 1st column: Marker ID
* 1 column per chromosome, with two consecutive columns per individual, corresponding to the individual's pair of homologous chromosomes. 
For example, if there are 3 admixed individuals, 
then the first row that gives column names might be

Marker  ADM1    ADM1    ADM2    ADM2    ADM3    ADM3	

* Columns should be separated by whitespace (i.e. spaces or tabs)

### Entries

* For an entry that is not in the Marker ID column,
an entry can take value 0, 1, or 2, 
which are arbitrary labels for three ancestries

To read your file into R, use a command like this: 

```R

ancestries <- read.table(file = "your_ancestry_file.txt", header = TRUE)

```

We have provided a subset of the full data set of simulated ancestries that was used in the ASAFE paper. [1] 
This data set is stored as a matrix adm_ancestries_test. 

## Genotype File

Your genotype file should give unphased admixed individuals' genotypes
as a rectangular matrix with the following rows, columns, and entries:

### Rows

* 1st row: Header line with column names 
* Subsequent rows: 1 row corresponds to 1 marker

### Columns 

* 1st column: Marker ID
* 1 column per person. For example, if there are 3 admixed individuals, then the first row that gives column names might be

ID ADM1 ADM2 ADM3
 
* Columns should be separated by whitespace (i.e. spaces or tabs)
* Individuals must be listed in the same order in the genotype file as in the ancestry file

### Entries

* For an entry that is not in the Marker ID column,
an entry can take value 0/0, 0/1, 1/0, or 1/1, where 0 and 1 are arbitrary labels for a bi-allelic SNP's two alleles
* A slash "/" indicates an unphased genotype, so 0/1 and 1/0 are the same unphased genotype

To read your file into R, use a command like this: 

```R

genotypes <- read.table(file = "your_genotypes_file.txt", header = TRUE)

```

We have provided a subset of the full data set of simulated genotypes that was used in the ASAFE paper. [1] This data set is stored as a matrix adm_genotypes_test. 

For details about data matrices that come with the ASAFE package, load the ASAFE package into R with the command "library(ASAFE)" and type "?" followed immediately by the name of the matrix, for example "?adm_ancestries_test".

# Functions

These are the functions you might want to use:

* algorithm_1snp()
* algorithm_1snp_wrapper() 
    
For information about these functions
(e.g. their inputs, outputs, and examples for usage),
see the function's man page by doing the following:
Load the ASAFE package into R with the command "library(ASAFE)" and type "?" followed immediately by the name of the function, for example "?algorithm_1snp".

If you are interested in other functions that are not listed above,
for instance functions that the above functions call,
see the .R file that implements the function you're interested in 
for comments describing the function.

# Reproducibility

Because Bioconductor requires that an R package pass "R CMD check" and "R CMD build" in less than 5 minutes each, and use at most 2 Gb of RAM to run code in this vignette, this Reproducibility section has been cut. Please see the version of the R package on GitHub (http://biostatqian.github.io/ASAFE/) for a vignette with a complete Reproducibility section.

# Try ASAFE Out on a Small Data Set

We demonstrate how ASAFE package functions can be used in an analysis, by generating ancestry-specific allele 1 frequency estimates for a small data set. Ancestry and genotype data for the SNPs are respectively contained in matrices adm_ancestries_test and adm_genotypes_test. 

```{r}
# Clear workspace and load ASAFE
rm(list=ls())
library(ASAFE)

# adm_ancestries_test is a matrix with
# Rows: Markers
# Columns: Marker ID, individuals' chromosomes' ancestries
# (e.g. ADM1, ADM1, ADM2, ADM2, and etc.)

# adm_genotypes_test is a matrix with
# Rows: Markers
# Columns: Marker ID, individuals' genotypes (a1/a2)
# (e.g. ADM1, ADM2, ADM3, and etc.)

# Making the rsID column row names

row.names(adm_ancestries_test) <- adm_ancestries_test[,1]
row.names(adm_genotypes_test) <- adm_genotypes_test[,1]

adm_ancestries_test <- adm_ancestries_test[,-1]
adm_genotypes_test <- adm_genotypes_test[,-1]

# alleles_list is a list of lists.
# Outer list elements correspond to SNPs.
# Inner list elements correspond to 250 people's alleles 
# with no delimiter separating alleles.
alleles_list <- apply(X = adm_genotypes_test, MARGIN = 1, 
                      FUN = strsplit, split = "/")

# Creates a matrix: 
# Alleles for chromosomes (ADM1, ADM1, ..., ADM250, ADM250) x (SNPs)
alleles_unlisted <- sapply(alleles_list, unlist)

# Change elements of the matrix to numeric
alleles <- apply(X = alleles_unlisted, MARGIN = 2, as.numeric)

# Apply the EM algorithm to each SNP to obtain
# ancestry-specific allele frequency estimates for all SNPs in
# matrices alleles and adm_ancestries_test.
#
# Columns correspond to markers. 
# Rows correspond to ancestries 0, 1, and then 2.
# Entries in rows 2 through 4
# give P(Allele 1 | Ancestry a), a = 0, 1, or 2 for a marker.

adm_estimates_test <- sapply(X = 1:ncol(alleles), FUN = algorithm_1snp_wrapper,
                        alleles = alleles, ancestries = adm_ancestries_test)

adm_estimates_test
```

# Citation

ASAFE: Ancestry-Specific Allele Frequency Estimation
Qian S. Zhang; Brian L. Browning; Sharon R. Browning
Bioinformatics 2016;
doi: 10.1093/bioinformatics/btw220