---
title: "single-sperm-co-analysis"
output: BiocStyle::html_document
bibliography: ref.bib
vignette: >
  %\VignetteIndexEntry{single-sperm-co-analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
suppressPackageStartupMessages({
  library(comapr)
  library(SummarizedExperiment)
  })
```

# Introduction

In the previous vignette, we demonstrated how to calculate genetic distances
using genotyped markers from a group of samples.

[Genetic distance calculate from genotype shifts of markers](getStarted.html)

In this document, we will focus on building individualized genetic maps from
output files of `sscocaller` which is available at 
[here](https://gitlab.svi.edu.au/biocellgen-public/sscocaller).

# Locate file path

The `comapr` package includes a list of toy example output files from the `sscocaller`
command line tool. The follow code will get the file path that we will use later.

```{r}
demo_path <-paste0(system.file("extdata",package = "comapr"),"/")
```

We can see that we have two samples (individual donors) and each of them has 
haplotype states inferred for chr1 to chr5.

```{r}
list.files(demo_path)
```

# File information

- *.mtx
  - sparse matrix with columns corresponding to the list of sperm cell barcodes 
  and rows corresponding to the list of SNP positions in VCF file
  - {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts
  - {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts
  - {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state)


- {sample}_chr1_snpAnnot.txt, the SNP positions and allele
- {sample}_chr1_SegInfo.txt, statistics of viterbi state segments in text file format.
   It contains consecutive viterbi states for each chromosome with statistics including, 
   starting SNP position, ending SNP position, the number of SNPs supporting the segment, 
   the log likelihood ratio of the viterbi segment and the inferred hidden state.

# Diagnostic functions


`comapr` provides quality-check functions for examining SNP coverage per chr and 
per cell, chromosome segregation pattern checks, and summary statics for 
filtering low confidence crossovers.

## perCellChrQC

```{r}
pcQC <- perCellChrQC(sampleName="s1",chroms=c("chr1"),
                     path=paste0(demo_path,"/"),
                     barcodeFile=NULL)
```

# Input parsing

Input-parsing functions are implemented in `comapr` to construct 
`RangedSummarizedExpriment` object that parses files generated 
from `sscocaller` and filter out low-confidence COs at the same time. For the
demo dataset, these filters do not have any effects:

```
minSNP = 0, 
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1
```

# Construct `RangedSummarizedExpriment` object 

We first construct `RangedSummarizedExpriment` object from parsing output files
from `sscocaller` and filter out low-confidence crossovers.

```{r}
s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"),
                             path=demo_path,barcodeFile=NULL,
                             minSNP = 0,
                             minlogllRatio = 50,
                             bpDist = 100,
                             maxRawCO=10,
                             minCellSNP = 1)

s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"),
                             path=demo_path,
                             barcodeFile=NULL,
                             minSNP = 0,
                             minlogllRatio = 50,
                             bpDist = 100,
                             maxRawCO=10,
                             minCellSNP = 1)
```


```{r}
s1_rse_state
```

The Viterbi states for SNP markers are stored in the "assay" slot:

```{r}
assay(s1_rse_state)
```

The `rowRanges` contains the SNP's positions:

```{r}
rowRanges(s1_rse_state)
```

## Formate sample group factor

We have read in the Viterbi states for cells from two samples: s1 and s2. We now
combine them into one data object for downstream analysis.

## Add sample group factor

```{r}
colData(s1_rse_state)

colData(s1_rse_state)$sampleGroup <- "s1"

colData(s2_rse_state)$sampleGroup <- "s2"

```

## Combine two groups

We now call `combineHapState` function to combine sample s1 and sample s2
into one `RangedSummarizedExperiment` object.


```{r}
twoSample <- combineHapState(s1_rse_state,
                             s2_rse_state)
```

Now the `assay` data slot contains the Viterbi states across SNP positions for
the combined samples.

```{r}
twoSample <- combineHapState(s1_rse_state,s2_rse_state)
```


Now the `twoSample` object contains the cells from both samples with `assay` slot
contains the Viterbi states and `rowRanges` contains position annotaitons for
the list SNP markers.

```{r}

assay(twoSample)
```

# Count crossovers 

The `countCOs` function can then find out the crossover intervals according to 
the Viterbi states for each cell and the number of crossovers per cell is then
calculated.

# Count crossovers for SNP intervals

`countCOs` function will find the crossover intervals for each samples and 
attribute the observed crossovers from each sample to the corresponding intervals.

```{r}
cocounts <- countCOs(twoSample)
```

The `rowRanges` from the resulting data object now denotes the crossover interval
and the `assay` slot now contains the number of crossovers in each cell.


Now `rowRanges` contains the intervals

```{r}
rowRanges(cocounts)
```

The colData slot still contains the annotations of each cell.

```{r}
colData(cocounts)
```

# Calculate genetic distances

The genetic distances can be calculated by using mapping functions such as the
Kosambi or Haldane \cite{} and `assay` slot contains the number of crossovers 
found in each sample across these intervals.

```{r}
assay(cocounts)
```


# Calculate genetic distances

`calGeneticDist` function will convert the raw crossover frequencies to genetic
distances via selected mapping function (ie. kosambi or haldane).

```{r}
dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k")
dist_gr
```

The genetic distances for each interval are stored in `rowData`.

```{r}
rowData(dist_gr)
```

The above genetic distances have been calculated using all samples. We can also
specify the group factor so that we can calculate genetic distances for different
sample groups:

```{r}
## sampleGroup is a column in the colData slot
dist_gr <- calGeneticDist(co_count = cocounts,
                          group_by  = "sampleGroup",
                          mapping_fun = "k")

```

Now the group/Sample specific distances are stored in `rowData`

```{r}
rowData(dist_gr)$kosambi
```

# Plot whole genome genetic distances

We construct a `GRange` object from the `dist_gr` first.

```{r}
p_gr <- granges(dist_gr)
mcols(p_gr) <- rowData(dist_gr)$kosambi


```

We can plot whole-genome genetic distances

```{r}
plotWholeGenome(p_gr)

```

We can also do per chromosome plot

```{r}
plotGeneticDist(p_gr,chr = "chr1",cumulative = TRUE)

```




# Group differences

`bootstrapDist` function generates the sampling distribution of the difference
in total genetic distances for the two groups under comparisons.

```{r}
set.seed(100)
bootsDiff <- bootstrapDist(co_gr = cocounts,B=10,
                           group_by = "sampleGroup")
```

```{r}
hist(bootsDiff)
```

From the `bootsDiff` data object, we can find a 95% confidence interval to test
whether the difference is statistically significant. 

```{r}
quantile(bootsDiff,c(0.025,0.975),na.rm =TRUE)
```


An alternative re-sampling method, `permuteDist`, can be applied to generate a
null distribution for the group difference by reshuffling the group labels across
the samples.

```{r}
set.seed(100)
perms <- permuteDist(cocounts,B=1000,group_by = "sampleGroup")
```

 A p-value can be calculated using the `statmod::permp` function [@Phipson2010-xi].

```{r}
statmod::permp(x = sum(perms$permutes>=perms$observed_diff,
                       na.rm = TRUE),
               nperm = sum(!is.na(perms$permutes)),
               n1 = perms$nSample[1],
               n2 = perms$nSample[2])

```

# Session info

```{r}
sessionInfo()
```