---
title: "`FeatSeekR` user guide"
author: 
-   name: Tuemay Capraz
    affiliation: European Molecular Biology Laboratory, Heidelberg
    email: tuemay.capraz@embl.de
package: FeatSeekR
date: "`r Sys.Date()`"
output:  
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{`FeatSeekR` user guide}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignettePackage{FeatSeekR-vignette}
    %\VignetteEncoding{UTF-8}
---


```{r setup, message=FALSE}
library(FeatSeekR)
library(DmelSGI)
library(pheatmap)
library(SummarizedExperiment)
```


# Introduction

A fundamental step in many analyses of high-dimensional data is dimension 
reduction. Feature selection is one approach to dimension reduction whose 
strengths include interpretability, conceptual simplicity, transferability 
and modularity.
Here, we introduce the `FeatSeekR` algorithm, which selects features based on 
the consistency of their signal across replicates and their non-redundancy.
It takes a two-dimensional array (features x samples) of replicated measurements
and returns a `r Biocpkg("SummarizedExperiment")` object storing the selected 
features ranked by reproducibility.

# Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("FeatSeekR")
```

# Feature selection on simulated data

Here we simulate a data set with features generated by orthogonal latent 
factors. Features derived from the same latent factor are highly redundant and 
form distinct clusters. The function `simData` simulates 10 redundant 
features per latent factor. Replicates are generated by adding independent 
Gaussian noise.
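
To make this generative model concrete, the following base R sketch builds a 
small data set of the same shape by hand. It is only an illustration of the 
idea, not the implementation of `simData`; all names, sizes and the noise 
level are hypothetical choices made for this example.

```{r sketch_latent_simulation}
set.seed(1)
n_conditions <- 100   # hypothetical sizes, smaller than in the example below
n_factors    <- 3
n_per_factor <- 10
n_replicates <- 2

# orthogonal latent factors (conditions x factors), rescaled to unit scale
toy_latent <- qr.Q(qr(matrix(rnorm(n_conditions * n_factors),
                            nrow = n_conditions))) * sqrt(n_conditions)

# each feature copies one latent factor, so features that share a factor
# are redundant
factor_of_feature <- rep(seq_len(n_factors), each = n_per_factor)
toy_signal <- toy_latent[, factor_of_feature]   # conditions x features

# replicates: the same signal plus independent Gaussian noise
toy_reps <- lapply(seq_len(n_replicates), function(i) {
    toy_signal + matrix(rnorm(length(toy_signal), sd = 0.5),
                        nrow = n_conditions)
})

# stack the replicates and transpose to features x samples
toy <- t(do.call(rbind, toy_reps))
toy_conds <- rep(seq_len(n_conditions), n_replicates)
dim(toy)
```

Features generated from the same latent factor are strongly correlated in 
`toy`, while features from different factors are close to uncorrelated.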

```{r simulate data}
set.seed(111)
# simulate data with 500 conditions, 3 replicates and 5 latent factors 
conditions <- 500
latent_factors <- 5
replicates <- 3

# simData generates 10 features per latent_factor, so choosing latent_factors=5
# will generate 50 features.
# we simulate samples from 500 independent conditions per replicate. setting 
# conditions=500 and replicates=3 will generate 1500 samples, leading to 
# final data dimensions of 50 features x 1500 samples
sim <- simData(conditions=conditions, n_latent_factors=latent_factors,
                replicates=replicates)

# show that simulated data dimensions are indeed 50 x 1500 
dim(assay(sim, "data"))

# calculate the feature correlation across all samples
sim_data <- t(assay(sim, "data"))
sim_cor <- cor(sim_data, use = "pairwise.complete.obs")

# plot a heatmap of the features and color features according to their 
# generating latent factors
anno <- data.frame(Latent_factor = as.factor(rep(1:5, each=10)))
rownames(anno) <- dimnames(sim)[[1]]
colors        <- c("red", "blue", "darkorange", "darkgreen", "black")
names(colors) <- c("1", "2", "3", "4", "5")
anno_colors <- list(Latent_factor = colors)
# symmetric color scale around zero
max_abs_cor <- max(abs(sim_cor))
pheatmap(sim_cor, treeheight_row = 0 , treeheight_col = 0, 
        show_rownames = FALSE, show_colnames = FALSE,
        breaks = seq(-max_abs_cor, max_abs_cor, length.out = 100), 
        cellwidth = 6, cellheight = 6, annotation_col = anno, 
        annotation_colors = anno_colors, fontsize = 8)
```

The correlation matrix of the data visualizes feature redundancy. As intended 
by the simulation, the features derived from the same latent factor cluster 
together. This suggests that the true dimension is indeed lower than the 
number of features.

We now run `FeatSeekR` to rank the features based on their uniqueness and
reproducibility.

```{r plot top 5}
# select the top 5 features
res <- FeatSeek(sim, max_features=5)

# plot a heatmap of the top 5 selected features 
plotSelectedFeatures(res)
```

We again visualize the selected features by plotting their correlation matrix.
As expected, the top 5 selected features each come from a different latent 
factor and are only weakly correlated. This suggests that we obtained a 
compressed version of the data while keeping most of the information it 
contains.

# Selecting image features from the `DmelSGI` package

Here we use `FeatSeekR` to rapidly identify unique features with reproducible 
signal between measurements in an image dataset from the `r Biocpkg("DmelSGI")` 
package. The authors of `r Biocpkg("DmelSGI")` performed combinatorial gene 
knock-downs using siRNA, followed by imaging of the cells. The resulting 
images were segmented and features were extracted using the 
`r Biocpkg("EBImage")` package. Here, conditions refer to different gene 
knock-downs, features to the extracted image features and replicates to 
repeated measurements of the individual conditions.


```{r, load_data}
# load data from DmelSGI package
data("subSampleForStabilitySelection", package="DmelSGI")
data <- subSampleForStabilitySelection$D
# reorder dimensions to conditions x features x replicates
data <- aperm(data, c(1, 3, 2))
# set feature names
dimnames(data)[[2]] <- subSampleForStabilitySelection$phenotype
# create condition factor (one label per row after binding) and stack the 
# two replicates row-wise into a samples x features matrix
conds <- rep(seq_len(dim(data)[1]), 2)
data <- rbind(data[, , 1], data[, , 2])
# show final data dimensions
dim(data)
```
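
The reshaping above can be easier to follow on a tiny array. The sketch below 
uses a hypothetical 3 conditions x 2 features x 2 replicates array 
(`toy_arr`, `toy_mat` and `arr_conds` are names made up for this example) 
and shows how the condition vector lines up with the rows of the stacked 
matrix.

```{r sketch_bind_replicates}
# hypothetical toy array: 3 conditions x 2 features x 2 replicates
toy_arr <- array(seq_len(12), dim = c(3, 2, 2),
                dimnames = list(NULL, c("featA", "featB"), NULL))

# stack replicate 2 below replicate 1: rows are now samples
toy_mat <- rbind(toy_arr[, , 1], toy_arr[, , 2])

# condition label of each row; the labels repeat once per replicate
arr_conds <- rep(seq_len(dim(toy_arr)[1]), 2)

dim(toy_mat)
arr_conds

# the two rows belonging to condition 1 are its two replicates
toy_mat[arr_conds == 1, ]
```

Rows 1 and 4 of `toy_mat` carry the same condition label, which is how the 
replicate structure is preserved after the array has been flattened.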

The input data has 3000 conditions, 162 features and 2 replicates, so the 
matrix obtained by binding the replicates has 6000 samples (rows). Again, we 
plot the correlation matrix of the data to explore the structure of the 
features.

```{r plot data}
# calculate the correlation matrix of the first 50 features across all samples
cor_mat <- cor(data[, 1:50, drop=FALSE])

# plot correlation matrix, omitting feature names
pheatmap(cor_mat, show_rownames=FALSE, show_colnames=FALSE,
    treeheight_row=0, treeheight_col=0)
```

Analogous to the idealized simulated example, the extracted features form 
groups with high correlation within each group and lower correlation between 
groups. This supports the idea that the effective dimension of the data matrix 
is substantially lower than the number of features and that feature selection 
is a plausible approach to these data. We apply `FeatSeek` to identify unique 
features with high replicate consistency.

```{r select_features}
# run FeatSeekR and rank up to 30 features based on their replicate 
# reproducibility and uniqueness
max_features <- 30
res <- FeatSeek(t(data), 
        conditions=conds, 
        max_features=max_features,
        verbose=TRUE)
```
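
For intuition about what replicate consistency means, one simple proxy is the 
correlation of a feature between two replicates. The sketch below applies this 
proxy to hypothetical toy data; it is not the criterion that `FeatSeek` uses 
internally, and all names and noise levels are assumptions made for 
illustration.

```{r sketch_replicate_consistency}
set.seed(2)
n_cond <- 200
true_signal <- rnorm(n_cond)

# two hypothetical features measured in two replicates each:
# one with little measurement noise, one dominated by noise
toy_rep1 <- cbind(stable = true_signal + rnorm(n_cond, sd = 0.2),
                noisy  = true_signal + rnorm(n_cond, sd = 3))
toy_rep2 <- cbind(stable = true_signal + rnorm(n_cond, sd = 0.2),
                noisy  = true_signal + rnorm(n_cond, sd = 3))

# per-feature correlation between replicate 1 and replicate 2
consistency <- vapply(colnames(toy_rep1),
                    function(f) cor(toy_rep1[, f], toy_rep2[, f]),
                    numeric(1))
consistency
```

The `stable` feature is highly correlated between replicates, whereas the 
`noisy` feature is not, so a reproducibility-based ranking would prefer the 
former.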

To determine the ideal number of selected features, we can look at the 
fraction of explained variance per additionally selected feature.
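
One common way to define such a fraction is to fit all features by least 
squares on the selected ones and compare the residual variance with the total 
variance. The sketch below does this on hypothetical toy data; 
`frac_explained` is a helper written for this illustration and is an 
assumption about the general idea, not the computation performed by the 
package.

```{r sketch_explained_variance}
set.seed(3)
n_obs <- 300
toy_z <- matrix(rnorm(n_obs * 2), ncol = 2)   # two hypothetical latent factors

# four features: two noisy copies of each latent factor
toy_X <- cbind(toy_z[, 1], toy_z[, 1], toy_z[, 2], toy_z[, 2]) +
    matrix(rnorm(n_obs * 4, sd = 0.1), ncol = 4)
toy_X <- scale(toy_X, center = TRUE, scale = FALSE)

# fraction of the total variance of X reproduced by a least-squares fit
# of every column of X on the selected columns
frac_explained <- function(X, selected) {
    S <- X[, selected, drop = FALSE]
    fitted <- qr.fitted(qr(S), X)
    1 - sum((X - fitted)^2) / sum(X^2)
}

frac_explained(toy_X, 1)         # one feature covers about one latent factor
frac_explained(toy_X, c(1, 3))   # one feature per factor covers almost all
```

Adding a feature from a latent factor that is already represented would raise 
this fraction only marginally, which is why the curve flattens once every 
group of redundant features has a representative.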

```{r inspect_selection}
# plotVarianceExplained plots the fraction of explained variance per 
# additionally selected feature, ranked by FeatSeek.
plotVarianceExplained(res)
```

The increase in explained variance seems to flatten out at around 70%.
We therefore select the smallest number of features that explains more than 
70% of the total variance and plot their correlation matrix.

```{r plot selection}
# get number of features which explain at least 70% of the total variance
n_feat <- min(which(rowData(res)$explained_variance > 0.7))

# plot the top n_feat features based on the ranking by FeatSeek
plotSelectedFeatures(res, n_features=n_feat)
```

The low correlation between the top selected features confirms their low 
redundancy. Using `FeatSeekR` we were able to reduce the dimension of the data 
to 17 features, while still being able to explain 70% of the variance of the 
original data.
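
The threshold rule used above can be tried on a hypothetical vector of 
cumulative explained variance. Note that `min(which(...))` returns `Inf` with 
a warning when no entry exceeds the threshold; the helper `pick_n` below 
(a name introduced only for this sketch) returns `NA` in that case instead.

```{r sketch_threshold}
# hypothetical cumulative explained variance after selecting 1, 2, ... features
toy_explained <- c(0.25, 0.45, 0.62, 0.71, 0.78, 0.80)

# smallest number of features whose explained variance exceeds the threshold,
# or NA if the threshold is never reached
pick_n <- function(explained, threshold) {
    hits <- which(explained > threshold)
    if (length(hits) == 0) NA_integer_ else min(hits)
}

pick_n(toy_explained, 0.7)   # 4
pick_n(toy_explained, 0.9)   # NA
```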


# Session Info

```{r}
sessionInfo()
```