---
title: "Introduction"
author: "Dany Mukesha"
date: "`r Sys.Date()`"
abstract:
    Genetic algorithms (GAs) are optimization techniques inspired by 
    the process of natural selection and genetics. They operate by evolving 
    a population of candidate solutions over successive generations, 
    with each individual representing a potential solution to the optimization
    problem at hand. Through the application of genetic operators 
    such as selection, crossover, and mutation, genetic algorithms 
    iteratively improve the population, eventually converging towards 
    optimal or near-optimal solutions.
    
    In the field of genomics, where data sets are often large, complex, 
    and high-dimensional, genetic algorithms offer a good approach 
    for addressing optimization challenges such as feature selection, 
    parameter tuning, and model optimization. By harnessing the power 
    of evolutionary principles, genetic algorithms can effectively explore 
    the solution space, identify informative features, and optimize 
    model parameters, leading to improved accuracy and interpretability 
    in genomic data analysis.
    
    The BioGA package extends the capabilities of genetic algorithms 
    to the realm of genomic data analysis, providing a suite of functions 
    optimized for handling high throughput genomic data. Implemented in C++ 
    for enhanced performance, BioGA offers efficient algorithms for tasks 
    such as feature selection, classification, clustering, and more. 
    By integrating seamlessly with the Bioconductor ecosystem, 
    BioGA empowers researchers and analysts to leverage the power 
    of genetic algorithms within their genomics workflows, facilitating 
    the discovery of biological insights from large-scale genomic data sets.
output: 
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

<br>

# Getting Started

## Installation

To install this package, start R (version "4.4") and enter:

```{r installation , eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BioGA")
```

You can also install the package directly from GitHub 
using the `devtools` package:

```{r installation_from_github , eval=FALSE}
devtools::install_github("danymukesha/BioGA")
```

In this vignette, we illustrate the usage of BioGA for genetic algorithm 
optimization in the context of high throughput genomic data analysis. 
We showcase its interoperability with Bioconductor classes, demonstrating
how genetic algorithm optimization can be seamlessly integrated into 
existing genomics pipelines for improved analysis and interpretation.

The BioGA package provides a  set of functions for genetic algorithm 
optimization tailored for analyzing high throughput genomic data. 
This vignette demonstrates the usage of BioGA in the context of selecting
the best combination of genes for predicting a certain trait, 
such as disease susceptibility.

## Overview

Genomic data refers to the genetic information stored in an organism's DNA. 
It includes the sequence of nucleotides (adenine, thymine, cytosine, 
and guanine) that make up the DNA molecules. Genomic data can provide valuable 
insights into various biological processes, such as gene expression, 
genetic variation, and evolutionary relationships.

Genomic data in this context could consist of gene expression profiles 
measured across different individuals (e.g., patients).

- Each row in the genomic_data matrix represents a gene, and each column 
represents a patient sample.

- The values in the matrix represent the expression levels of each gene in 
each patient sample.

Here's an example of genomic data:

```
      Sample 1   Sample 2   Sample 3   Sample 4
Gene1    0.1        0.2        0.3        0.4
Gene2    1.2        1.3        1.4        1.5
Gene3    2.3        2.2        2.1        2.0
```

In this example, each row represents a gene (or genomic feature), and 
each column represents a sample. The values in the matrix represent 
some measurement of gene expression, such as mRNA levels or protein abundance,
in each sample.

For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression 
level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3 
indicates the expression level of Gene3 in Sample 2.

Genomic data can be used in various analyses, including genetic association 
studies, gene expression analysis, and comparative genomics. In the context 
of the `evaluate_fitness_cpp` function, genomic data is used to calculate 
fitness scores for individuals in a population, typically in the context 
of genetic algorithm optimization.

The population represents a set of candidate combinations of genes that 
could be predictive of the trait.
Each individual in the population is represented by a binary vector indicating
the presence or absence of each gene.
For example, an individual in the population might be represented as 
[1, 0, 1],
indicating the presence of Gene1 and Gene3 but the absence of Gene2.
The population undergoes genetic algorithm operations such as selection, 
crossover, mutation, and replacement to evolve towards individuals with higher
predictive power for the trait.

## Example Scenario 

Consider an example scenario of using genetic algorithm optimization to select
the best combination of genes for predicting a certain trait, such as disease
susceptibility.

```{r}
# Load necessary packages
library(BioGA)
library(SummarizedExperiment)

# Define parameters
num_genes <- 1000
num_samples <- 10

# Define parameters for genetic algorithm
population_size <- 100
generations <- 20
mutation_rate <- 0.1

# Generate example genomic data using SummarizedExperiment
counts <- matrix(rpois(num_genes * num_samples, lambda = 10),
    nrow = num_genes
)
rownames(counts) <- paste0("Gene", 1:num_genes)
colnames(counts) <- paste0("Sample", 1:num_samples)

# Create SummarizedExperiment object
se <-
  SummarizedExperiment::SummarizedExperiment(assays = list(counts = counts))

# Convert SummarizedExperiment to matrix for compatibility with BioGA package
genomic_data <- assay(se)
```

In this example, `counts` is a matrix representing the counts 
of gene expression levels across different samples. Each row corresponds 
to a gene, and each column corresponds to a sample. We use the 
`SummarizedExperiment` class to store this data, which is 
common Bioconductor class for representing rectangular feature x sample data,
such as RNAseq count matrices or microarray data.

```{r}
head(genomic_data)
```

## Initialization

```{r}
# Initialize population (select the number of canditate you wish `population`)
population <- BioGA::initialize_population_cpp(genomic_data,
    population_size = 5
)
```

The population represents a set of candidate combinations of genes that 
could be predictive of the trait. Each individual in the population is 
represented by a binary vector indicating the presence or absence of 
each gene.
For example, an individual in the population might be represented 
as [1, 0, 1], indicating the presence of Gene1 and Gene3 but the absence
of Gene2. 
The population undergoes genetic algorithm operations such as selection, 
crossover, mutation, and replacement to evolve towards individuals 
with higher predictive power for the trait.

## Genetic Algorithm Optimization

```{r}
# Initialize fitness history
fitness_history <- list()

# Initialize time progress
start_time <- Sys.time()

# Run genetic algorithm optimization
generation <- 0
while (TRUE) {
    generation <- generation + 1

    # Evaluate fitness
    fitness <- BioGA::evaluate_fitness_cpp(genomic_data, population)
    fitness_history[[generation]] <- fitness

    # Check termination condition
    if (generation == generations) { # defined number of generations
        break
    }

    # Selection
    selected_parents <- BioGA::selection_cpp(population,
        fitness,
        num_parents = 2
    )

    # Crossover and Mutation
    offspring <- BioGA::crossover_cpp(selected_parents, offspring_size = 2)
    # (no mutation in this example)
    mutated_offspring <- BioGA::mutation_cpp(offspring, mutation_rate = 0)

    # Replacement
    population <- BioGA::replacement_cpp(population, mutated_offspring,
        num_to_replace = 1
    )

    # Calculate time progress
    elapsed_time <- difftime(Sys.time(), start_time, units = "secs")

    # Print time progress
    cat(
        "\rGeneration:", generation, "- Elapsed Time:",
        format(elapsed_time, units = "secs"), "     "
    )
}
```

## Fitness Calculation

The fitness calculation described in the provided code calculates a measure 
of dissimilarity between the gene expression profiles of individuals 
in the population and the genomic data. This measure of dissimilarity, 
or "fitness", quantifies how well the gene expression profile of an individual
matches the genomic data.

Mathematically, the fitness calculation can be represented as follows:

Let:

- \( g_{ijk} \) be the gene expression level of gene \( j \) 
in individual \( i \) and sample \( k \) from the genomic data.

- \( p_{ij} \) be the gene expression level of gene \( j \) 
in individual \( i \) from the population.

- \( N \) be the number of individuals in the population.

- \( G \) be the number of genes.

- \( S \) be the number of samples.

Then, the fitness \( F_i \) for individual \( i \) in the population can be 
calculated as the sum of squared differences between the gene expression 
levels of individual \( i \) and the corresponding gene expression levels 
in the genomic data, across all genes and samples:
$$
 F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{ijk} - p_{ij})^2 
$$

This fitness calculation aims to minimize the overall dissimilarity between 
the gene expression profiles of individuals in the population and 
the genomic data. Individuals with lower fitness scores are considered to have
gene expression profiles that are more similar to the genomic data and 
are therefore more likely to be selected for further optimization 
in the genetic algorithm.

```{r}
# Plot fitness change over generations
BioGA::plot_fitness_history(fitness_history)
```

This showcases the integration of genetic algorithms with genomic 
data analysis and highlights the potential of genetic algorithms
for feature selection in genomics.

Here's how BioGA could work in the context of high throughput genomic data 
analysis:

1. **Problem Definition**: BioGA starts with a clear definition of the 
problem to be solved. This could include tasks such as identifying genetic 
markers associated with a particular disease, optimizing gene expression 
patterns, or clustering genomic data to identify patterns or groupings.

2. **Representation**: Genomic data would need to be appropriately represented
for use within the genetic algorithm framework. This might involve encoding 
the data in a suitable format, such as binary strings representing genes 
or chromosomes.

3. **Fitness Evaluation**: BioGA would define a fitness function that 
evaluates how well a particular solution performs with respect to the 
problem being addressed. In the context of genomic data analysis, 
this could involve measures such as classification accuracy, correlation
with clinical outcomes, or fitness to a particular model.

4. **Initialization**: The algorithm would initialize a population of 
candidate solutions, typically randomly or using some heuristic method. 
Each solution in the population represents a potential solution 
to the problem at hand.

5. **Genetic Operations**: BioGA would apply genetic operators such as 
selection, crossover, and mutation to evolve the population over successive
generations. Selection identifies individuals with higher fitness to serve
as parents for the next generation. Crossover combines genetic material
from two parent solutions to produce offspring. Mutation introduces
random changes to the offspring to maintain genetic diversity.

6. **Termination Criteria**: The algorithm would continue iterating through
generations until a termination criterion is met. This could be a maximum
number of generations, reaching a satisfactory solution, or convergence
of the population.

7. **Result Analysis**: Once the algorithm terminates, BioGA would analyze
the final population to identify the best solution(s) found. This could
involve further validation or interpretation of the results in the context
of the original problem.

Other applications of BioGA in genomic data analysis could include genome-wide
association studies (GWAS), gene expression analysis, pathway analysis,
and predictive modeling for personalized medicine, among others.
By leveraging genetic algorithms, BioGA offers a powerful approach
to exploring complex genomic datasets and identifying meaningful patterns
and associations.


<details>

<summary>**Session Info**</summary>

```{r sessioninfo}
sessioninfo::session_info()
```

</details>