---
title: "Introducing the raer package"
author: 
  - name: Kent Riemondy
    affiliation: University of Colorado School of Medicine
  - name: Kristen Wells-Wrasman
    affiliation: University of Colorado School of Medicine
  - name: Jay Hesselberth
    affiliation: University of Colorado School of Medicine
date: '`r Sys.Date()`'
output:
  BiocStyle::html_document
package: raer 
bibliography: ref.bib  
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}    
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

# Introduction

The *raer* (RNA Adenosine editing in R) package provides tools to characterize 
A-to-I editing in single cell and bulk RNA-sequencing datasets. Both novel and 
known editing sites can be detected and quantified beginning with BAM alignment 
files. At it's core the *raer* package uses the pileup routines from the HTSlib
C library (@Bonfield2021-yo) to identify candidate RNA editing sites, and 
leverages the annotation resources in the Bioconductor ecosystem to further 
characterize and identify high-confidence RNA editing sites.  

Here we demonstrate how to use the *raer* package to a) quantify RNA editing 
sites in droplet scRNA-seq dataset,  b) identify editing sites with condition 
specific editing in bulk RNA-seq data, and c) predict novel editing sites from 
bulk RNA-seq. 

# Installation

The `raer` package can be installed from Bioconductor using `BiocManager`. 

```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("raer")
```

Alternatively `raer` can be installed from github using
`BiocManager::install("rnabioco/raer")`.

# Characterizing RNA editing sites in scRNA-seq data

Here we will use the *raer* package to examine RNA editing in droplet-based 
single cell RNA-seq data. `pileup_cells()` enables quantification of edited and 
non-edited bases at specified sites from scRNA-seq data. 

For this example we will examine a scRNA-seq dataset from human PBMC cells 
provided by [10x Genomics](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-3-v3-1-chromium-x-with-intronic-reads-3-1-high). The single cell data was aligned and processed using the 
10x Genomics cellranger pipeline. 

The PBMC scRNA-seq dataset from 10x Genomics, along with other
needed files will downloaded and cached using  `pbmc_10x()`from the *raerdata* 
ExperimentHub package. For this vignette, the BAM file was subset to retain 
2 million alignments that overlap human RNA editing sites on chromosome 16.

`pbmc_10x()` returns a list containing a `BamFile` object, a `GRanges` object 
with known RNA editing sites from the `REDIportal` database @Mansi2021-za, and 
a `SingleCellExperiment` populated with the gene expression data and cell type 
annotations. 

```{r message=FALSE}
library(raer)
library(raerdata)

pbmc <- pbmc_10x()

pbmc_bam <- pbmc$bam
editing_sites <- pbmc$sites
sce <- pbmc$sce
```

This dataset contains T-cell, B-cells, and monocyte cell populations. 

```{r}
library(scater)
library(SingleCellExperiment)
library(GenomeInfoDb)
plotUMAP(sce, colour_by = "celltype")
```

## Specifying sites to quantify

Next we'll select editing sites to quantify. For this analysis we will use RNA 
editing sites cataloged in the REDIportal database @Mansi2021-za.

```{r}
editing_sites
```

The sites to quantify are specified using a custom formatted GRanges object 
with 1 base intervals, a strand (+ or -), and supplemented with metadata 
columns named `REF` and `ALT` containing the reference and alternate base to 
query. In this case we are only interested in A->I editing, so we set the ref 
and alt to `A` and `G`. Note that the `REF` and `ALT` bases are in reference to 
strand. For a `-` strand interval the bases should be the complement of the `+` 
strand bases. Also note that these bases can be stored as traditional character
vectors or as `Rle()` objects to save memory.  

```{r}
editing_sites$REF <- Rle("A")
editing_sites$ALT <- Rle("G")
editing_sites
```

## Quantifying sites in single cells using *pileup_cells*

`pileup_cells()` quantifies edited and non-edited UMI counts per cell barcode, 
then organizes the site counts into a `SingleCellExperiment` object. 
`pileup_cells()` accepts a `FilterParam()` object that specifies parameters for 
multiple read-level and site-level filtering and processing options. Note that 
`pileup_cells()` is strand sensitive by default, so it is important to ensure 
that the strand of the input sites is correctly annotated, and that the 
`library-type` is set correctly for the strandedness of the sequencing library. 
For 10x Genomics data, the library type is set to `fr-second-strand`, 
indicating that the strand of the BAM alignments is the same strand as the RNA. 
See [quantifying Smart-seq2 scRNA-seq libraries](#ss2) for an example of using 
*pileup_cells()* to handle unstranded data and data from libraries that produce
1 BAM file for each cell.  

To exclude duplicate reads derived from PCR, `pileup_cells()` can use a UMI 
sequence, supplied via the `umi_tag` argument, to only count 1 read for each 
CB-UMI pair at each editing site position. Note however that by default the 
`bam_flags` argument for the `FilterParam` class is set to **include** duplicate
reads when using `pileup_cells()`. Droplet single cell libraries produce 
multiple cDNA fragments from a single reverse transcription event. The cDNA 
fragments have different alignment positions due to fragmentation despite being
derived from a single RNA molecule. scRNA-seq data processed by cellranger from
10x Genomics will set the "Not primary alignment" BAM flag for every read except
one read for each UMI. If duplicates are removed based on this BAM flag, then 
only 1 representative fragment for a single UMI will be examined, which will 
exclude many valid regions.

To reduce processing time many functions in the *raer* package operate in 
parallel across multiple chromosomes. To enable parallel processing, a 
`BiocParallel` backend can be supplied via the `BPPARAM` argument (e.g. 
`MultiCoreParam()`). 

```{r pileup_cells}
outdir <- file.path(tempdir(), "sc_edits")
cbs <- colnames(sce)

params <- FilterParam(
    min_mapq = 255, # required alignment MAPQ score
    library_type = "fr-second-strand", # library type
    min_variant_reads = 1
)

e_sce <- pileup_cells(
    bamfile = pbmc_bam,
    sites = editing_sites,
    cell_barcodes = cbs,
    output_directory = outdir,
    cb_tag = "CB",
    umi_tag = "UB",
    param = params
)
e_sce
```

The outputs from `pileup_cells()` are a `SingleCellExperiment` object populated 
with  `nRef` and `nAlt` assays containing the base counts for the reference 
(unedited) and alternate (edited) alleles at each position. 

The sparseMatrices are also written to files, at a directory specified by 
`output_directory`, which can be loaded into R using the `read_sparray()` 
function.

```{r}
dir(outdir)
read_sparray(
    file.path(outdir, "counts.mtx.gz"),
    file.path(outdir, "sites.txt.gz"),
    file.path(outdir, "barcodes.txt.gz")
)
```

Next we'll filter the single cell editing dataset to find sites with an editing 
event in at least 5 cells and add the editing counts to the gene expression 
SingleCellExperiment as an `altExp()`. 

```{r}
e_sce <- e_sce[rowSums(assays(e_sce)$nAlt > 0) >= 5, ]
e_sce <- calc_edit_frequency(e_sce,
    edit_from = "Ref",
    edit_to = "Alt",
    replace_na = FALSE
)
altExp(sce) <- e_sce[, colnames(sce)]
```

With the editing sites added to the gene expression SingleCellExperiment we can 
use plotting and other methods previously developed for single cell analysis. 
Here we'll visualize editing sites with the highest edited read counts.

```{r}
to_plot <- rownames(altExp(sce))[order(rowSums(assay(altExp(sce), "nAlt")),
    decreasing = TRUE
)]

lapply(to_plot[1:5], function(x) {
    plotUMAP(sce, colour_by = x, by_exprs_values = "nAlt")
})
```

Alternatively we can view these top edited sites as a Heatmap, showing the 
average number of edited reads per site in each cell type. 

```{r, fig.height = 7}
altExp(sce)$celltype <- sce$celltype

plotGroupedHeatmap(altExp(sce),
    features = to_plot[1:25],
    group = "celltype",
    exprs_values = "nAlt"
)
```

*raer* provides additional tools to examine cell type specific editing. 

- `find_scde_sites()` will perform statistical testing to identify sites with 
different editing frequencies between clusters/cell types.  

- `calc_scAEI()` will calculate the Alu Editing Index (`AEI`) metric in single 
cells.   

**If the editing sites of interest are not known, we recommend the following 
approach. First, treat the single cell data as a bulk RNA-seq experiment, and 
follow approaches described in the [Novel editing site detection](#novel) to 
identify putative editing sites. Then query these sites in single cell mode 
using `pileup_cells()`**

## Quantifying sites in Smart-seq2 libaries {#ss2}

`pileup_cells()` can also process Smart-seq2 single cell libraries. These 
datasets typically store data from each cell in separate BAM files and the 
library type for these alignments are generally unstranded. To process these 
datasets the `library-type` should be set to `unstranded`, and the reference 
editing sites need to be reported all on the `+` strand. 

For example, the editing sites on the minus strand will need to be complemented 
(set as T -> C rather than A -> G). Additionally the `umi_tag` and `cb_tag` 
arguments should be set as follows to disable UMI and cell barcode detection. 

To illustrate this functionality, we will reprocess the 10x Genomics pbmc 
dataset, treating the data as mock Smart-seq2 data from 3 cells.

```{r}
is_minus <- strand(editing_sites) == "-"
editing_sites[is_minus]$REF <- "T"
editing_sites[is_minus]$ALT <- "C"
strand(editing_sites[is_minus]) <- "+"

fp <- FilterParam(
    library_type = "unstranded",
    min_mapq = 255,
    min_variant_reads = 1
)

ss2_bams <- c(pbmc_bam, pbmc_bam, pbmc_bam)
cell_ids <- c("cell1", "cell2", "cell3")

pileup_cells(
    bamfiles = ss2_bams,
    cell_barcodes = cell_ids,
    sites = editing_sites,
    umi_tag = NULL, # no UMI tag in most Smart-seq2 libraries
    cb_tag = NULL, # no cell barcode tag
    param = fp,
    output_directory = outdir
)
```


# Quantifying RNA editing sites in bulk RNA-Seq {#bulk}

Next we will perform a reanalysis of a published bulk RNA-seq dataset using the 
*raer* package. The `pileup_sites()` function enable quantification of base 
counts from bulk RNA-seq data and can be used to identify novel sites 
(see [Novel editing site detection](#novel)).

For this reanalysis, we will examine a bulk RNA-seq dataset from accession [GSE99249](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99249), which 
consists of RNA-seq data from ADAR1 mutants and control human cell lines, 
conditionally treated with Interferon-Beta. We will examine data from two 
genotypes, ADAR1 WT and KO, both treated with Interferon-B, with triplicate 
samples. 

Aligned BAM files and other necessary files have been preprocessed for this 
vignette and are available using `GSE99249()` from the `raerdata` package. 
Calling `GSE99249()` will downloaded and cache the necessary files and return 
a list containing the data. 

```{r}
ifnb <- GSE99249()
names(ifnb)
```

`bams` contains a vector of `BamFile` objects with the paths to each BAM file. 
These BAM files are a subset of the full BAM files, containing alignments from 
chromosome 18.

```{r}
bam_files <- ifnb$bams
names(bam_files)
```

To quantify editing sites we will need a FASTA file to compare read alignments 
to the reference sequence. For space reasons we'll use a FASTA file containing 
only chromosome 18 for this demo. 

```{r}
fafn <- ifnb$fasta
```

We will again use the database of known human editing sites from REDIPortal, 
only processing those from `chr18`. 

```{r}
editing_sites <- ifnb$sites
chr_18_editing_sites <- keepSeqlevels(editing_sites, "chr18",
    pruning.mode = "coarse"
)
```

## Generate editing site read counts using *pileup_sites*

The `pileup_sites()` function will process BAM files and calculate base counts 
at each supplied position. The `FilterParam()` will again be used to specify 
parameters to exclude reads and bases based on commonly used filters for 
detecting RNA-editing events. Specific regions can also be queried using the
`region` argument which accepts a samtools style region specification string 
(e.g. `chr` or `chr:start-end`).

```{r}
fp <- FilterParam(
    only_keep_variants = TRUE, # only report sites with variants
    trim_5p = 5, # bases to remove from 5' or 3' end
    trim_3p = 5,
    min_base_quality = 30, # minimum base quality score
    min_mapq = 255, # minimum MAPQ read score
    library_type = "fr-first-strand", # library type
    min_splice_overhang = 10 # minimum required splice site overhang
)

rse <- pileup_sites(bam_files,
    fasta = fafn,
    sites = chr_18_editing_sites,
    chroms = "chr18",
    param = fp
)

rse
```


Pileup data is stored in a `RangedSummarizedExperiment` object which facilitates
comparisons across samples and conveniently stores genomic coordinate 
information. The `rowData()` and `rowRanges()` slots are populated with the 
reference base (`REF`) and information related to each editing site, and 
similarly the `colData()` slot can be used to store sample metadata. 

The base counts and other information are stored in different assays within the 
object. `REF` and `ALT` bases and base count data are all provided in a stand 
specific fashion depending on the supplied `library-type` parameter. The `REF` 
and `ALT` bases are in reference to the strand. 

```{r}
assays(rse)
assay(rse, "nA")[1:5, ]
assay(rse, "nG")[1:5, ]
```

Next we'll add sample information which will be needed for identify sites with 
differential editing frequencies across genotypes.

```{r}
colData(rse)$treatment <- "Interferon beta"
colData(rse)$genotype <- factor(rep(c("ADAR1KO", "Wildtype"), each = 3))
colData(rse)
```

## Prepare for differential editing 

*raer* provides the `calc_edit_frequency` function to calculate the editing 
percentage and read depth at each position. With the `drop = TRUE` argument we 
will also exclude non-adenosine sites. The editing frequencies will not be used 
for differential editing analysis, which will be conducted using the raw counts,
however these are useful for filtering and visualization. `calc_edit_frequency` 
will add two additional assays to the object, the editing frequency 
(`edit_freq`) and read `depth`, both computed based on the  `edit_to` and 
`edit_from` counts. 


```{r}
rse <- calc_edit_frequency(rse,
    edit_from = "A",
    edit_to = "G",
    drop = TRUE
)
```

We'll next filter to exclude low frequency editing events. For this analysis we 
require that an editing site shows editing in at least 1 sample and has at 
least 5 counts in each sample.

```{r}
has_editing <- rowSums(assay(rse, "edit_freq") > 0) >= 1
has_depth <- rowSums(assay(rse, "depth") >= 5) == ncol(rse)

rse <- rse[has_editing & has_depth, ]
rse
```

Once the object has been filtered, we will transform it into an alternative data
structure for differential editing analysis that contains an assay with read 
counts of both the `ALT` and `REF` alleles in a single matrix.

```{r}
deobj <- make_de_object(rse, min_prop = 0.05, min_samples = 3)

assay(deobj, "counts")[1:3, c(1, 7, 2, 8)]
```

## Run differential editing

At this stage, you can use the object to perform differential yourself or use 
`find_de_sites()` to use `edgeR` or `DESeq2` to identify condition specific 
editing events. For differential editing, we use the design 
`design <- ~0 + condition:sample + condition:count` and perform testing to 
compare the edited read counts against unedited read counts. 

```{r}
deobj$sample <- factor(deobj$sample)
de_results <- find_de_sites(deobj,
    test = "DESeq2",
    sample_col = "sample",
    condition_col = "genotype",
    condition_control = "Wildtype",
    condition_treatment = "ADAR1KO"
)
```

This returns a list containing the dds object, the full results, the significant
results, and the model matrix. 

```{r}
de_results$sig_results[1:5, ]
```

```{r, fig.height=7, fig.width=5}
library(ComplexHeatmap)
top_sites <- rownames(de_results$sig_results)[1:20]

Heatmap(assay(rse, "edit_freq")[top_sites, ],
    name = "editing frequency",
    column_labels = paste0(rse$genotype, "-", rse$treatment)
)
```

As anticipated the top identified sites are those with greatly reduced editing 
in the ADAR1KO samples.

## Examine overall editing activites using the Alu Editing Index

For some studies it is informative to assess the overall ADAR editing activity 
in addition to examining individual editing sites. The **Alu Editing Index** 
(AEI), developed by @Roth2019-yu, is a metric that summarizes that amount of 
editing occurring at ALU elements which account for the vast majority of A-to-I 
editing (> 99%) in humans.

*raer* provides `calc_AEI()`, based on this approach, to calculate the AEI 
metric. Many of the same parameters used for `pileup_sites()` are available in
`calc_AEI()`. 

First we will use the `AnnotationHub` package to obtain coordinates for ALU 
elements in the human genome. For this example we will only examine a subset of
ALUs on `chr18`. We will also use a `SNPlocs` package, based on the dbSNP 
database, to exclude any SNPs overlapping the ALU elements from the AEI 
calculation. The SNP coordinates are `NCBI` based, whereas the `ALU` elements 
are based on `hg38`, we will therefore convert between the two as needed to 
obtain SNP and ALU element coordinates based on `hg38`. 


```{r}
library(AnnotationHub)
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)

ah <- AnnotationHub()
rmsk_hg38 <- ah[["AH99003"]]

alus <- rmsk_hg38[rmsk_hg38$repFamily == "Alu", ]
alus <- alus[seqnames(alus) == "chr18", ]
alus <- keepStandardChromosomes(alus)
alus <- alus[1:1000, ]

seqlevelsStyle(alus) <- "NCBI"
genome(alus) <- "GRCh38.p2"

alu_snps <- get_overlapping_snps(alus, SNPlocs.Hsapiens.dbSNP144.GRCh38)

seqlevelsStyle(alu_snps) <- "UCSC"
alu_snps[1:3, ]

seqlevelsStyle(alus) <- "UCSC"
alus[1:3, ]
```

`calc_AEI()` returns a matrix containing the AEI calculated for all allelic 
combinations and a more detailed table containing values for each chromosome.

```{r}
alu_index <- calc_AEI(bam_files,
    fasta = fafn,
    snp_db = alu_snps,
    alu_ranges = alus,
    param = fp
)
names(alu_index)
```

```{r}
Heatmap(alu_index$AEI,
    name = "AEI",
    row_labels = rse$genotype[match(rownames(alu_index$AEI), rse$sample)]
)
```


The AEI in the `Wildtype` samples is highest for `A-to-G`, and sharply 
reduced in the `ADAR1KO` samples as expected.

# Novel RNA editing site detection {#novel}

Next we will demonstrate how to identify novel RNA editing sites using the 
*raer* package. It is best practice to have a matched DNA sequencing dataset to
exclude sample specific genetic variation and common alignment artifacts. 
However, high confidence editing sites can also be identified if the dataset 
contains many samples and there are high coverage SNP databases for the organism
queried. Additionally high confidence editing sites can also be identified if a
dataset contains a sample with reduced or absent ADAR activity. A false-positive
rate estimate can be obtained by examining the proportion of A->I editing sites 
recovered, relative to other variants, (e.g. G->C, C->A).

In this analysis a published RNA-seq and whole genome sequencing dataset will be
analyzed. High coverage whole-genome sequencing was conducted [ERR262997](https://www.ebi.ac.uk/ena/browser/view/ERR262997?show=reads) along 
with paired-end RNA-seq [SRR1258218](https://www.ebi.ac.uk/ena/browser/view/SRR1258218?show=reads)
in a human cell line (`NA12878`).

Aligned BAM files, a genome FASTA file, and a GRanges object containing SNPs 
corresponding to the first 1Mb region of chr4 have been prepared for this 
vignette and can be downloaded and cached using `NA12878()`.

```{r}
rna_wgs <- NA12878()
names(rna_wgs)
```

Additionally we will use the following additional annotation resources:  
  
- A database of known SNPs, for example from the 
`SNPlocs.Hsapiens.dbSNP144.GRCh38` package. Due to space and memory 
constraints in this vignette we will only examine SNPs from the first 1Mb
region of chr4.  

- `TxDb.Hsapiens.UCSC.hg38.knownGene`, a database of transcript models. 
Alternatively these can be generated from a `.gtf` file using 
`makeTxDbFromGRanges()` from the `txdbmaker` package.   

- RepeatMasker annotations, which can be obtained from the `AnnotationHub()` 
for hg38, as shown in the [bulk RNA-seq](#bulk) tutorial. 

```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
chr4snps <- rna_wgs$snps
```

The `pileup_sites()` function accept multiple BAM files, here we supply one 
from RNA-seq, and one from whole genome sequencing. A subset of the filtering 
parameters (`FilterParam()`) can accept multiple arguments matched to each of 
the input BAM files. This allows us to have distinct settings for the WGS and 
RNA-seq BAM files.

```{r}
bams <- rna_wgs$bams
names(bams) <- c("rna", "dna")
fp <- FilterParam(
    min_depth = 1, # minimum read depth across all samples
    min_base_quality = 30, # minimum base quality
    min_mapq = c(255, 30), # minimum MAPQ for each BAM file
    library_type = c("fr-first-strand", "unstranded"), # sample library-types
    trim_5p = 5, # bases to trim from 5' end of alignment
    trim_3p = 5, # bases to trim from 3' end of alignment
    indel_dist = 4, # ignore read if contains an indel within distance from site
    min_splice_overhang = 10, # required spliced alignment overhang
    read_bqual = c(0.25, 20), # fraction of the read with base quality
    only_keep_variants = c(TRUE, FALSE), # report site if rnaseq BAM has variant
    report_multiallelic = FALSE, # exclude sites with multiple variant alleles
)

rse <- pileup_sites(bams,
    fasta = rna_wgs$fasta,
    chroms = "chr4",
    param = fp
)

rse
```


Next we filter to keep those sites with a variant in the RNA-seq, but no variant
in the DNA-seq, and a minimum of 5 reads covering the site in the DNA-seq. The 
DNA-seq data is unstranded, and therefore will be reported on the "+" strand 
whereas the RNA-seq data will be reported on expressing RNA strand. 
We therefore use `subsetByOverlaps(..., ignore.strand = TRUE)` to retain sites
passing these DNA-seq based filters independent of strand. 

```{r}
to_keep <- (assay(rse, "nRef")[, "dna"] >= 5 &
    assay(rse, "ALT")[, "dna"] == "-")

rse <- subsetByOverlaps(rse, rse[to_keep, ], ignore.strand = TRUE)
nrow(rse)
```

Next we filter to remove any multiallelic sites. These sites are stored as 
comma-separated strings in the `ALT` assay (e.g. `G,C`). Non-variant sites are 
stored as `-`. `filter_multiallelic()` will remove any sites that have multiple 
variants in the samples present in the `summarizedExperiment` object. It will 
add a new column to the `rowData()` to indicate the variant for each site, and 
will calculate an `edit_freq` assay with variant allele frequencies for each 
sample. 

```{r}
rse <- filter_multiallelic(rse)
rse <- calc_edit_frequency(rse)
rowData(rse)
```


Next we'll remove sites in simple repeat regions. We will add repeat information
to the `rowData()` using the `annot_from_gr()` function.

```{r}
# subset both to chromosome 4 to avoid warning about different seqlevels
seqlevels(rse, pruning.mode = "coarse") <- "chr4"
seqlevels(rmsk_hg38, pruning.mode = "coarse") <- "chr4"

rse <- annot_from_gr(rse,
    rmsk_hg38,
    cols_to_map = c("repName", "repClass", "repFamily")
)

rowData(rse)[c("repName", "repFamily")]
```


```{r}
rse <- rse[!rowData(rse)$repFamily %in% c("Simple_repeat", "Low_complexity")]
```

Next we'll remove sites adjacent to other sites with different variant types. 
For example if an A->G variant is located proximal to a C->T variant then the 
variants will be removed.  

```{r}
seqlevels(txdb, pruning.mode = "coarse") <- "chr4"
rse <- filter_clustered_variants(rse, txdb, variant_dist = 100)
rse
```

Next, we'll annotate if the site is a known SNP and remove any known SNPs. If 
using a SNPlocs package you can use the `annot_snps()` function, which also 
allows one to compare the variant base to the SNP variant base. Here we will 
use the `annot_from_gr()` function to annotate using the `chr4snps` object 
and coarsely remove any editing sites overlapping the same position as a SNP. 

```{r}
rse <- annot_from_gr(rse, chr4snps, "name")
rowData(rse)[c("name")]

rse <- rse[is.na(rowData(rse)$name), ]
rse
```

Lastly, we'll further filter the edit sites to require that the editing 
frequency is > 0.05 and that at least 2 reads support the editing site. 

```{r}
to_keep <- assay(rse, "edit_freq")[, 1] > 0.05
rse <- rse[to_keep, ]

rse <- rse[assay(rse, "nAlt")[, 1] >= 2]
```

With the above filtering approach we obtain a set of putative editing sites. 
The specificity of the filtering can be estimated by examining the number of 
A-to-G changes relative to other variants. A-to-I RNA editing is more common 
than other types of editing (e.g. C->U editing by APOBEC enzymes) in human 
datasets so the majority of the variants should by A-to-G. In this vignette 
data all of the identified sites are A-to-G.

```{r}
rowRanges(rse)
```

Finally once a set of sites has been identified, additional packages in the 
Bioconductor ecosystem, such as the [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html)
package, can be used to determine the genomic context and potential molecular 
consequences of the editing event. 


# R session information

```{r}
sessionInfo()
```