This vignette describes some of the more advanced parameters and functions of the rgenie package.
If you have a question that is not covered here, please contact me at jeremy37 at gmail.com.
library(rgenie)
library(readr, quietly = T) # Not required, but we use it below
library(magrittr, quietly = T)
First we load the MUL1 example, which targeted a 5’ UTR SNP in MUL1, and has 8 cDNA and 4 gDNA PCR replicates.
download_example(dir = "~/genie_example", name = "MUL1")
regions = readr::read_tsv("~/genie_example/MUL1/mul1.genie_regions.tsv")
replicates = readr::read_tsv("~/genie_example/MUL1/mul1.genie_replicates.tsv")
head(regions)
#> # A tibble: 1 x 9
#> name sequence_name start end highlight_site cut_site hdr_allele_prof…
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 MUL1… MUL1 1 175 135 133 ---------------…
#> # … with 2 more variables: wt_allele_profile <chr>, ref_sequence <chr>
head(replicates)
#> # A tibble: 6 x 5
#> name replicate type vp_extraction bam
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 MUL1_rs6700… c1.1 cDNA 1 MUL1_rs6700034_cDNA_rep1_pcr1.sort…
#> 2 MUL1_rs6700… c1.2 cDNA 1 MUL1_rs6700034_cDNA_rep1_pcr2.sort…
#> 3 MUL1_rs6700… c1.3 cDNA 1 MUL1_rs6700034_cDNA_rep1_pcr3.sort…
#> 4 MUL1_rs6700… c1.4 cDNA 1 MUL1_rs6700034_cDNA_rep1_pcr4.sort…
#> 5 MUL1_rs6700… c2.5 cDNA 2 MUL1_rs6700034_cDNA_rep2_pcr5.sort…
#> 6 MUL1_rs6700… c2.6 cDNA 2 MUL1_rs6700034_cDNA_rep2_pcr6.sort…
We remove 1 replicate that we discovered in the Introductory vignette was of poor quality.
The sequence to be matched for the HDR and WT alleles is determined by the hdr_allele_profile
and wt_allele_profile
values in the regions
parameter, as well as by the parameters required_match_left
and required_match_right
.
We first do a grep analysis.
setwd("~/genie_example/MUL1/")
grep_results = grep_analysis(regions,
replicates,
required_match_left = 10,
required_match_right = 10,
min_mapq = 0,
quiet = T)
grep_result = grep_results[[1]]
The result shows the sequences that were searched for to match HDR and WT alleles.
grep_result$hdr_seq_grep
#> [1] "ACCGCACCCCACCGGCCTAAC"
grep_result$wt_seq_grep
#> [1] "ACCGCACCCCCCCGGCCTAAC"
If more than one nucleotide is specified in an allele profile, then the region that must exactly match extends to the left of the leftmost position by required_match_left
bases, and to the right of the rightmost position by required_match_right
bases. Here we set a required nucleotide match at position 132 in the amplicon for the HDR allele, which is 3 to the left of the SNP position, 135.
setwd("~/genie_example/MUL1/")
new_hdr_allele_profile = regions$hdr_allele_profile
substr(new_hdr_allele_profile, 132, 132) = "C"
grep_results = grep_analysis(regions %>% dplyr::mutate(hdr_allele_profile = new_hdr_allele_profile),
replicates,
required_match_left = 10,
required_match_right = 10,
min_mapq = 0,
quiet = T)
grep_result = grep_results[[1]]
We can see in the results that the grep sequences for the HDR allele is extended left by 3 nucleotides.
grep_result$hdr_seq_grep
#> [1] "AGGACCGCACCCCACCGGCCTAAC"
grep_result$wt_seq_grep
#> [1] "ACCGCACCCCCCCGGCCTAAC"
The same principle applies in a deletion analysis. The only difference in a deletion analysis is that the relevant bases must also be aligned in the correct positions. Therefore it may be possible to use a shorter “required match” region in a deletion analysis, which would therefore include reads which have mismatches outside of the “required match” region. If using lenient options for the required match parameters, we recommend comparing the results to those obtained with a longer required match region, to ensure that the results are consistent.
It is possible to use asymmetric values for required_match_left
and required_match_right
.
The deletion_analysis
function has an option crispr_del_window
, which determines the window around the cut site (cut site +/- crispr_del_window
) within which any deletion is considered a CRISPR deletion. This is an important concept.
It would be possible to simply consider that ANY deletion in a read within the amplicon may relate to the CRISPR-Cas9 editing that was done, and to consider such reads as deletion reads. This can be achieved by setting crispr_del_window = NA
. However, there are two drawbacks to this:
Recommendation: We recommend using the widest crispr_del_window
possible that avoids including too many spurious deletions. A deletion which overlaps any part of the CRISPR deletion window (cut site +/- crispr_del_window
) will be considered as a deletion read. Deletions which do not overlap this region will not be considered as deletions, and the reads may be considered as either WT or HDR reads. This allows you to effectively ignore spurious deletions that occur far from the site of interest.
The common expectation is that deletions which occur as a result of a double-strand break (DSB) and non-homologous end joining (NHEJ) will emanate from the cut site. However, our experience is that a fraction of reads will have deletions that begin and end as many as 20 bp or farther from the cut site. If a small value is used for crispr_del_window
, then these will NOT be considered deletions (even though they are Cas9-induced), and the reads may be considered as HDR or WT. This is the reason that by default a very large value is used for crispr_del_window
.
Note that this concept of counting deletions as CRISPR-induced or not is distinct from the “deletion span” concept discussed below.
In some cases we may only be interested in small deletions that occur within a window near the SNP site. That is, we wish to exclude large deletions from our analysis of the effect size attributable to deletions. One way to achieve this is to define the “deletion span” in which to count deletions. This is done with the parameters del_span_start
and del_span_end
, which define a window relative to the highlight_site
value defined by the region of interest. Deletions which are entirely contained within this window are counted towards the “deletion window” statistical analysis. Note that deletions outside of this window will still be considered as deletions (and will never by HDR or WT alleles), unlike the crispr_del_window
parameter described above.
setwd("~/genie_example/MUL1/")
del_results = rgenie::deletion_analysis(regions,
replicates,
del_span_start = -20,
del_span_end = 20,
quiet = TRUE, allele_profile = T)
rgenie::deletion_alleles_plot(del_results[[1]])
Deletions that are included in the deletion span are highlighted in red in the deletion alleles plot above.
The statistics for these deletions are reported in the deletion summary plot, as the second DEL/WT value on the right-hand side, with the relevant range indicated.
They are also reported in the region_stats
result object.
Effect size and p value for deletions within the “del_span” window:
del_results[[1]]$region_stats$del_window_effect
#> [1] 2.340977
del_results[[1]]$region_stats$del_window_pval
#> [1] 0.0001832801
You can see that when a smaller del_span is used, fewer deletions are included and the estimated effect has changed, although it is consistent with the previous estimate in this case since most of the reads come from short deletions within 10 bp of the SNP site.
setwd("~/genie_example/MUL1/")
del_results = rgenie::deletion_analysis(regions,
replicates,
del_span_start = -10,
del_span_end = 10,
quiet = TRUE)
rgenie::deletion_alleles_plot(del_results[[1]])
Effect size and p value for deletions within the smaller “del_span” window:
Here are a few notes about the differences between these two types of analysis.
crispr_del_window
, noted above) cannot be considered as HDR or WT. In a grep analysis such reads could still potentially match the HDR or WT sequence.min_aligned_bases
parameter. (Deleted bases are no considered as matching.)Different stages of the GenIE experimental process can contribute to noise in the quantification of alleles. Our experience is that the largest contribution is from the PCR itself, which is why it is important to do multiple replicates.
In our paper, we also performed multiple replicates of different steps:
You could estimate the variance attributable to any experimental step for which you have multiple replicates in each category, such as multiple PCR replicates from distinct RNA extractions. This is done using the result object after performing a deletion analysis.
vc = rgenie::get_variance_components(del_results[[1]],
replicates,
allele_min_reads = 100,
allele_min_fraction = 0.001)
#> Region MUL1_rs6700034, cDNA: fitting variance components with model:
#> UDP_fraction ~ (1|extraction)
#> Dividing work into 90 chunks...
#>
#> Total:6 s
#> Warning in fit_variance_components(replicate.udp.spread.df =
#> replicate.udp.type.spread.df %>% : Column vp_extraction cannot be used
#> in variance components analysis since it has only one level for region
#> MUL1_rs6700034.
#> Warning in fit_variance_components(replicate.udp.spread.df =
#> replicate.udp.type.spread.df %>% : Region MUL1_rs6700034: no replicate variables
#> are available to use in variance components analysis.
The replicates
table must have columns with names that begin with “vp_”, and which have values for the features you want to estimate variance components for. In this example, two RNA extractions were done, and so the cDNA replicates have one of two values for the “vp_extraction” column.
The get_variance_components
function returns the variance components partitioning for each unique allele (including HDR, WT, and all deletions) which has at least allele_min_reads
across all replicates, and which comprises at least allele_min_fraction
as a fraction of all reads. Variance components are computed separately for cDNA and gDNA.
vc$vp_cDNA %>% dplyr::select(name, frac, extraction, Residuals, udp)
#> # A tibble: 90 x 5
#> name frac extraction Residuals udp
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 MUL1_rs67… 0.00107 0 1 ------------------------------------…
#> 2 MUL1_rs67… 0.0899 0 1 ------------------------------------…
#> 3 MUL1_rs67… 0.0525 0 1 ------------------------------------…
#> 4 MUL1_rs67… 0.00224 0 1 ------------------------------------…
#> 5 MUL1_rs67… 0.00272 0 1 ------------------------------------…
#> 6 MUL1_rs67… 0.00388 0.0902 0.910 ------------------------------------…
#> 7 MUL1_rs67… 0.00129 0 1 ------------------------------------…
#> 8 MUL1_rs67… 0.00111 0 1 ------------------------------------…
#> 9 MUL1_rs67… 0.00130 0 1 ------------------------------------…
#> 10 MUL1_rs67… 0.00161 0 1 ------------------------------------…
#> # … with 80 more rows
The “udp” column gives the allele profile, and “frac” is the average allele fraction across replicates.
These values can also be plotted.
rgenie::variance_components_plot(vc)
#> Warning: Removed 15 rows containing missing values (geom_point).
Each point in the plot is a distinct allele, and if you included multiple “vp_” columns, then each will be shown.
In this example, we could not estimate variance components for gDNA, because only one gDNA extraction was done.
You can safely ignore the following warning that you might get from the plotting function: Removed 13 rows containing missing values (geom_point).
You can estimate the power to detect a change in transcription for a given allele relative to WT, based on:
Alleles with higher abundance typically have lower noise, which we characterise by the coefficient of variation (standard deviation / mean) of their allele fraction across replicates. We then fit a curve to be able to predict the coefficient of variation for any allele at a given fraction of reads.
A call to power_plots
returns three plots. The first of these, cv_plot
shows the fit between allele read fraction and coefficient of variation (CV).
plots = rgenie::power_plots(del_results[[1]])
#> Warning: Multiple drawing groups in `geom_function()`. Did you use the correct
#> `group`, `colour`, or `fill` aesthetics?
#> Warning: Multiple drawing groups in `geom_function()`. Did you use the correct
#> `group`, `colour`, or `fill` aesthetics?
#> Warning: Multiple drawing groups in `geom_function()`. Did you use the correct
#> `group`, `colour`, or `fill` aesthetics?
plots$cv_plot
The HDR and WT alleles are shown with a triangle and square, respectively. Separate curves are fit for cDNA and gDNA, since noise levels are usually higher in cDNA, especially when amplifying an intronic region (nascent transcript).
The power_plot
shows the estimated power to detect an effect of a given size and allele fraction (read %). Here the effect size refers to the fold increase in expression in RNA relative to gDNA. (We have not shown power estimates for a similar fold decrease in expression in RNA.)
The replicate_allocation_plot
estimates how replicates should be allocated between cDNA and gDNA, assuming a given number of replicates. In general our experience is that you should have at least twice the number of cDNA replicates as gDNA, as long as you have at least 3 gDNA replicates, since allele fraction estimates in cDNA are higher. In practice we have typically conducted 8 cDNA and 4 gDNA PCR replicates for each targeted region.
You can get tables with the data shown in the plots by a call to power_analysis
.