VariantExperiment 1.0.0
This vignette is about the conversion methods and statistical
functions that are enabled on VariantExperiment objects, for the
analysis of genotyping or DNA-seq data sets. If you want to learn more
about the implementation of the VariantExperiment class, and basic
methods, please refer to the other vignette.
The package of VariantExperiment is implemented to represent VCF/GDS
files using standard SummarizedExperiment metaphor. It is a container
for high-through genetic/genomic data with GDS back-end, and is
interoperable with the statistical functions/methods that are
implemented in SeqArray and SeqVarTools that are designed for GDS
data. The SummarizedExperiment metaphor also gets the benefit of
common manipulations within Bioconductor ecosystem that are more
user-friendly.
First, we load the package into R session.
library(VariantExperiment)To take advantage of the functions and methods that are defined on
SummarizedExperiment, from which the VariantExperiment extends, we
have defined coercion methods from VCF and GDS to
VariantExperiment.
VCF to VariantExperimentThe coercion function of makeVariantExperimentFromVCF could
convert the VCF file directly into VariantExperiment object. To
achieve the best storage efficiency, the assay data are saved in
GDSArray format, and the annotation data are saved in
DelayedDataFrame format (with no option of ordinary DataFrame),
which could be retrieved by rowData() for feature related
annotations and colData() for sample related annotations (Only when
sample.info argument is specified).
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve## class: VariantExperiment 
## dim: 1348 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):Internally, the VCF file was converted into a on-disk GDS file,
which could be retrieved by:
gdsfile(ve)## [1] "/tmp/Rtmp79zWzD/file5f7c73531798/se.gds"assay data is in GDSArray format:
assay(ve, 1)## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       0       3   .       3       3       3       3
## 
## ,,2
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       1       3   .       3       3       3       3feature-related annotation is in DelayedDataFrame format:
rowData(ve)## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS
## 2    rs114390380              A              G        NaN       PASS
## 3      rs1320571              A              G        NaN       PASS
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS
## 1347 rs116581756              A              G        NaN       PASS
## 1348   rs5771206              G              A        NaN       PASS
##         info_AA    info_AC    info_AN    info_DP   info_HM2   info_HM3
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             T          4        114       3251          0          0
## 2             G          1        106       2676          0          0
## 3             A          6        154       7610          1          1
## ...         ...        ...        ...        ...        ...        ...
## 1346          C         11        142        823          0          0
## 1347          G          1        152       1257          0          0
## 1348          G          1          6         48          0          0
##         info_OR     info_GP    info_BN
##      <GDSArray>  <GDSArray> <GDSArray>
## 1                 1:1115503        132
## 2                 1:1115548        132
## 3                 1:1120431         88
## ...         ...         ...        ...
## 1346            22:45312345        116
## 1347            22:45312409        132
## 1348            22:50616806        114User could also have the opportunity to save the sample related
annotation info directly into the VariantExperiment object, by
providing the file path to the sample.info argument, and then
retrieve by colData().
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)## Warning in (function (node, name, val = NULL, storage = storage.mode(val), :
## Missing characters are converted to "".colData(ve)## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892Arguments could be specified to take only certain info columns or format columns from the vcf file.
ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)## DelayedDataFrame with 1348 rows and 7 columns
##               ID            ALT            REF       QUAL     FILTER
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS
## 2    rs114390380              A              G        NaN       PASS
## 3      rs1320571              A              G        NaN       PASS
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS
## 1347 rs116581756              A              G        NaN       PASS
## 1348   rs5771206              G              A        NaN       PASS
##         info_OR     info_GP
##      <GDSArray>  <GDSArray>
## 1                 1:1115503
## 2                 1:1115548
## 3                 1:1120431
## ...         ...         ...
## 1346            22:45312345
## 1347            22:45312409
## 1348            22:50616806In the above example, only 2 info entries (“OR” and “GP”) are read
into the VariantExperiment object.
The start and count arguments could be used to specify the start
position and number of variants to read into Variantexperiment
object.
ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2## class: VariantExperiment 
## dim: 1000 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1000): 101 102 ... 1099 1100
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):For the above example, only 1000 variants are read into the
VariantExperiment object, starting from the position of 101.
GDS to VariantExperimentThe coercion function of makeVariantExperimentFromGDS coerces
GDS files into VariantExperiment objects directly, with the assay
data saved as GDSArray, and the rowData()/colData() in
DelayedDataFrame by default (with the option of ordinary DataFrame
object).
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve## class: VariantExperiment 
## dim: 1348 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): familyrowData(ve)## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS
## 2    rs114390380              A              G        NaN       PASS
## 3      rs1320571              A              G        NaN       PASS
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS
## 1347 rs116581756              A              G        NaN       PASS
## 1348   rs5771206              G              A        NaN       PASS
##         info_AA    info_AC    info_AN    info_DP   info_HM2   info_HM3
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             T          4        114       3251          0          0
## 2             G          1        106       2676          0          0
## 3             A          6        154       7610          1          1
## ...         ...        ...        ...        ...        ...        ...
## 1346          C         11        142        823          0          0
## 1347          G          1        152       1257          0          0
## 1348          G          1          6         48          0          0
##         info_OR     info_GP    info_BN
##      <GDSArray>  <GDSArray> <GDSArray>
## 1                 1:1115503        132
## 2                 1:1115548        132
## 3                 1:1120431         88
## ...         ...         ...        ...
## 1346            22:45312345        116
## 1347            22:45312409        132
## 1348            22:50616806        114colData(ve)## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
makeVariantExperimentFromGDS arguments could be retrieved by the
showAvailable() function with the gds file name as input.
showAvailable(gds)## CharacterList of length 4
## [["name"]] genotype/data phase/data annotation/format/DP/data
## [["rowDataColumns"]] ID ALT REF QUAL FILTER
## [["colDataColumns"]] family
## [["infoColumns"]] AA AC AN DP HM2 HM3 OR GP BNNote that the infoColumns from gds file will be saved as columns
inside the rowData(), with the prefix of
“info_”. rowDataOnDisk/colDataOnDisk could be set as FALSE to
save all annotation data in ordinary DataFrame format.
ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("ID", "ALT", "REF"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object ## DelayedDataFrame with 1348 rows and 6 columns
##               ID            ALT            REF    info_AC    info_AN
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T          4        114
## 2    rs114390380              A              G          1        106
## 3      rs1320571              A              G          6        154
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C         11        142
## 1347 rs116581756              A              G          1        152
## 1348   rs5771206              G              A          1          6
##         info_DP
##      <GDSArray>
## 1          3251
## 2          2676
## 3          7610
## ...         ...
## 1346        823
## 1347       1257
## 1348         48colData(ve3)  ## DataFrame object## DataFrame with 90 rows and 1 column
##              family
##         <character>
## NA06984        1328
## NA06985            
## NA06986       13291
## ...             ...
## NA12890        1463
## NA12891            
## NA12892VariantExperiment supports basic subsetting operations using [,
[[, $, and ranged-based subsetting operations using
subsetByOverlap.
NOTE that after a set of lazy operations, you need to call
saveVariantExperiment function to synchronize the on-disk file
associated with the in-memory representation by providing a file
path. Statistical functions could only work on synchronized
VariantExperiment object, or error will return.
Refer to the “VariantExperiment-class” vignette for more details.
Many statistical functions and methods are defined on
VariantExperiment objects, most of which has their generic defined
in Bioconductor package of SeqArray and SeqVarTools. These
functions could be called directly on VariantExperiment object as
input, with additional arguments to specify based on user’s need. More
details please refer to the vignettes of SeqArray and
SeqVarTools.
Here is a list of the statistical functions with brief description:
| statistical functions | Description | 
|---|---|
| seqAlleleFreq | Calculates the allele frequencies | 
| seqAlleleCount | Calculates the allele counts | 
| seqMissing | Calculates the missing rate for variant/sample | 
| seqNumAllele | Calculates the number of alleles (for ref/alt allele) | 
| hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants | 
| inbreedCoeff | Calculates the inbreeding coefficient by variant/sample | 
| pca | Calculates the eigenvalues and eignevectors with Principal Component Analysis | 
| titv | Calculate transition/transversion ratio overall or by sample | 
| refDosage | Calculate the dosage of reference allele (matrix with integers of 0/1/2) | 
| altDosage | Calculate the dosage of alternative allele (matrix with integers of 0/1/2) | 
| countSingletons | Count singleton variants for each sample | 
| heterozygosity | Calculate heterozygosity rate by sample or by variants | 
| homozygosity | Calculate homozygosity rate by sample or by variants | 
| meanBySample | Calculate the mean value of a variable by sample over all variants | 
| isSNV | Flag a single nucleotide variant | 
| isVariant | Locate which samples are variant for each site | 
Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.
## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)## [1] 0.083086053 0.096439169 0.009643917 0.094213650 0.129821958 0.022997033## hwe
hwe <- hwe(ve)
head(hwe)##   variant.id nAA nAa naa     afreq p            f
## 1          1  53   4   0 0.9649123 1 -0.036363636
## 2          2  52   1   0 0.9905660 1 -0.009523810
## 3          3  71   6   0 0.9610390 1 -0.040540541
## 4          4   1  16  56 0.1232877 1 -0.013888889
## 5          5  76  13   0 0.9269663 1 -0.078787879
## 6          6  88   1   0 0.9943820 1 -0.005649718## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)## [1] 4.352941 3.791667 3.439394 3.568966 3.750000 3.646154titv(ve, by.sample=FALSE)## [1] 3.562712## countSingletons
countSingletons(ve)##  [1]  2  2  7  5  1  5  3  5  2  2  8  2  4  3  6  5  9  5  3  5  5  5  7  0
## [25]  5  8 13  9  1  6  2  8  2  6  0  5  4  7  7  3  1  9  0  0  5  3  4  8
## [49]  6  9  6  4  7  8  5  7  3  5  2  2  6  8  2  6  4  3  7  4  3  3  8  2
## [73]  8  6  1  5  1  9  8  5  0  1  2  2  0  1  0 10  2  4As we have noted in the other vignette, after the subsetting by
[, $ or Ranged-based operations, we need to save the new
VariantExperiment object to synchronize the gds file (on-disk)
associated with the subset of data (in-memory representation) before
any statistical analysis. Otherwise, an error will be returned.
As a feature addition, we want to add the option of VCFArray in
saving the assay data in the step of
makeVariantExperimentFromVCF. We also seek to implement the
SQLDataFrame in representation of the annotation data. We also
plan to connect Bioconductor package VariantAnnotation to
implement the variant filtering and annotation functions based on
VariantExperiment format, and with that, to develop a pipeline for
using VariantExperiment object as the basic data structure for
DNA-sequencing data analysis.
sessionInfo()## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VariantExperiment_1.0.0     DelayedDataFrame_1.2.0     
##  [3] GDSArray_1.6.0              gdsfmt_1.22.0              
##  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
##  [7] BiocParallel_1.20.0         matrixStats_0.55.0         
##  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
## [11] GenomeInfoDb_1.22.0         IRanges_2.20.0             
## [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
## [15] BiocStyle_2.14.0           
## 
## loaded via a namespace (and not attached):
##  [1] mitml_0.3-7            Rcpp_1.0.2             lattice_0.20-38       
##  [4] tidyr_1.0.0            Biostrings_2.54.0      assertthat_0.2.1      
##  [7] zeallot_0.1.0          digest_0.6.22          pan_1.6               
## [10] R6_2.4.0               jomo_2.6-10            backports_1.1.5       
## [13] evaluate_0.14          pillar_1.4.2           zlibbioc_1.32.0       
## [16] rlang_0.4.1            minqa_1.2.4            nloptr_1.2.1          
## [19] rpart_4.1-15           Matrix_1.2-17          rmarkdown_1.16        
## [22] splines_3.6.1          lme4_1.1-21            stringr_1.4.0         
## [25] GWASExactHW_1.01       RCurl_1.95-4.12        broom_0.5.2           
## [28] compiler_3.6.1         xfun_0.10              pkgconfig_2.0.3       
## [31] mgcv_1.8-30            htmltools_0.4.0        nnet_7.3-12           
## [34] tidyselect_0.2.5       tibble_2.1.3           GenomeInfoDbData_1.2.2
## [37] bookdown_0.14          crayon_1.3.4           dplyr_0.8.3           
## [40] MASS_7.3-51.4          bitops_1.0-6           grid_3.6.1            
## [43] nlme_3.1-141           lifecycle_0.1.0        magrittr_1.5          
## [46] SeqVarTools_1.24.0     stringi_1.4.3          XVector_0.26.0        
## [49] SNPRelate_1.20.0       mice_3.6.0             generics_0.0.2        
## [52] vctrs_0.2.0            boot_1.3-23            tools_3.6.1           
## [55] glue_1.3.1             purrr_0.3.3            survival_2.44-1.1     
## [58] yaml_2.2.0             SeqArray_1.26.0        BiocManager_1.30.9    
## [61] logistf_1.23           knitr_1.25