--- title: "CpG sets" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true # table of content true vignette: > %\VignetteIndexEntry{2. CpG sets} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- # CpG sets RaMWAS calculates CpG scores and performs further analyses at a set of CpGs (or locations in general) defined by the user via `filecpgset` parameter. The `filecpgset` parameter must point to an .rds file (a file saved using `saveRDS` function), with the set of locations stored as a `list` with one sorted vector of CpG locations per chromosome. ```{r CpGsetExample} cpgset = list( chr1 = c(12L, 57L, 123L), chr2 = c(45L, 95L, 99L, 111L), chr3 = c(22L, 40L, 199L, 211L)) ``` In practice, the set should depend on the reference genome and can include CpGs created by common SNPs. Optionally, parameter `filenoncpgset`, can point to a file storing vetted locations away from any CpGs. # Downloadable CpG sets Our CpG sets include all common CpGs that are identified by combining reference genome sequence data with SNP information as SNPs can often create or destroy CpGs in the reference. Our sets exclude CpGs with unreliable coverage estimates due to poor alignment (e.g. CpG in repetitive elements) as indicated by our *in silico* experiment ([details below](#insilico)). Code | Super\ Population | hg19 no\ QC | hg19 with\ QC | hg38 no\ QC | hg38 with\ QC :---:|:---|:---:|:---:|:---:|:---: ALL | All samples | [28.4M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ALL_hg19_MAF_0.01_chr1-22.rds "28,368,579 CpGs, 31.1 MB file") | [28.0M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ALL_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,985,891 CpGs, 30.8 MB file") | [29.5M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ALL_hg38_MAF_0.01_chr1-22.rds "29,469,696 CpGs, 32.3 MB file") | [27.8M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ALL_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,789,753 CpGs, 30.7 MB file") AFR | African | [28.7M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AFR_hg19_MAF_0.01_chr1-22.rds "28,697,922 CpGs, 31.5 MB file") | [28.3M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AFR_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "28,312,550 CpGs, 31.1 MB file") | [29.8M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AFR_hg38_MAF_0.01_chr1-22.rds "29,797,951 CpGs, 32.6 MB file") | [28.1M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AFR_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "28,108,517 CpGs, 31.0 MB file") AMR | Ad\ Mixed\ American | [28.1M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AMR_hg19_MAF_0.01_chr1-22.rds "28,083,940 CpGs, 30.9 MB file") | [27.7M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AMR_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,703,792 CpGs, 30.5 MB file") | [29.2M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AMR_hg38_MAF_0.01_chr1-22.rds "29,185,829 CpGs, 32.0 MB file") | [27.5M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_AMR_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,515,085 CpGs, 30.4 MB file") EAS | East Asian | [27.8M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EAS_hg19_MAF_0.01_chr1-22.rds "27,806,911 CpGs, 30.6 MB file") | [27.4M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EAS_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,429,290 CpGs, 30.2 MB file") | [28.9M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EAS_hg38_MAF_0.01_chr1-22.rds "28,909,399 CpGs, 31.8 MB file") | [27.2M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EAS_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,248,407 CpGs, 30.1 MB file") EUR | European | [27.9M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EUR_hg19_MAF_0.01_chr1-22.rds "27,924,665 CpGs, 30.7 MB file") | [27.5M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EUR_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,546,110 CpGs, 30.4 MB file") | [29.0M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EUR_hg38_MAF_0.01_chr1-22.rds "29,027,050 CpGs, 31.9 MB file") | [27.4M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_EUR_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,361,657 CpGs, 30.2 MB file") SAS | South Asian | [28.0M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_SAS_hg19_MAF_0.01_chr1-22.rds "27,979,088 CpGs, 30.8 MB file") | [27.6M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_SAS_hg19_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,599,943 CpGs, 30.4 MB file") | [29.1M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_SAS_hg38_MAF_0.01_chr1-22.rds "29,081,038 CpGs, 31.9 MB file") | [27.4M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_SAS_hg38_MAF_0.01_chr1-22_bowtie2_75bp.rds "27,413,926 CpGs, 30.3 MB file") --- | Reference\ Genome | [26.8M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ref_hg19_chr1-22.rds "26,752,702 CpGs, 29.6 MB file") | [26.4M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ref_hg19_chr1-22_bowtie2_75bp.rds "26,396,375 CpGs, 29.2 MB file") | [27.9M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ref_hg38_chr1-22.rds "27,852,739 CpGs, 30.7 MB file") | [27.1M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_ref_hg38_chr1-22_bowtie2_75bp.rds "27,083,315 CpGs, 30.0 MB file") Table: CpG sets for human genome (autosomes only). **Note:** SNPs were obtained from the 1000 Genomes super populations (Phase 3 data, [more info](http://www.internationalgenome.org/category/population/)). Only SNPs with minor allele frequency above 1% are included. *In silico* alignment experiments assumed 75 bp single-end reads and alignment with [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). Genome | No\ QC | With\ QC :---:|:---:|:---: GRCm38.p4 | [22.6M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_GRCm38.p4.rds "22,607,414 CpGs, 26.1 MB file") | [21.7M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_GRCm38.p4_bowtie2_75bp.rds "21,689,478 CpGs, 25.1 MB file") GRCm38.p5 | [22.7M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_GRCm38.p5.rds "22,651,390 CpGs, 26.1 MB file") | [21.7M](http://www.people.vcu.edu/~ashabalin/RaMWAS/cpgset_GRCm38.p5_bowtie2_75bp.rds "21,726,529 CpGs, 25.2 MB file") Table: CpG sets for mouse genome. # Constructing a custom CpG set ## Constructing a CpG set for a reference genome A CpG set can be constructed from a reference genome with the `getCpGsetCG` function. The functions can use any genome available in Bioconductor as `BSGenome` class. Additional genomes can be loaded using `readDNAStringSet` function from .fa files. ```{r loadPackages, echo=FALSE, warning=FALSE, message=FALSE} suppressPackageStartupMessages(library(ramwas)) suppressPackageStartupMessages(library(BSgenome.Ecoli.NCBI.20080805)) ``` ```{r cpgsFromGenome, warning=FALSE, message=FALSE} library(ramwas) library(BSgenome.Ecoli.NCBI.20080805) cpgset = getCpGsetCG(BSgenome.Ecoli.NCBI.20080805) # First 10 CpGs in NC_008253: print(cpgset$NC_008253[1:10]) ``` For a genome with injected SNPs, we provide the function `getCpGsetALL` for also finding CpGs that can be created by the SNPs. The example below uses all SNPs from dbSNP144 for listing CpGs in human genome. We do NOT advice using all dbSNP144 SNPs, as it causes a large number of CpGs that almost never occur in the population. ```{r getCpGsetALL1, eval=FALSE, warning=FALSE, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) library(SNPlocs.Hsapiens.dbSNP144.GRCh37) genome = injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP144.GRCh37") cpgset = getCpGsetALL(genome) # Number of CpGs with all SNPs injected in autosomes sum(sapply(cpgset[1:22], length)) ```{r echo=FALSE} 42841152 ``` The code above shows that using all dbSNP144 SNPs we get over 42 million CpGs instead of about 29 million when using only SNPs with minor allele frequency above 1%. In outbred population such as humans it's reasonable to ignore rare CpG-SNPs because they would have low power to detect associations. To exclude rare CpG-SNPs, we need allele frequency information. Unfortunately, (to our knowledge) Bioconductor packages with SNP information do not contain SNP allele frequencies. To alleviate this problem, we provide a way to inject SNP information from 1000 Genomes data or any other VCF. First, the VCF files, obtained from the 1000 Genomes project (or other sources), need to be processed by [`vcftools`](https://vcftools.github.io/man_latest.html) command `--counts`. Note that `vcftools` is an independent software, not part of RaMWAS. > `vcftools --gzvcf ALL.chr22.phase3.vcf.gz \` > ` --counts \` > ` --out count_ALL_chr22.txt` RaMWAS provides the function `injectSNPsMAF` to read in the generated allele count files, select common SNPs, and inject them in the reference genome. Here we apply it to chromosome 22. ```{r getCpGsetALL2, eval=FALSE} genome[["chr22"]] = injectSNPsMAF( gensequence = BSGenome[["chr22"]], frqcount = "count_ALL_chr22.txt", MAF = 0.01) # Find the CpGs cpgset = getCpGsetALL(genome) ``` Once a CpG set is generated, it can be saved with `saveRDS` function for use by RaMWAS. ```{r save1, eval=FALSE} saveRDS(file = "My_cpgset.rds", object = cpgset) ``` ## *In silico* alignment experiment CpG sites in loci that are problematic in terms of alignment need to be eliminated prior to analysis as CpG score estimates will be confounded with alignment errors. For example, repetitive elements constitute about 45% of the human genome. Reads may be difficult to align to these loci because of their high sequence similarity. To identify problematic sites we conduct an *in silico* experiment. The pre-computed CpG sets for human genome in this vignette are prepared for 75 bp single end reads. In the *in silico* experiment we first generate all possible 75 bp single-end reads from the forward strand of the reference. It starts with the read from position 1 to 75 on chromosome 1 of the reference. Next read spans positions 2 to 76, etc. In the perfect scenario, aligning these reads to the reference genome they originated from should cause each CpG to be covered by 75 reads. We excluded CpG sites with read coverage deviating from 75 by more than 10. For a typical mammalian genome the *in silico* experiment is computationally intensive, as it requires alignment of billions of artificially created reads. RaMWAS supports *in silico* experiments with the function `insilicoFASTQ` for creating artificial reads from the reference genome. The function supports gz compression of the output files, decreasing disk space requirement for human genome from about 500 GB to 17 GB. Here is how `insilicoFASTQ` function ```{r insilicoFASTQ, eval=FALSE} # Do for all chromosomes insilicoFASTQ( con="chr1.fastq.gz", gensequence = BSGenome[["chr1"]], fraglength=75) ``` The generated FASTQ files are then aligned to the reference genome. Taking Bowtie2 as an example: > `bowtie2 --local \` > ` --threads 14 \` > ` --reorder \` > ` -x bowtie2ind \` > ` -U chr1.fastq.gz | samtools view -bS -o chr1.bam` The generated BAMs are then scanned with RaMWAS and the coverage for one sample combining all the BAMs is calculated: ```{r RaMWAS, eval=FALSE} library(ramwas) chrset = paste0("chr",1:22) targetcov = 75 covtolerance = 10 param = ramwasParameters( dirproject = ".", dirbam = "./bams", dirfilter = TRUE, bamnames = chrset, bam2sample = list(all_samples = chrset), scoretag = "AS", minscore = 100, minfragmentsize = targetcov, maxfragmentsize = targetcov, minavgcpgcoverage = 0, minnonzerosamples = 0, # filecpgset - file with the CpG set being QC-ed filecpgset = filecpgset ) param1 = parameterPreprocess(param) ramwas1scanBams(param) ramwas3normalizedCoverage(param) ``` The following code then filters CpGs by the *in silico* coverage ```{r filter, eval=FALSE} # Preprocess parameters to learn the location of coverage matrix param1 = parameterPreprocess(param) # Load the coverage matrix (vector) cover = fm.load( paste0(param1$dircoveragenorm, "/Coverage")) # split the coverage by chromosomes # `cpgset` - the CpG set being QC-ed fac = rep(seq_along(cpgset), times = sapply(cpgset, length)) levels(fac) = names(cpgset) class(fac) = "factor" cover = split(cover, fac) # filter CpGs on each chromosome by the coverage cpgsetQC = cpgset for( i in seq_along(cpgset) ){ keep = (cover[[i]] >= (targetcov - covtolerance)) & (cover[[i]] <= (targetcov + covtolerance)) cpgsetQC[[i]] = cpgset[[i]][ keep ] } ``` Once the desired CpG set is generated, it can be saved with `saveRDS` function for use by RaMWAS. ```{r save2, eval=FALSE} saveRDS(file = "My_cpgset_QC.rds", object = cpgsetQC) ```