EthSEQ: Ethnicity Annotation from Whole-Exome and Targeted Sequencing Data

Alessandro Romanel, Davide Dalfovo

2023-03-18

Whole-exome sequencing (WES) and targeted sequencing (TS) are widely utilized both in translational cancer genomics studies and in the setting of precision medicine. Stratification of individuals’ ethnicity/ancestry is fundamental for the correct interpretation of personal genomic variation impact. We implemented EthSEQ to provide reliable and rapid ancestry annotation from whole -exome and targeted sequencing data. EthSEQ can be integrated into any WES or TS based processing pipeline and exploits multi-core capabilities.

EthSEQ requires genotype data at SNPs positions for a set of individuals with known ancestry (the reference model) and either a list of BAM files or genotype data (in VCF or GDS formats) of individuals with unknown ancestry (the target model). EthSEQ annotates the ancestry of each individual using an automated procedure and returns detailed information about individual’s inferred ancestry, including aggregated visual reports.


Perform ethnicity analysis with individuals genotype data from VCF file

Analysis of target individuals genotype data exploiting a reference model built from 1,000 Genome Project genotype data. Genotype data for 10,000 exonic SNPs are provided in input to EthSEQ in VCF format while the reference model is provided in GDS format and describes genotype data for 1,000 Genome Project individuals for the same SNPs set.

library(EthSEQ)

## Run the analysis
ethseq.Analysis(
  target.vcf = system.file("extdata", "Samples.HGDP.10000SNPs.vcf",package="EthSEQ"),
  model.gds = system.file("extdata", "Reference.Gencode.Exome.10000SNPs.gds",package="EthSEQ"),
  out.dir = file.path(tempdir(),"EthSEQ_Analysis/"),
  verbose=TRUE,
  cores =1,
  composite.model.call.rate = 1,
  space = "3D")
## [2023-03-18 08:55:27] Running EthSEQ
## [2023-03-18 08:55:27] Working directory: /tmp/RtmpsrWazj/EthSEQ_Analysis/
## [2023-03-18 08:55:27] Create /tmp/RtmpsrWazj/EthSEQ_Analysis/ folder
## [2023-03-18 08:55:27] Create target model from VCF
## [2023-03-18 08:55:27] Create aggregated model
## Merge SNP GDS files:
##     open '/tmp/RtmpsrWazj/EthSEQ_Analysis//Target.gds' ...
##         6 samples, 10000 SNPs
##     open '/tmp/RtmpfLfaod/Rinstec167d13c9ca/EthSEQ/extdata/Reference.Gencode.Exome.10000SNPs.gds' ...
##         619 samples, 10000 SNPs
## Concatenating samples (mapping to the first GDS file) ...
##     reference: 10000 SNPs (100.0%)
##     file 2: 586 allele flips, 0 ambiguous locus/loci
##         [no flip]: 9414
##         [flip]: 586
##     create '/tmp/RtmpsrWazj/EthSEQ_Analysis//Aggregated.gds': 625 samples, 10000 SNPs
##     FileFormat = SNP_ARRAY
##     writing genotypes ...
##     transposing the genotype matrix ...
## Clean up the fragments of GDS file:
##     open the file '/tmp/RtmpsrWazj/EthSEQ_Analysis//Aggregated.gds' (1.8M)
##     # of fragments: 35
##     save to '/tmp/RtmpsrWazj/EthSEQ_Analysis//Aggregated.gds.tmp'
##     rename '/tmp/RtmpsrWazj/EthSEQ_Analysis//Aggregated.gds.tmp' (1.1M, reduced: 718.8K)
##     # of fragments: 15
## Done.
## [2023-03-18 08:55:27] Perform PCA on aggregated model
## Principal Component Analysis (PCA) on genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 543 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: 0)
##     # of samples: 625
##     # of SNPs: 9,457
##     using 1 thread
##     # of principal components: 5
## PCA:    the sum of all selected genotypes (0,1,2) = 9939848
## CPU capabilities: Double-Precision SSE2
## Sat Mar 18 08:55:27 2023    (internal increment: 2244)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 1s
## Sat Mar 18 08:55:28 2023    Begin (eigenvalues and eigenvectors)
## Sat Mar 18 08:55:28 2023    Done.
## [2023-03-18 08:55:28] Infer ethnicities
## [2023-03-18 08:55:28] Print annotations
## [2023-03-18 08:55:28] Plot visual report
## [2023-03-18 08:55:28] Computation end
## [1] TRUE
## Load and display computed ethnicity annotations
ethseq.annotations = read.delim(file.path(tempdir(),"EthSEQ_Analysis/Report.txt"),
    sep="\t",as.is=TRUE,header=TRUE)
head(ethseq.annotations)
##   sample.id pop    type            contribution
## 1   Sample1 AFR  INSIDE                        
## 2   Sample2 AFR  INSIDE                        
## 3   Sample3 EUR  INSIDE                        
## 4   Sample4 EUR CLOSEST   EUR(89.3%)|AFR(10.7%)
## 5   Sample5 SAS CLOSEST EUR(28.21%)|SAS(71.79%)
## 6   Sample6 EAS  INSIDE
## Delete analysis folder
unlink(file.path(tempdir(),"EthSEQ_Analysis/"),recursive=TRUE)

Current version of EthSEQ manages only VCF files with the following format: - FORMAT column should contain “GT” - Only genotypes 0/0, 0/1, 1/1 and ./. are admitted - Only positions with single reference and single alternative base are admitted - No duplicate IDs are admitted (so no multiple variants with ID equal to “.”) - No duplicated sample names are admitted - No duplicated positions are admitted

Perform ethnicity analysis using pre-computed reference model

Analysis of target individuals genotype data using a reference model built from 1,000 Genome Project genotype data. Genotype data for 10,000 exonic SNPs are provided in input to EthSEQ in VCF format while reference model selected among the set of pre-computed reference models. Reference model Gencode.Exome is used considering hg38 human reference genome assembly and All populations (EUR, AFR, AMR, SAS, EAS). The complete list of reference models can be visualized using the function getModelsList().

library(EthSEQ)

## View all available reference models
getModelsList()

## Run the analysis
ethseq.Analysis(
  target.vcf = system.file("extdata", "Samples.HGDP.10000SNPs.vcf",package="EthSEQ"),
  model.available = "Gencode.Exome",
  model.assembly = "hg38",
  model.pop = "All",
  out.dir = file.path(tempdir(),"EthSEQ_Analysis/"),
  verbose=TRUE,
  cores =1,
  composite.model.call.rate = 1,
  space = "3D")

## Delete analysis folder
unlink(file.path(tempdir(),"EthSEQ_Analysis/"),recursive=TRUE)

Perform ethnicity analysis from BAM files list

Analysis of individual NA07357 from 1,000 Genome Project using a reference model built from 1,000 Genome Project individual’s genotype data. Genotype data for 10,000 SNPs included in Agilent Sure Select v2 captured regions are provided in input to EthSEQ with a BAM file. reference model is provided in GDS format and describes genotype data for 1,000 Genome Project individuls for the same SNPs set. Note than the BAM given in input to EthSEQ is a toy BAM file containing only reads overlapping the positions of the 10,000 SNPs considered in the analysis.

library(EthSEQ)

## Download BAM file used in the analysis
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP00228.sub_GRCh38.bam",
              destfile = file.path(tempdir(),"HGDP00228.sub_GRCh38.bam"))
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP00228.sub_GRCh38.bam.bai",
              destfile = file.path(tempdir(),"HGDP00228.sub_GRCh38.bam.bai"))
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP01200.sub_GRCh38.bam",
              destfile = file.path(tempdir(),"HGDP01200.sub_GRCh38.bam"))
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP01200.sub_GRCh38.bam.bai",
              destfile = file.path(tempdir(),"HGDP01200.sub_GRCh38.bam.bai"))
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP01201.sub_GRCh38.bam",
              destfile = file.path(tempdir(),"HGDP01201.sub_GRCh38.bam"))
download.file("https://github.com/cibiobcg/EthSEQ_Data/raw/master/BAM/HGDP01201.sub_GRCh38.bam.bai",
              destfile = file.path(tempdir(),"HGDP01201.sub_GRCh38.bam.bai"))

## Create BAM files list 
write(c(file.path(tempdir(),"HGDP00228.sub_GRCh38.bam"),
        file.path(tempdir(),"HGDP01200.sub_GRCh38.bam"),
        file.path(tempdir(),"HGDP01201.sub_GRCh38.bam")),
        file.path(tempdir(),"BAMs_List.txt"))

## Run the analysis
ethseq.Analysis(
  bam.list = file.path(tempdir(),"BAMs_List.txt"),
  model.available = "Gencode.Exome",
  out.dir = file.path(tempdir(),"EthSEQ_Analysis/"),
  verbose = TRUE,
  cores = 1,
  aseq.path = file.path(tempdir(),"EthSEQ_Analysis/"),
  run.genotype = TRUE,
  mbq = 20,
  mrq = 20,
  mdc = 10,
  composite.model.call.rate = 1,
  space = "3D",
  bam.chr.encoding = TRUE) # chromosome names encoded without "chr" prefix in BAM files

## Delete analysis folder
unlink(file.path(tempdir(),"EthSEQ_Analysis/"),recursive=TRUE)

Perform ethnicity analysis using multi-step refinement

Multi-step refinement analysis using a pre-computed reference model. Genotype data for 10,000 exonic SNPs are provided in input to EthSEQ in VCF format. Multi-step refinement tree is constructed as a matrix. Non-empty cells in columns i contains parent nodes for non-empty cells in columns i+1. Ancestry groups in child nodes should be included in parent nodes, while siblings node ancestry groups should be disjoint. Consult EthSEQ papers for more details.

library(EthSEQ)

## Create multi-step refinement matrix
m = matrix("",ncol=2,nrow=2)
m[1,1] = "EUR|AFR|AMR"
m[2,2] = "EUR|AMR"

## Run the analysis on a toy example with only 10000 SNPs
ethseq.Analysis(
  target.vcf = system.file("extdata","Samples.HGDP.10000SNPs.vcf",package="EthSEQ"),
  out.dir = file.path(tempdir(),"EthSEQ_Analysis/"),
  model.available = "Gencode.Exome",
  verbose = TRUE,
  refinement.analysis = m,
  composite.model.call.rate = 1,
  space = "3D")

## Delete analysis folder
unlink(file.path(tempdir(),"EthSEQ_Analysis/"),recursive=TRUE)

Create a reference model from multiple VCF genotype data files

Construction of a reference model from two genotype data files in VCF format and a corresponding annotation files which described ancestry and sex of each sample contained in the genotype data files.

library(EthSEQ)

### Load list of VCF files paths
vcf.files = c(system.file("extdata","RefSample1.vcf", package="EthSEQ"),
                system.file("extdata","RefSample2.vcf", package="EthSEQ"))

### Load samples annotations
annot.samples = read.delim(system.file("extdata","Annotations_Test_v3.txt",package="EthSEQ"))

### Create reference model
ethseq.RM(
  vcf.fn = vcf.files,
  annotations = annot.samples,
  out.dir = file.path(tempdir(),"EthSEQ_Analysis/"),
  model.name = "Reference.Model",
  bed.fn = NA,
  call.rate = 1,
  cores = 1)

## Delete example file
unlink(file.path(tempdir(),"EthSEQ_Analysis/"),recursive=TRUE)