Abstract

multidog() genotypes many SNPs by running flexdog() on each SNP in turn, with optional support for parallel computing. The genotyping method is described in @gerard2018genotyping and @gerard2019priors.

Analysis

Let's load updog and the data from @uitdewilligen2013next.

library(updog)
data("uitdewilligen")

uitdewilligen$refmat is a matrix of reference counts, while uitdewilligen$sizemat is a matrix of total read counts. In these data, the rows index the individuals and the columns index the loci. multidog() expects the opposite orientation (loci in the rows and individuals in the columns), so we transpose both matrices.

refmat  <- t(uitdewilligen$refmat)
sizemat <- t(uitdewilligen$sizemat)
ploidy  <- uitdewilligen$ploidy
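
As a minimal sketch of the orientation change, the toy matrix below (invented counts and hypothetical names, not part of updog or the uitdewilligen data) shows that t() swaps rows and columns while carrying the dimnames along:

```r
# Toy matrix in the original orientation: individuals in rows, loci in columns.
# All names and counts are invented for illustration.
toy_ref <- matrix(
  c(10, 3,
     8, 0,
     5, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), c("snpA", "snpB"))
)

# t() swaps the dimensions and keeps the dimnames attached,
# giving loci in rows and individuals in columns.
toy_ref_t <- t(toy_ref)

dim(toy_ref)         # individuals by loci
dim(toy_ref_t)       # loci by individuals
rownames(toy_ref_t)  # locus names are now the row names
```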

sizemat and refmat should have the same row and column names. These names identify the loci and the individuals.

setdiff(colnames(sizemat), colnames(refmat))
#> character(0)
setdiff(rownames(sizemat), rownames(refmat))
#> character(0)
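
Note that setdiff() is one-directional and ignores order, so it only reports names present in the first argument and missing from the second. If you also want to confirm that the names appear in the same order, a stricter check can be sketched as follows. The helper same_dimnames() and the toy matrices are hypothetical, written here for illustration only:

```r
# Hypothetical helper: TRUE only if row and column names match exactly,
# including their order.
same_dimnames <- function(a, b) {
  identical(rownames(a), rownames(b)) &&
    identical(colnames(a), colnames(b))
}

# Toy matrices with invented names.
a <- matrix(1:4, nrow = 2,
            dimnames = list(c("snpA", "snpB"), c("ind1", "ind2")))
b <- a[, c("ind2", "ind1")]  # same names, columns reordered

setdiff(colnames(a), colnames(b))  # empty: the sets of names agree
same_dimnames(a, a)                # names and order agree
same_dimnames(a, b)                # order differs, so this check fails
```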

If we want to use parallel computing, we should first check how many cores are available:

parallel::detectCores()
#> [1] 16
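
One way to turn this count into a value for the nc argument is sketched below. The policy (leave one core free, cap at two) and the variable names are assumptions chosen for illustration, not a recommendation from the package:

```r
# detectCores() can return NA on some platforms, so fall back to 1.
n_avail <- parallel::detectCores()
if (is.na(n_avail)) n_avail <- 1L

# Hypothetical policy: leave one core free, cap at 2, never go below 1.
nc_to_use <- max(1L, min(2L, n_avail - 1L))
nc_to_use
```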

Now let's run multidog(), using two cores (nc = 2):

mout <- multidog(refmat = refmat, 
                 sizemat = sizemat, 
                 ploidy = ploidy, 
                 model = "norm",
                 nc = 2)

There is a plot method for the output of multidog(). The indices argument selects which SNPs to plot.

plot(mout, indices = c(1, 5, 100))
#> [[1]]

[Plot for SNP index 1]

#> 
#> [[2]]

[Plot for SNP index 5]

#> 
#> [[3]]

[Plot for SNP index 100]
The output of multidog() contains two data frames. The first, snpdf, contains properties of the SNPs, such as the estimated allele bias and the estimated sequencing error rate.

str(mout$snpdf)
#> 'data.frame':    100 obs. of  16 variables:
#>  $ snp     : Factor w/ 100 levels "PotVar0089524",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ bias    : num  0.519 1.026 0.929 1.221 0.847 ...
#>  $ seq     : num  0.00485 0.00221 0.002 0.0039 0.00206 ...
#>  $ od      : num  0.00304 0.00295 0.00337 0.00275 0.00335 ...
#>  $ prop_mis: num  0.004926 0.002274 0.000626 0.002718 0.003 ...
#>  $ num_iter: num  6 3 3 5 7 7 4 8 8 4 ...
#>  $ llike   : num  -14.7 -25.3 -10.4 -22.7 -32 ...
#>  $ ploidy  : num  4 4 4 4 4 4 4 4 4 4 ...
#>  $ model   : Factor w/ 1 level "norm": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Pr_0    : num  0.000279 0.248211 0.66369 0.015803 0.08409 ...
#>  $ Pr_1    : num  0.00707 0.45067 0.26892 0.06938 0.20154 ...
#>  $ Pr_2    : num  0.0745 0.2542 0.0597 0.1931 0.2968 ...
#>  $ Pr_3    : num  0.32604 0.04452 0.00725 0.34069 0.26844 ...
#>  $ Pr_4    : num  0.592065 0.002423 0.000482 0.381024 0.149179 ...
#>  $ mu      : num  4.18 1.01 -1 3.75 2.29 ...
#>  $ sigma   : num  1.067 0.925 1.289 1.481 1.433 ...
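
Because snpdf is an ordinary data frame, it can be subset with standard tools. The sketch below uses a toy data frame built from the first five bias and prop_mis values printed above, with placeholder SNP names. The 0.003 cutoff is an arbitrary, hypothetical threshold, not a recommended value:

```r
# Toy stand-in for mout$snpdf, using the first five printed values.
# SNP names are placeholders.
toy_snpdf <- data.frame(
  snp      = c("snp1", "snp2", "snp3", "snp4", "snp5"),
  bias     = c(0.519, 1.026, 0.929, 1.221, 0.847),
  prop_mis = c(0.004926, 0.002274, 0.000626, 0.002718, 0.003)
)

# Keep SNPs whose prop_mis value is below a hypothetical cutoff.
cutoff <- 0.003
keep <- toy_snpdf[toy_snpdf$prop_mis < cutoff, ]
keep$snp
```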

The second data frame, inddf, contains properties of each individual at each SNP, such as the estimated genotype (geno) and the posterior probability of being genotyped correctly (maxpostprob).

str(mout$inddf)
#> 'data.frame':    1000 obs. of  12 variables:
#>  $ snp        : Factor w/ 100 levels "PotVar0089524",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ind        : Factor w/ 10 levels "P1PEM10","P2PEM05",..: 8 4 3 10 7 2 1 5 9 6 ...
#>  $ ref        : num  122 113 86 80 69 85 130 228 60 211 ...
#>  $ size       : num  142 143 96 80 69 86 130 228 86 212 ...
#>  $ geno       : num  3 3 3 4 4 4 4 4 2 4 ...
#>  $ postmean   : num  3 2.99 3 4 4 ...
#>  $ maxpostprob: num  1 0.988 1 1 1 ...
#>  $ Pr_0       : num  3.74e-90 1.03e-78 2.21e-77 1.06e-86 8.21e-79 ...
#>  $ Pr_1       : num  7.97e-23 3.86e-16 2.61e-20 6.80e-30 1.21e-26 ...
#>  $ Pr_2       : num  4.94e-06 1.17e-02 3.27e-06 2.82e-14 1.01e-12 ...
#>  $ Pr_3       : num  1.00 9.88e-01 1.00 6.74e-06 2.75e-05 ...
#>  $ Pr_4       : num  1.45e-10 1.14e-15 3.56e-06 1.00 1.00 ...
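
To illustrate how the summary columns relate to the posterior probabilities Pr_0 through Pr_4, the sketch below recomputes them for the second row printed above. It assumes that geno is the posterior mode, maxpostprob is the probability of that mode, and postmean is the posterior mean genotype. This is a reconstruction from rounded printed values, not the package's own code:

```r
# Posterior probabilities for genotypes 0..4, copied from the second
# row of the printed output (rounded, so they do not sum exactly to 1).
pr <- c(1.03e-78, 3.86e-16, 1.17e-02, 9.88e-01, 1.14e-15)
genotypes <- 0:4

geno_hat  <- genotypes[which.max(pr)]  # posterior mode
max_pp    <- max(pr)                   # probability of the mode
post_mean <- sum(genotypes * pr)       # posterior mean genotype

geno_hat
max_pp
round(post_mean, 2)
```

These agree with the printed geno, maxpostprob, and postmean for that row, up to rounding.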

You can extract any column of inddf as a matrix, with SNPs in the rows and individuals in the columns, using format_multidog().

genomat <- format_multidog(mout, varname = "geno")
head(genomat)
#>               P1PEM10 P2PEM05 P2PEM10 P3PEM05 P4PEM01 P4PEM09 P5PEM04 P5PEM08
#> PotVar0089524       4       4       3       3       4       4       4       3
#> PotVar0052647       3       1       0       1       1       2       0       1
#> PotVar0120897       0       0       0       0       0       0       0       1
#> PotVar0066020       3       2       3       4       4       3       1       4
#> PotVar0003381       3       1       2       0       2       3       3       1
#> PotVar0131622       2       4       1       2       2       3       4       3
#>               P6PEM11 P7PEM09
#> PotVar0089524       2       4
#> PotVar0052647       1       1
#> PotVar0120897       2       1
#> PotVar0066020       4       2
#> PotVar0003381       4       3
#> PotVar0131622       3       3
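
Conceptually, this is a long-to-wide reshape of inddf. A rough base-R equivalent on toy data is sketched below. It uses tapply() rather than format_multidog(), the names and values are invented, and it is not how the package implements the function:

```r
# Toy long-format data: one row per (SNP, individual) pair.
toy_inddf <- data.frame(
  snp  = c("snpA", "snpA", "snpA", "snpB", "snpB", "snpB"),
  ind  = c("ind2", "ind1", "ind3", "ind1", "ind3", "ind2"),
  geno = c(4, 3, 2, 0, 1, 1)
)

# Spread geno into a SNP-by-individual matrix. Each (snp, ind) cell
# holds exactly one value, so taking the first element is enough.
toy_genomat <- tapply(
  toy_inddf$geno,
  list(toy_inddf$snp, toy_inddf$ind),
  FUN = function(x) x[1]
)
toy_genomat
```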

References