This case study analyses the HYE100 hybrid proteome samples from Navarro et al (2016). Two samples, A and B, were prepared by mixing human, yeast and E. coli tryptic peptides in specified proportions. The human peptides was obtained from HeLa cells, which is a cervical cancer cell line. Sample A was composed of 67% human, 30% yeast, and 3% E. coli, while sample B was composed of 67% human, 3% yeast, and 30% E. coli. The human proteins should not be differentially expressed between A and B, while yeast proteins should be 10-fold up-regulated in A, and E. coli proteins should be 10-fold up-regulated in B. Three technical replicates were prepared of A and B, to make a two group experiment with \(n=3\) samples in each group.
Demichev et al (2020) re-processed the mass spectrometry data using DIA-NN version 1, and the DIA-NN report file with the precursor quantifications can be downloaded from https://doi.org/10.17605/OSF.IO/6G3UX.
We use this data for two purposes. First, we use it as an example analysis using DIA-NN data. Second, we explore the data a little more to illustrate the advantages and limitations of a mixed-species data such as this.
The peptide precursor intensities are provided in a tab-delimited report file output by DIA-NN, which we can read into limpa using readDIANN():
> library(limpa)
Loading required package: limma
> y.peptide <- readDIANN("Fig2-diann.tsv", run.column="File.Name",
+ q.columns=c("Q.Value","Protein.Q.Value"),
+ extra.columns="Protein.Group", format="tsv")
> dim(y.peptide)
[1] 41919 6
The samples are in two groups.
> Replicate <- factor(c(1,1,2,2,3,3))
> Group <- factor(c("A","B","A","B","A","B"))
> colnames(y.peptide) <- paste(Group, Replicate, sep=".")
Add a species annotation column:
> PG <- y.peptide$genes$Protein.Group
> n <- nchar(PG)
> y.peptide$genes$Species <- substring(PG,n-4L,n)
> table(y.peptide$genes$Species)
ECOLI HUMAN YEAS8
12260 16882 12777
The detection probability slope is estimated to be about 0.6.
Here we estimate DPC from the complete-normal model using dpcCN because it is robust to very large fold-changes in the data.
> dpc <- dpcCN(y.peptide)
Iter 1: dpc = -2.3153 0.6263
Iter 2: dpc = -2.080 0.573
> plotDPC(dpc)
Quantify protein expression:
> y <- dpcQuant(y.peptide, dpc=dpc)
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Peptides: 9001
Proteins: 2000 Peptides: 17399
Proteins: 3000 Peptides: 24460
Proteins: 4000 Peptides: 29422
Proteins: 5000 Peptides: 36667
Proteins: 6000 Peptides: 40815
Proteins: 6430 Peptides: 41919
> y$genes$Protein.Group <- NULL
The A and B samples separate very clearly, showing that there is considerable DE in the data:
> plotMDSUsingSEs(y)
Now we form the design matrix and proceed to differential expression analysis. First, limpa identifies a clear variance trend that will be used in the empirical Bayes modelling:
> design <- model.matrix(~Group)
> fit <- dpcDE(y, design, plot=TRUE)
Differential expression analysis identifies Yeast and E. coli samples as highly signficant:
> fit <- eBayes(fit)
> summary(dt <- decideTests(fit[,2]))
GroupB
Down 1841
NotSig 2887
Up 1702
> topTable(fit, coef=2, n=10)
Species NPeptides PropObs logFC AveExpr t P.Value adj.P.Val B
1/tr|C8ZD01|C8ZD01_YEAS8 YEAS8 2 1.0000 -3.519 8.881 -119.10 2.693e-15 6.639e-12 25.87
1/sp|P0A7N9|RL33_ECOLI ECOLI 2 1.0000 3.286 8.466 115.31 3.568e-15 6.639e-12 25.59
1/tr|C8ZG50|C8ZG50_YEAS8 YEAS8 17 0.9216 -3.481 7.096 -115.30 3.571e-15 6.639e-12 25.58
1/sp|P0A910|OMPA_ECOLI ECOLI 25 0.8400 3.289 7.854 113.39 4.130e-15 6.639e-12 25.51
1/sp|P69908|DCEA_ECOLI ECOLI 2 1.0000 3.387 8.295 106.58 7.085e-15 9.111e-12 24.91
2/tr|C8ZF58|C8ZF58_YEAS8/tr|C8ZIC9|C8ZIC9_YEAS8 YEAS8 4 0.8750 -3.337 7.219 -103.76 8.942e-15 9.582e-12 24.77
1/sp|P02768|ALBU_HUMAN HUMAN 2 1.0000 -2.551 9.524 -96.68 1.654e-14 1.341e-11 24.19
1/sp|P0A8V2|RPOB_ECOLI ECOLI 78 0.7286 3.281 5.081 95.90 1.775e-14 1.341e-11 24.01
1/tr|C8Z3F1|C8Z3F1_YEAS8 YEAS8 53 0.7516 -3.611 6.288 -95.29 1.877e-14 1.341e-11 23.90
1/tr|C8Z4U5|C8Z4U5_YEAS8 YEAS8 9 0.9444 -3.434 6.856 -91.51 2.670e-14 1.500e-11 23.65
A mean-difference plot shows that limpa recovers correct fold-changes for most protein, even single-peptides proteins and those at very low intensities:
> plotMD(fit, coef=2, status=fit$genes$Species, main="B vs A")
> abline(h=log2(10), lty=2)
> abline(h=0, lty=2)
> abline(h=-log2(10), lty=2)
The plot becomes clearer if we restrict to proteins with two or more precursors:
> i <- which(fit$genes$NPeptides>1)
> plotMD(fit[i,], coef=2, status=fit$genes$Species[i], main="B vs A")
> abline(h=log2(10),lty=2)
> abline(h=0, lty=2)
> abline(h=-log2(10),lty=2)
One of the advantages of limpa is its ability to assess DE even for proteins that are entirely missing in one of the groups. We can compute the number of non-missing observations for each protein in each group:
> NObs.A <- rowSums(y$other$n.observations[,c(1,3,5)])
> NObs.B <- rowSums(y$other$n.observations[,c(2,4,6)])
For this data, 499 of the proteins that are significantly up are missing entirely in the A samples, and 801 of the proteins that are significantly down are missing entirely in the B samples:
> table(DE=dt, pmin(NObs.A,5))
DE 0 1 2 3 4 5
-1 0 4 13 248 37 1539
0 304 132 183 762 48 1458
1 499 85 74 178 38 828
> table(DE=dt, pmin(NObs.B,5))
DE 0 1 2 3 4 5
-1 801 91 77 195 40 637
0 380 126 156 730 48 1447
1 0 2 15 193 15 1477
As as example, we pick out most significant of the latter proteins:
> PValue <- fit$p.value[,2]
> PGB <- names(which.min(PValue[NObs.B==0]))
Plot the protein quantifications with standard errors:
> plotProtein(y, PGB, main=PGB)
Now plot the corresponding precursor log-intensities.
This yeast protein has 39 precursors, all of which are missing in all the B samples.
In plot, closed circles indicate observations and open circles indicate missing values.
For the purposes of the plot, the missing values are plotted 0.5 units below the minimum observation for the same precursor:
> plotPeptides(y.peptide[y.peptide$genes$Protein.Group==PGB,],
+ main=paste(PGB,"precursors"))
We did not do any extra normalization on top of that already done by DIA-NN. The human proteins seem equally distributed between samples, so no extra normalization seems necessary:
> isH <- which(y$genes$Species == "HUMAN")
> boxplot(as.data.frame(y$E[isH,]),
+ ylab="log2-Intensity", main="Human proteins")
As expected, the yeast proteins are lower in B samples:
> isY <- which(y$genes$Species == "YEAS8")
> boxplot(as.data.frame(y$E[isY,]),
+ ylab="log2-Intensity", main="Yeast proteins")
while the E. coli proteins are lower in A samples:
> isE <- which(y$genes$Species == "ECOLI")
> boxplot(as.data.frame(y$E[isE,]),
+ ylab="log2-Intensity", main="E coli proteins")
Now we explore the pattern of missing values at the precursor level:
> NMissing.A <- rowSums(is.na(y.peptide$E[,c(1,3,5)]))
> NMissing.B <- rowSums(is.na(y.peptide$E[,c(2,4,6)]))
> table(NMissing.A, y.peptide$genes$Species)
NMissing.A ECOLI HUMAN YEAS8
0 2851 15021 10281
1 895 1064 1484
2 1137 493 988
3 7377 304 24
> table(NMissing.B, y.peptide$genes$Species)
NMissing.B ECOLI HUMAN YEAS8
0 10240 14966 1818
1 1215 990 744
2 793 623 951
3 12 303 9264
We see that human precursors have similar numbers of missing values in A and B samples, while E. coli precursors have far more missing values in A and yeast precursors have far more missing values in B samples. This confirms that the missing values are intensity-dependent, because the missingness pattern correlates with global changes in precursor intensities.
This dataset was created by Navarro et al (2016) and Demichev et al (2020) in order to benchmark peptide quantication algorithms. The dataset is, however, different from real biological datasets in a number of ways that make it unsuitable for benchmarking differential expression statistical methods. One issue that is that the DE proteins and peptides have systematically lower intensities (and consequently more missing values) than the non-DE proteins and peptides. Another issue is that log-fold-changes are very large, while the variation between replicates is very small and not representative of biological variation. Another issue is that the dataset contains false positives, whereby some of the human proteins are clearly differentially expressed, although the experimental design should have ensured that they were not. A striking example of a DE human protein is albumin, for which both precursors are clearly down in the B samples:
> i <- grep("ALBU_HUMAN",y.peptide$genes$Protein.Group)
> plotPeptides(y.peptide[i,], main="ALBU_HUMAN precursors")
The fact that human albumin appears DE is an artefact of the dataset rather than an error of the statistical testing method.
Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17(1), 41-44.
Navarro P, Kuharev J, Gillet LC, Bernhardt OM, MacLean B, Rost HL, Tate SA, Tsou CC, Reiter L, Distler U, Rosenberger G, Perez-Riverol Y, Nesvizhskii AI, Aebersold R, Tenzer S (2016). A multicenter study benchmarks software tools for label-free proteome quantification. Nature Biotechnology, 34(11), 1130-1136.