Type: | Package |
Title: | Analyze High-Dimensional High-Throughput Dataset and Quality Control Single-Cell RNA-Seq |
Version: | 1.0.2 |
Date: | 2025-09-24 |
Maintainer: | Junhui Li <ljh.biostat@gmail.com> |
Description: | The advent of genomic technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., monitoring multiple changes in gene expression in genome-wide siRNA screens across many different cell types (E Robert McDonald 3rd (2017) <doi:10.1016/j.cell.2017.07.005> and Tsherniak A (2017) <doi:10.1016/j.cell.2017.06.010>) or single cell transcriptomics under different experimental conditions. We found that simple computational methods based on a single statistical criterion are no longer adequate for analyzing such multi-dimensional data. We herein introduce 'ZetaSuite', a statistical package initially designed to score hits from two-dimensional RNAi screens. We also illustrate a unique utility of 'ZetaSuite' in analyzing single cell transcriptomics to differentiate rare cells from damaged ones (Vento-Tormo R (2018) <doi:10.1038/s41586-018-0698-6>). 'ZetaSuite' performs the following steps: QC of input datasets, normalization using Z-transformation, Zeta score calculation, and hit selection based on a defined Screen Strength. |
BugReports: | https://github.com/JunhuiLi1017/ZetaSuite/issues |
Imports: | RColorBrewer, Rtsne, e1071, ggplot2, reshape2, gridExtra, mixtools, shinyjs, shinydashboard, shiny, plotly, DT |
License: | MIT + file LICENSE |
Depends: | R (≥ 2.10) |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown |
VignetteBuilder: | knitr |
Author: | Yajing Hao |
NeedsCompilation: | no |
Packaged: | 2025-09-24 20:22:23 UTC; lij11 |
Repository: | CRAN |
Date/Publication: | 2025-09-24 21:00:02 UTC |
Encoding: | UTF-8 |
Generate event coverage analysis and visualization for alternative splicing data.
Description
This function analyzes event coverage across Z-score thresholds and generates visualizations to compare positive and negative control samples. It calculates the proportion of readouts that exceed different Z-score thresholds for each gene, creating the foundation for zeta score calculations.
Usage
EventCoverage(ZscoreVal, negGene, posGene, binNum, combine = TRUE)
Arguments
ZscoreVal |
A matrix of Z-scores where rows represent genes and columns represent readouts/conditions. This is typically the output from the Zscore() function. |
negGene |
A data frame or matrix containing negative control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in ZscoreVal. |
posGene |
A data frame or matrix containing positive control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in ZscoreVal. |
binNum |
The number of bins to divide the Z-score range. Recommended value is 100. The function creates Z-score thresholds from the 0.00001 to 0.99999 quantiles of the data. |
combine |
Logical. Whether to combine the negative and positive Z-score ranges. Default is TRUE. When TRUE, uses symmetric ranges around zero; when FALSE, uses separate ranges for negative and positive values. |
Details
The function performs the following steps:
Determines Z-score thresholds based on data quantiles and binNum
For each gene and threshold, calculates the proportion of readouts that exceed (increase) or fall below (decrease) the threshold
Separates data into negative and positive control groups
Generates jitter plots comparing event coverage between control groups
Returns both data matrices and visualization plots
The event coverage matrices can be used as input for SVM analysis and zeta score calculations.
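The coverage calculation described in the steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper event_coverage(), the toy matrix, and the hand-picked thresholds are all hypothetical (the real function derives its thresholds from data quantiles and binNum).

```r
# Hypothetical helper, not part of ZetaSuite: for each gene (row) and each
# threshold, compute the fraction of readouts beyond that threshold.
event_coverage <- function(z, thresholds, direction = c("increase", "decrease")) {
  direction <- match.arg(direction)
  ec <- sapply(thresholds, function(t) {
    if (direction == "increase") rowMeans(z > t) else rowMeans(z < t)
  })
  # Force a genes x thresholds matrix even for a single gene or threshold
  matrix(ec, nrow = nrow(z),
         dimnames = list(rownames(z), as.character(thresholds)))
}

# Toy Z-score matrix: 2 genes x 4 readouts (made-up values)
z <- matrix(c(-3, -1, 0, 2,
              0.5, 1.5, 2.5, 3.5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), paste0("readout", 1:4)))
thresholds <- c(0, 1, 2)  # assumed thresholds for illustration

ec_increase <- event_coverage(z, thresholds, "increase")
ec_decrease <- event_coverage(z, -thresholds, "decrease")
print(ec_increase)
```

Each row of the result is one gene's coverage curve; a gene whose curve stays high as the threshold grows is affecting many readouts strongly.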
Value
A list containing two sublists:
ECdataList |
A list with the following components:
|
ECplotList |
A list with two ggplot objects:
|
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat, negGene)
ECList <- EventCoverage(ZscoreVal, negGene, posGene, binNum=100, combine=TRUE)
Determine optimal cutoff thresholds based on Screen Strength analysis.
Description
This function calculates optimal cutoff thresholds for identifying significant hits in high-throughput screening data using Screen Strength (SS) analysis. It evaluates the trade-off between sensitivity and specificity by calculating the ratio of apparent FDR to baseline FDR across different zeta score thresholds.
Usage
FDRcutoff(zetaData, negGene, posGene, nonExpGene, combine = FALSE)
Arguments
zetaData |
A data frame containing zeta scores calculated by the Zeta() function. Should have columns 'Zeta_D' and 'Zeta_I' representing decrease and increase direction scores, respectively. |
negGene |
A data frame or matrix containing negative control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in zetaData. |
posGene |
A data frame or matrix containing positive control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in zetaData. |
nonExpGene |
A data frame or matrix containing non-expressed gene/siRNA identifiers. These genes are used to estimate the baseline false discovery rate. The first column should contain gene/siRNA names that match the row names in zetaData. |
combine |
Logical. Whether to combine decrease and increase direction zeta scores. Default is FALSE. When TRUE, uses the sum of Zeta_D and Zeta_I; when FALSE, analyzes each direction separately. |
Details
The function performs the following analysis:
Categorizes genes into types: "Gene" (test genes), "Positive" (positive controls), "NS_mix" (negative controls), and "non_exp" (non-expressed genes)
Calculates baseline FDR (bFDR) as the proportion of non-expressed genes in the entire dataset
For each zeta score threshold, calculates apparent FDR (aFDR) as the proportion of non-expressed genes among hits
Computes Screen Strength: SS = 1 - (aFDR / bFDR)
Generates plots showing zeta score distributions and SS curves
Higher Screen Strength values indicate better separation between true hits and false positives. Users can select appropriate thresholds based on desired sensitivity/specificity trade-offs.
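The Screen Strength formula above can be illustrated with a short base R sketch. The helper screen_strength() and the toy vectors are hypothetical, and treating a gene as a hit when its zeta score is greater than or equal to the threshold is an assumption; the package may define hits differently.

```r
# Hypothetical helper, not part of ZetaSuite: SS = 1 - (aFDR / bFDR)
screen_strength <- function(zeta, is_nonexp, thresholds) {
  # Baseline FDR: share of non-expressed genes in the whole dataset
  bFDR <- mean(is_nonexp)
  sapply(thresholds, function(t) {
    hits <- zeta >= t                    # assumed hit definition
    if (!any(hits)) return(NA_real_)     # aFDR is undefined without hits
    aFDR <- mean(is_nonexp[hits])        # share of non-expressed genes among hits
    1 - aFDR / bFDR
  })
}

# Toy data: 8 genes, 4 of them non-expressed (made-up values)
zeta      <- c(0.1, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 1.2)
is_nonexp <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

ss <- screen_strength(zeta, is_nonexp, thresholds = c(0, 0.5, 1.0))
print(ss)
```

In this toy example SS rises from 0 (every gene called a hit, so aFDR equals bFDR) to 1 (no non-expressed gene remains among the hits) as the threshold increases.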
Value
A list containing:
FDR_cutOff |
A data frame with 6 columns:
|
plotList |
A list with two ggplot objects:
|
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(nonExpGene)
data(negGene)
data(posGene)
data(ZseqList)
data(countMat)
ZscoreVal <- Zscore(countMat, negGene)
zetaData <- Zeta(ZscoreVal, ZseqList, SVM=FALSE)
cutoffval <- FDRcutoff(zetaData, negGene, posGene, nonExpGene, combine=TRUE)
Perform quality control analysis for high-throughput screening data.
Description
This function performs comprehensive quality control analysis on high-throughput screening data to evaluate experimental design and data quality. It generates multiple diagnostic plots and calculates SSMD (Strictly Standardized Mean Difference) scores to assess the separation between positive and negative controls.
Usage
QC(countMat, negGene, posGene)
Arguments
countMat |
A matrix of raw count data where rows represent genes/siRNAs and columns represent readouts/conditions. The matrix should have row names corresponding to gene/siRNA identifiers. |
negGene |
A data frame or matrix containing negative control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in countMat. |
posGene |
A data frame or matrix containing positive control gene/siRNA identifiers. The first column should contain gene/siRNA names that match the row names in countMat. |
Details
The function performs the following quality control analyses:
Creates jitter plots to visualize score distributions across readouts
Performs t-SNE dimensionality reduction to assess global sample separation
Generates boxplots to compare score distributions between control groups
Calculates SSMD scores for each readout: \mathrm{SSMD} = (\mu_{pos} - \mu_{neg}) / \sqrt{\sigma_{pos}^2 + \sigma_{neg}^2}
Reports the percentage of readouts with |\mathrm{SSMD}| \ge 2 (considered high quality)
SSMD scores \ge 2 indicate good separation between positive and negative controls, suggesting high-quality readouts.
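As a rough illustration of the SSMD formula, the following base R sketch computes one score per readout from toy control matrices. The helper ssmd_per_readout() and the data are hypothetical, and the use of the sample variance (var) is an assumption; the package may estimate the variances differently.

```r
# Hypothetical helper, not part of ZetaSuite: SSMD per readout (column)
ssmd_per_readout <- function(pos_mat, neg_mat) {
  (colMeans(pos_mat) - colMeans(neg_mat)) /
    sqrt(apply(pos_mat, 2, var) + apply(neg_mat, 2, var))
}

# Toy control data: rows are wells, columns are readouts (made-up values)
pos_mat <- cbind(r1 = c(8, 10, 12), r2 = c(2, 3, 4))
neg_mat <- cbind(r1 = c(1, 2, 3),   r2 = c(1, 3, 5))

scores   <- ssmd_per_readout(pos_mat, neg_mat)
pct_high <- 100 * mean(abs(scores) >= 2)  # percentage of readouts with |SSMD| >= 2
print(scores)
print(pct_high)
```

Here readout r1 separates the controls well, while r2 has identical control means and therefore an SSMD of 0.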
Value
A list containing four diagnostic plots:
score_qc |
A jitter plot showing the distribution of raw scores across all readouts for positive and negative controls |
tSNE_QC |
A t-SNE plot showing the global separation of positive and negative control samples in 2D space |
QC_box |
Side-by-side boxplots showing the distribution of scores for positive and negative controls across all readouts |
QC_SSMD |
A density plot showing the distribution of SSMD scores across readouts, with a threshold line at SSMD=2 and the percentage of high-quality readouts displayed |
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
References
Laurens van der Maaten & Geoffrey Hinton: Visualizing Data using t-SNE. Journal of Machine Learning Research 2008, 9(2008):2579-2605.
Zhang XD: A pair of new statistical parameters for quality control in RNA interference high-throughput screening assays. Genomics 2007, 89:552-561.
Examples
data(countMat)
data(negGene)
data(posGene)
QC(countMat, negGene, posGene)
Generate SVM decision boundaries for positive and negative control separation.
Description
This function constructs Support Vector Machine (SVM) models to find optimal decision boundaries that separate positive and negative control samples in event coverage space. It uses radial kernel SVM to create non-linear decision boundaries for both decrease and increase directions.
Usage
SVM(ECdataList)
Arguments
ECdataList |
A list containing event coverage data from the EventCoverage() function. The list should contain:
|
Details
The function performs the following steps:
Prepares training data by combining positive and negative control event coverage data
Trains separate SVM models for decrease and increase directions using radial kernel
Uses pre-tuned hyperparameters (cost=20, gamma=3 for decrease; cost=50, gamma=2 for increase)
Generates prediction grids across the Z-score and event coverage space
Identifies decision boundary points where the SVM prediction changes from negative to positive
Returns the optimal threshold points for each Z-score bin
The resulting SVM curves can be used for background correction in zeta score calculations to improve the accuracy of hit identification.
Value
A list containing two data frames:
cutOffD |
A data frame with SVM decision boundary points for decrease direction. Each row contains a Z-score threshold and the corresponding event coverage threshold that separates positive and negative controls. |
cutOffI |
A data frame with SVM decision boundary points for increase direction. Each row contains a Z-score threshold and the corresponding event coverage threshold that separates positive and negative controls. |
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat, negGene)
ECdataList <- EventCoverage(ZscoreVal, negGene, posGene, binNum=10, combine=TRUE)
SVM(ECdataList)
The SVM curve lines in Zeta-plot.
Description
The SVM curves were calculated from raw input matrix files. They were designed to maximally separate the positive and negative genes.
Usage
data("SVMcurve")
Format
A data frame with 24 rows and 4 features. The first column contains the bin cut-offs for the decreased direction. The second column contains the percentage values at the cut-offs in column 1. The third column contains the bin cut-offs for the increased direction. The fourth column contains the percentage values at the cut-offs in column 3.
Details
This data frame is generated by SVM.R.
Examples
data(SVMcurve)
Calculation of zeta and weighted zeta score.
Description
This function calculates zeta scores for genes based on their Z-score profiles across different thresholds. The zeta score quantifies the regulatory effect of gene knockdown on alternative splicing events by measuring the area under the curve of event coverage across Z-score thresholds.
Usage
Zeta(ZscoreVal, ZseqList, SVMcurve = NULL, SVM = FALSE)
Arguments
ZscoreVal |
A matrix of Z-scores where rows represent genes and columns represent readouts/conditions. This is typically the output from the Zscore() function. |
ZseqList |
A list containing two vectors: 'Zseq_D' (decrease direction thresholds) and 'Zseq_I' (increase direction thresholds). These define the Z-score bins for calculating event coverage. |
SVMcurve |
Optional. A matrix containing SVM curve data for decrease and increase directions. Required only when SVM=TRUE. The matrix should have 4 columns: Z-score and coverage for decrease direction (columns 1-2), and Z-score and coverage for increase direction (columns 3-4). |
SVM |
Logical. Whether to use SVM curves for background correction. Default is FALSE. When TRUE, the function subtracts SVM-predicted background from the event coverage before calculating zeta scores. |
Details
The function calculates zeta scores as follows:
For each Z-score threshold, calculates the proportion of readouts that exceed (increase) or fall below (decrease) the threshold
Computes the area under the event coverage curve using trapezoidal integration
If SVM=TRUE, subtracts SVM-predicted background coverage before area calculation
Returns separate scores for decrease (Zeta_D) and increase (Zeta_I) directions
Higher zeta scores indicate stronger regulatory effects on alternative splicing.
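The area-under-the-curve step can be sketched for a single gene in base R. This is only an illustration of trapezoidal integration over an event coverage curve: the helper trapz(), the toy Z-scores, and the thresholds are hypothetical, and no SVM background correction is applied.

```r
# Hypothetical helper: trapezoidal area under the points (x, y)
trapz <- function(x, y) {
  n <- length(x)
  sum(diff(x) * (y[-n] + y[-1]) / 2)
}

# Toy Z-scores for one gene across 4 readouts (made-up values)
z_row      <- c(0.5, 1.5, 2.5, 3.5)
thresholds <- c(0, 1, 2, 3)  # assumed increase-direction thresholds

# Event coverage: fraction of readouts exceeding each threshold
coverage <- sapply(thresholds, function(t) mean(z_row > t))

# Increase-direction zeta score as the area under the coverage curve
zeta_i <- trapz(thresholds, coverage)
print(zeta_i)
```

The decrease direction follows the same pattern with negative thresholds and the comparison reversed.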
Value
A data frame with two columns:
Zeta_D |
Zeta score for decrease direction (exon skipping events) |
Zeta_I |
Zeta score for increase direction (exon inclusion events) |
Each row corresponds to a gene, and the zeta scores represent the cumulative regulatory effect across all Z-score thresholds.
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(ZseqList)
data(SVMcurve)
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat, negGene)
zetaData <- Zeta(ZscoreVal, ZseqList, SVM=FALSE)
Calculate zeta score for single cell RNA-seq quality control.
Description
This function evaluates the quality of cells detected in single-cell RNA-seq data by calculating a zeta score for each cell. The zeta score is based on the distribution of gene expression across different expression thresholds. A cutoff value is automatically determined using a two-component Gaussian mixture model to separate high-quality cells from low-quality or damaged cells.
Usage
ZetaSuitSC(countMatSC, binNum = 10, filter = TRUE)
Arguments
countMatSC |
A matrix of single-cell RNA-seq count data where rows represent cells and columns represent genes. |
binNum |
The number of bins for zeta score calculation. Default is 10. The function creates expression thresholds from 0 to the 80th percentile of non-zero expression values, divided into binNum intervals. |
filter |
Logical. Whether to filter out cells with total read counts less than 100. Default is TRUE. This helps remove extremely low-quality cells before analysis. |
Details
The function works as follows:
Filters cells based on total read count if filter=TRUE
Samples a subset of cells and genes for computational efficiency
Creates expression thresholds (bins) from 0 to the 80th percentile of non-zero expression values
For each cell, counts how many genes exceed each threshold
Calculates the zeta score as a weighted sum of these counts
Fits a two-component Gaussian mixture model to log10-transformed zeta scores
Determines an optimal cutoff to separate high-quality from low-quality cells
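The threshold construction and per-cell counting steps above can be sketched in base R. This sketch covers only those two steps under stated assumptions: the toy matrix is hypothetical, spacing the thresholds evenly with binNum + 1 points is a guess at how the intervals are laid out, and the weighting, mixture model, and cutoff steps are not reproduced.

```r
# Toy count matrix: rows are cells, columns are genes (made-up values)
counts <- rbind(cell1 = c(0, 1, 2, 5, 10),
                cell2 = c(0, 0, 0, 1, 0))
bin_num <- 4

# Upper bound: 80th percentile of the non-zero expression values
upper <- unname(quantile(counts[counts > 0], 0.8))

# Assumed layout: bin_num equal-width intervals from 0 to the upper bound
thresholds <- seq(0, upper, length.out = bin_num + 1)

# For each cell, count how many genes exceed each threshold
genes_above <- sapply(thresholds, function(t) rowSums(counts > t))
print(thresholds)
print(genes_above)
```

A cell with few detected genes, such as cell2 here, drops to zero after the first threshold, which is the kind of profile a low zeta score would reflect.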
Value
A list containing:
zetaData |
A data frame with two columns: 'Cell' (cell identifiers) and 'Zeta' (calculated zeta scores) |
p_cutoff |
A ggplot object showing the distribution of log10-transformed zeta scores with fitted Gaussian mixture components and the determined cutoff threshold |
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(countMatSC)
zetaDataSC <- ZetaSuitSC(countMatSC, binNum=50, filter=TRUE)
Launch ZetaSuite Shiny Application
Description
Launches the ZetaSuite Shiny web application for interactive analysis of high-throughput screening data and single-cell RNA-seq quality control.
Usage
ZetaSuiteApp(launch.browser = TRUE, port = NULL, host = "127.0.0.1")
Arguments
launch.browser |
Logical. Should the app launch in the default browser? Default is TRUE. |
port |
Integer. Port number for the Shiny app. Default is NULL (random port). |
host |
Character. Host address. Default is "127.0.0.1" (localhost). |
Details
The Shiny app provides a user-friendly interface for:
Quality Control Analysis
Z-score Normalization
Event Coverage Analysis
Zeta Score Calculation
SVM-based Background Correction
Screen Strength Analysis
Single Cell Quality Control
Interactive visualizations and data export
Value
Launches the Shiny application in a web browser.
Examples
## Not run:
# Launch the ZetaSuite Shiny app
ZetaSuiteApp()
# Launch without opening browser automatically
ZetaSuiteApp(launch.browser = FALSE)
# Launch on a specific port
ZetaSuiteApp(port = 3838)
## End(Not run)
Z-score normalization for high-throughput screening data.
Description
This function performs Z-score normalization on high-throughput screening data using negative control samples as reference. The Z-score transformation standardizes the data by centering and scaling each column (readout) based on the mean and standard deviation of negative control samples.
Usage
Zscore(countMat, negGene)
Arguments
countMat |
A matrix of raw count data where rows represent genes/siRNAs and columns represent readouts/conditions. The matrix should have row names corresponding to gene/siRNA identifiers. |
negGene |
A data frame or matrix containing negative control gene/siRNA identifiers. The first column should contain the gene/siRNA names that match the row names in countMat. |
Details
The function performs Z-score normalization as follows:
Extracts negative control samples from the input matrix using the identifiers provided in negGene
For each column (readout), calculates the mean and standard deviation using only the negative control samples
Applies Z-score transformation: Z_{ij} = (X_{ij} - \mu_{j}) / \sigma_{j}
where X_{ij} is the raw value for gene i in readout j, \mu_{j} is the mean of negative controls in readout j, and \sigma_{j} is the standard deviation of negative controls in readout j
This normalization allows for comparison across different readouts and identifies genes/siRNAs that show significant deviation from the negative control distribution.
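The normalization can be sketched in base R as follows. The helper zscore_by_neg() and the toy matrix are hypothetical and only mirror the formula above; the package's Zscore() takes negGene as a data frame or matrix rather than a character vector and may handle edge cases differently.

```r
# Hypothetical helper, not part of ZetaSuite: scale each readout (column)
# by the mean and standard deviation of the negative control rows.
zscore_by_neg <- function(mat, neg_ids) {
  neg   <- mat[rownames(mat) %in% neg_ids, , drop = FALSE]
  mu    <- colMeans(neg)       # per-readout mean of negative controls
  sigma <- apply(neg, 2, sd)   # per-readout SD of negative controls
  sweep(sweep(mat, 2, mu, "-"), 2, sigma, "/")
}

# Toy data: two negative control wells and one test gene (made-up values)
mat <- rbind(neg1 = c(1, 10), neg2 = c(3, 14), geneA = c(6, 12))
colnames(mat) <- c("r1", "r2")

z <- zscore_by_neg(mat, c("neg1", "neg2"))
print(z)
```

By construction the negative controls have mean 0 in every readout, so a test gene's value reads directly as its distance from the control distribution in standard deviations.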
Value
A Z-score normalized matrix with the same dimensions as the input countMat (excluding the Type column added during processing). Each value represents how many standard deviations away from the negative control mean that particular gene/readout combination is.
Author(s)
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
Examples
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat, negGene)
ZscoreVal[1:5, 1:5]
The bin size for Zeta calculation.
Description
A data frame with 11 different cut-offs and 2 directions. We divided the ranges of input values into bins. The number of bins is determined by the users.
Usage
data("ZseqList")
Format
A data frame with 11 different cut-offs and 2 directions.
Details
This data frame is generated by EventCoverage.R.
Examples
data(ZseqList)
Subsampled data from in-house HTS2 screening for global splicing regulators.
Description
A data frame with 1609 individual screened genes and 100 functional readouts. The data was generated from a siRNA screen for global splicing regulators. In this screen, we interrogated ~400 endogenous alternative splicing (AS) events by using an oligo ligation-based strategy to quantify 18,480 pools of siRNAs against annotated protein-coding genes in the human genome.
Usage
data("countMat")
Format
A data frame with 1609 observations on 100 variables. Each row represents a gene knocked down by a specific siRNA pool, and each column is an AS event. The values in the matrix are the processed fold-change values between the read counts of included and skipped exons.
Details
This data frame is the raw output data from large-scale screening.
Examples
data(countMat)
The cell x gene matrix from single-cell RNA-seq.
Description
A scRNA-seq dataset generated from placenta that has been analyzed with CellRanger and used to develop EmptyDrops. We subsampled the genes from the real dataset to generate the matrix.
Usage
data("countMatSC")
Format
A data frame with 1090 cells and 10000 genes. This is the subset of data obtained from single-cell RNAseq for package testing. Each row represents one cell detected in single-cell RNA-seq, each column is one gene in detected cells. The values in the matrix are the raw read counts from single-cell RNAseq.
Details
This data frame was generated from single-cell RNA-seq data.
Examples
data(countMatSC)
Input negative file.
Description
A data frame with 510 different well IDs in which the cells were treated with non-specific siRNAs. If users do not have built-in negative controls, the non-expressed genes should be provided here.
Usage
data("negGene")
Format
A data frame with 510 different well IDs in which the cells were treated with non-specific siRNAs. These wells served as negative controls.
Details
These wells were designed by the authors in the large-scale screen.
Examples
data(negGene)
Input internal negative control file.
Description
A data frame with 722 different well IDs in which the cells were treated with siRNAs targeting non-expressed genes in HeLa cells. It is a subset of the total non-expressed genes in HeLa cells.
Usage
data("nonExpGene")
Format
A data frame with 722 different well IDs in which the cells were treated with siRNAs targeting non-expressed genes in HeLa cells. These wells served as internal negative controls.
Details
These non-expressed genes can be obtained from a prior expression profile.
Examples
data(nonExpGene)
Input positive file.
Description
A data frame with 299 different well IDs in which the cells were treated with siRNAs targeting PTB. If users do not have built-in positive controls, choose the parameter -withoutsvm; the filename can then be any name, such as 'NA'.
Usage
data("posGene")
Format
A data frame with 299 different well IDs in which the cells were treated with siRNAs targeting PTB. These wells served as positive controls.
Details
These wells were designed by the authors in the large-scale screen.
Examples
data(posGene)