This vignette is a condensed version of the documentation pages on the Cicero website. Please check out the website for more details.
The main purpose of Cicero is to use single-cell chromatin accessibility data to predict regions of the genome that are more likely to be in physical proximity in the nucleus. This can be used to identify putative enhancer-promoter pairs, and to get a sense of the overall stucture of the cis-architecture of a genomic region.
Because of the sparsity of single-cell data, cells must be aggregated by similarity to allow robust correction for various technical factors in the data.
Ultimately, Cicero provides a “Cicero co-accessibility” score between -1 and 1 between each pair of accessible peaks within a user defined distance where a higher score indicates higher co-accessibility.
In addition, the Cicero package provides an extension toolkit for analyzing single-cell ATAC-seq experiments using the framework provided by the single-cell RNA-seq analysis software, Monocle. This vignette provides an overview of a single-cell ATAC-Seq analysis workflow with Cicero. For further information and more options, please see the manual pages for the Cicero R package, the Cicero website and our publications.
Cicero can help you perform two main types of analysis:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
 BiocManager::install("cicero")Or install the development version of the package from Github.
BiocManager::install(cole-trapnell-lab/cicero) library(cicero)Cicero holds data in objects of the CellDataSet (CDS) class. The class is derived from the Bioconductor ExpressionSet class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. Monocle provides detailed documentation about how to generate an input CDS here.
To modify the CDS object to hold chromatin accessibility rather than expression data, Cicero uses peaks as its feature data fData rather than genes or transcripts. Specifically, many Cicero functions require peak information in the form chr1_10390134_10391134. For example, an input fData table might look like this:
| site_name | chromosome | bp1 | bp2 | |
|---|---|---|---|---|
| chr10_100002625_100002940 | chr10_100002625_100002940 | 10 | 100002625 | 100002940 | 
| chr10_100006458_100007593 | chr10_100006458_100007593 | 10 | 100006458 | 100007593 | 
| chr10_100011280_100011780 | chr10_100011280_100011780 | 10 | 100011280 | 100011780 | 
| chr10_100013372_100013596 | chr10_100013372_100013596 | 10 | 100013372 | 100013596 | 
| chr10_100015079_100015428 | chr10_100015079_100015428 | 10 | 100015079 | 100015428 | 
The Cicero package includes a small dataset called cicero_data as an example.
data(cicero_data)For convenience, Cicero includes a function called make_atac_cds. This function takes as input a data.frame or a path to a file in a sparse matrix format. Specifically, this file should be a tab-delimited text file with three columns. The first column is the peak coordinates in the form “chr10_100013372_100013596”, the second column is the cell name, and the third column is an integer that represents the number of reads from that cell overlapping that peak. The file should not have a header line.
For example:
| chr10_100002625_100002940 | cell1 | 1 | 
| chr10_100006458_100007593 | cell2 | 2 | 
| chr10_100006458_100007593 | cell3 | 1 | 
| chr10_100013372_100013596 | cell2 | 1 | 
| chr10_100015079_100015428 | cell4 | 3 | 
The output of make_atac_cds is a valid CDS object ready to be input into downstream Cicero functions.
input_cds <- make_atac_cds(cicero_data, binarize = TRUE)Because single-cell chromatin accessibility data is extremely sparse, accurate estimation of co-accessibility scores requires us to aggregate similar cells to create more dense count data. Cicero does this using a k-nearest-neighbors approach which creates overlapping sets of cells. Cicero constructs these sets based on a reduced dimension coordinate map of cell similarity, for example, from a tSNE or DDRTree map.
You can use any dimensionality reduction method to base your aggregated CDS on. We will show you how to create two versions, tSNE and DDRTree (below). Both of these dimensionality reduction methods are available from Monocle (and loaded by Cicero).
Once you have your reduced dimension coordinate map, you can use the function make_cicero_cds to create your aggregated CDS object. The input to make_cicero_cds is your input CDS object, and your reduced dimension coordinate map. The reduced dimension map reduced_coordinates should be in the form of a data.frame or a matrix where the row names match the cell IDs from the pData table of your CDS. The columns of reduced_coordinates should be the coordinates of the reduced dimension object, for example:
| ddrtree_coord1 | ddrtree_coord2 | |
|---|---|---|
| cell1 | -0.7084047 | -0.7232994 | 
| cell2 | -4.4767964 | 0.8237284 | 
| cell3 | 1.4870098 | -0.4723493 | 
Here is an example of both dimensionality reduction and creation of a Cicero CDS. Using Monocle as a guide, we first find tSNE coordinates for our input_cds:
set.seed(2017)
input_cds <- detectGenes(input_cds)
input_cds <- estimateSizeFactors(input_cds)
input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6,
                        reduction_method = 'tSNE', norm_method = "none")For more information on the above code, see the Monocle website section on clustering cells.
Next, we access the tSNE coordinates from the input CDS object where they are stored by Monocle and run make_cicero_cds:
tsne_coords <- t(reducedDimA(input_cds))
row.names(tsne_coords) <- row.names(pData(input_cds))
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords)
#> Overlap QC metrics:
#> Cells per bin: 50
#> Maximum shared cells bin-bin: 44
#> Mean shared cells bin-bin: 12.5285714285714
#> Median shared cells bin-bin: 1
#> Warning in make_cicero_cds(input_cds, reduced_coordinates = tsne_coords): On
#> average, more than 10% of cells are shared between paired bins.
#> Warning in if (isSparseMatrix(counts)) {: the condition has length > 1 and only
#> the first element will be usedThe main function of the Cicero package is to estimate the co-accessiblity of sites in the genome in order to predict cis-regulatory interactions. There are two ways to get this information:
The easiest way to get Cicero co-accessibility scores is to run run_cicero. To run run_cicero, you need a cicero CDS object (created above) and a genome coordinates file, which contains the lengths of each of the chromosomes in your organism. The human hg19 coordinates are included with the package and can be accessed with data(“human.hg19.genome”). Here is an example call, continuing with our example data:
data("human.hg19.genome")
sample_genome <- subset(human.hg19.genome, V1 == "chr18")
sample_genome$V2[1] <- 10000000
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2) # Takes a few minutes to run
#> [1] "Starting Cicero"
#> [1] "Calculating distance_parameter value"
#> [1] "Running models"
#> [1] "Assembling connections"
#> [1] "Successful cicero models:  39"
#> [1] "Other models: "
#> 
#> Zero or one element in range 
#>                            1 
#> [1] "Models with errors:  0"
#> [1] "Done"
head(conns)
#>                     Peak1                 Peak2    coaccess
#> 1 chr18_10006196_10006822 chr18_9755702_9755970  0.00000000
#> 2 chr18_10006196_10006822 chr18_9756925_9757590 -0.02171226
#> 3 chr18_10006196_10006822 chr18_9771216_9771842 -0.05837012
#> 4 chr18_10006196_10006822 chr18_9781976_9782901 -0.17017477
#> 5 chr18_10006196_10006822 chr18_9784605_9785105  0.00000000
#> 6 chr18_10006196_10006822 chr18_9787597_9788029  0.00000000The Cicero package includes a general plotting function for visualizing co-accessibility called plot_connections. This function uses the Gviz framework for plotting genome browser-style plots. We have adapted a function from the Sushi R package for mapping connections. plot_connections has many options, some detailed in the Advanced Visualization section on the Cicero website, but to get a basic plot from your co-accessibility table is quite simple:
data(gene_annotation_sample)
plot_connections(conns, "chr18", 8575097, 8839855, 
                 gene_model = gene_annotation_sample, 
                 coaccess_cutoff = .25, 
                 connection_width = .5, 
                 collapseTranscripts = "longest" )Often, it is useful to compare Cicero connections to other datasets with similar kinds of links. For example, you might want to compare the output of Cicero to ChIA-PET ligations. To do this, Cicero includes a function called compare_connections. This function takes as input two data frames of connection pairs, conns1 and conns2, and returns a logical vector of which connections from conns1 are found in conns2. The comparison in this function is conducted using the GenomicRanges package, and uses the max_gap argument from that package to allow slop in the comparisons.
For example, if we wanted to compare our Cicero predictions to a set of (made-up) ChIA-PET connections, we could run:
chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                    "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                    "chr18_10600_10700"))
conns$in_chia <- compare_connections(conns, chia_conns)You may find that this overlap is too strict when comparing completely distinct datasets. Looking carefully, the 2nd line of the ChIA-PET data matches fairly closely to the last line shown of conns. The difference is only ~67 base pairs, which could be a matter of peak-calling. This is where the max_gap parameter can be useful:
conns$in_chia_100 <- compare_connections(conns, chia_conns, maxgap=100)
head(conns)
#>                     Peak1                 Peak2    coaccess in_chia in_chia_100
#> 1 chr18_10006196_10006822 chr18_9755702_9755970  0.00000000   FALSE       FALSE
#> 2 chr18_10006196_10006822 chr18_9756925_9757590 -0.02171226   FALSE       FALSE
#> 3 chr18_10006196_10006822 chr18_9771216_9771842 -0.05837012   FALSE       FALSE
#> 4 chr18_10006196_10006822 chr18_9781976_9782901 -0.17017477   FALSE       FALSE
#> 5 chr18_10006196_10006822 chr18_9784605_9785105  0.00000000   FALSE       FALSE
#> 6 chr18_10006196_10006822 chr18_9787597_9788029  0.00000000   FALSE       FALSEIn addition, Cicero’s plotting function has a way to compare datasets visually. To do this, use the comparison_track argument. The comparison data frame must include a third columns beyond the first two peak columns called “coaccess”. This is how the plotting function determines the height of the plotted connections. This could be a quantitative measure, like the number of ligations in ChIA-PET, or simply a column of 1s. More info on plotting options in manual pages ?plot_connections and in the Advanced Visualization section of the Cicero website.
# Add a column of 1s called "coaccess"
chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                    "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                    "chr18_10600_10700"),
                          coaccess = c(1, 1, 1))
plot_connections(conns, "chr18", 10000, 112367, 
                 gene_model = gene_annotation_sample, 
                 coaccess_cutoff = 0,
                 connection_width = .5,
                 comparison_track = chia_conns,
                 include_axis_track = FALSE,
                 collapseTranscripts = "longest") In addition to pairwise co-accessibility scores, Cicero also has a function to find Cis-Co-accessibility Networks (CCANs), which are modules of sites that are highly co-accessible with one another. We use the Louvain community detection algorithm (Blondel et al., 2008) to find clusters of sites that tend to be co-accessible. The function generate_ccans takes as input a connection data frame and outputs a data frame with CCAN assignments for each input peak. Sites not included in the output data frame were not assigned a CCAN.
The function generate_ccans has one optional input called coaccess_cutoff_override. When coaccess_cutoff_override is NULL, the function will determine and report an appropriate co-accessibility score cutoff value for CCAN generation based on the number of overall CCANs at varying cutoffs. You can also set coaccess_cutoff_override to be a numeric between 0 and 1, to override the cutoff-finding part of the function. This option is useful if you feel that the cutoff found automatically was too strict or loose, or for speed if you are rerunning the code and know what the cutoff will be, since the cutoff finding procedure can be slow.
CCAN_assigns <- generate_ccans(conns)
#> [1] "Coaccessibility cutoff used: 0.63"
head(CCAN_assigns)
#>                                            Peak CCAN
#> chr18_10006196_10006822 chr18_10006196_10006822    1
#> chr18_10100928_10101986 chr18_10100928_10101986    1
#> chr18_10174163_10174396 chr18_10174163_10174396    1
#> chr18_11604_13986             chr18_11604_13986    6
#> chr18_157883_158536         chr18_157883_158536    6
#> chr18_2049865_2050884     chr18_2049865_2050884    7We have found that often, accessibility at promoters is a poor predictor of gene expression. However, using Cicero links, we are able to get a better sense of the overall accessibility of a promoter and it’s associated distal sites. This combined score of regional accessibility has a better concordance with gene expression. We call this score the Cicero gene activity score, and it is calculated using two functions.
The initial function is called build_gene_activity_matrix. This function takes an input CDS and a Cicero connection list, and outputs an unnormalized table of gene activity scores. IMPORTANT: the input CDS must have a column in the fData table called “gene” which indicates the gene if that peak is a promoter, and NA if the peak is distal. One way to add this column is demonstrated below.
The output of build_gene_activity_matrix is unnormalized. It must be normalized using a second function called normalize_gene_activities. If you intend to compare gene activities across different datasets of subsets of data, then all gene activity subsets should be normalized together, by passing in a list of unnormalized matrices. If you only wish to normalized one matrix, simply pass it to the function on its own. normalize_gene_activities also requires a named vector of of total accessible sites per cell. This is easily found in the pData table of your CDS, called “num_genes_expressed”. See below for an example.
#### Add a column for the pData table indicating the gene if a peak is a promoter ####
# Create a gene annotation set that only marks the transcription start sites of 
# the genes. We use this as a proxy for promoters.
# To do this we need the first exon of each transcript
pos <- subset(gene_annotation_sample, strand == "+")
pos <- pos[order(pos$start),] 
pos <- pos[!duplicated(pos$transcript),] # remove all but the first exons per transcript
pos$end <- pos$start + 1 # make a 1 base pair marker of the TSS
neg <- subset(gene_annotation_sample, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),] 
neg <- neg[!duplicated(neg$transcript),] # remove all but the first exons per transcript
neg$start <- neg$end - 1
gene_annotation_sub <- rbind(pos, neg)
# Make a subset of the TSS annotation columns containing just the coordinates 
# and the gene name
gene_annotation_sub <- gene_annotation_sub[,c(1:3, 8)]
# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"
input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)
#> Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
#>   - in 'x': chr18_GL456216_random
#>   - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr19, chr20, chr21, chr22, chrX, chrY, chrM
#>   Make sure to always combine/compare objects based on the same reference
#>   genome (use suppressWarnings() to suppress this warning).
#> Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
#>   - in 'x': chr18_GL456216_random
#>   - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr19, chr20, chr21, chr22, chrX, chrY, chrM
#>   Make sure to always combine/compare objects based on the same reference
#>   genome (use suppressWarnings() to suppress this warning).
head(fData(input_cds))
#>                               site_name chr    bp1    bp2 num_cells_expressed
#> chr18_10025_10225     chr18_10025_10225  18  10025  10225                   5
#> chr18_10603_11103     chr18_10603_11103  18  10603  11103                   1
#> chr18_11604_13986     chr18_11604_13986  18  11604  13986                   9
#> chr18_49557_50057     chr18_49557_50057  18  49557  50057                   2
#> chr18_50240_50740     chr18_50240_50740  18  50240  50740                   2
#> chr18_104385_104585 chr18_104385_104585  18 104385 104585                   1
#>                     overlap          gene
#> chr18_10025_10225        NA          <NA>
#> chr18_10603_11103         1    AP005530.1
#> chr18_11604_13986        NA          <NA>
#> chr18_49557_50057         1 RP11-683L23.1
#> chr18_50240_50740         1 RP11-683L23.1
#> chr18_104385_104585      NA          <NA>
#### Generate gene activity scores ####
# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)
# remove any rows/columns with all zeroes
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0, !Matrix::colSums(unnorm_ga) == 0]
# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))
# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)
# if you had two datasets to normalize, you would pass both:
# num_genes should then include all cells from both sets
unnorm_ga2 <- unnorm_ga
cicero_gene_activities <- normalize_gene_activities(list(unnorm_ga, unnorm_ga2), num_genes)The second major function of the Cicero package is to extend Monocle 2 for use with single-cell accessibility data. The main obstacle to overcome with chromatin accessibility data is the sparsity, so most of the extensions and methods are designed to address that.
We strongly recommend that you consult the Monocle website, especially this section prior to reading about Cicero’s extension of the Monocle analysis described. Briefly, Monocle infers pseudotime trajectories in three steps:
We will describe how each piece is modified for use with sparse single-cell chromatin accessibility data.
The primary way that the Cicero package deals with the sparsity of single-cell chromatin accessibility data is through aggregation. Aggregating the counts of either single cells or single peaks allows us to produce a “consensus” count matrix, reducing noise and allowing us to move out of the binary regime. Under this grouping, the number of cells in which a particular site is accessible can be modeled with a binomial distribution or, for sufficiently large groups, the corresponding Gaussian approximation. Modeling grouped accessibility counts as normally distributed allows Cicero to easily adjust them for arbitrary technical covariates by simply fitting a linear model and taking the residuals with respect to it as the adjusted accessibility score for each group of cells. We demonstrate how to apply this grouping practically below.
The primary aggregation used for trajectory reconstruction is between nearby peaks. This keeps single cells separate while aggregating regions of the genome and looking for chromatin accessibility within them. The function aggregate_nearby_peaks finds sites within a certain distance of each other and aggregates them together by summing their counts. Depending on the density of your data, you may want to try different distance parameters. In published work we have used 1,000 and 10,000.
data("cicero_data")
input_cds <- make_atac_cds(cicero_data)
# Add some cell meta-data
data("cell_data")
pData(input_cds) <- cbind(pData(input_cds), cell_data[row.names(pData(input_cds)),])
pData(input_cds)$cell <- NULL
agg_cds <- aggregate_nearby_peaks(input_cds, distance = 10000)
agg_cds <- detectGenes(agg_cds)
agg_cds <- estimateSizeFactors(agg_cds)
agg_cds <- estimateDispersions(agg_cds)
#> Warning: `group_by_()` was deprecated in dplyr 0.7.0.
#> Please use `group_by()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> Warning: `select_()` was deprecated in dplyr 0.7.0.
#> Please use `select()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> Removing 37 outliersThere are several options for choosing the sites to use during dimensionality reduction. Monocle has a discussion about the options here. Any of these options could be used with your new aggregated CDS, depending on the information you have a available in your dataset. Here, we will show two examples:
If your data has defined beginning and end points, you can determine which sites define progress by a differential accessibility test. We use Monocle’s differentialGeneTest function looking for sites that are different in the timepoint groups.
# This takes a few minutes to run
diff_timepoint <- differentialGeneTest(agg_cds,
                      fullModelFormulaStr="~timepoint + num_genes_expressed")
# We chose a very high q-value cutoff because there are
# so few sites in the sample dataset, in general a q-value
# cutoff in the range of 0.01 to 0.1 would be appropriate
ordering_sites <- row.names(subset(diff_timepoint, qval < .5))
length(ordering_sites)
#> [1] 255Alternatively, you can choose sites for dimensionality reduction by using Monocle’s dpFeature method. dpFeature chooses sites based on how they differ among clusters of cells. Here, we give some example code reproduced from Monocle, for more information, see the Monocle description.
plot_pc_variance_explained(agg_cds, return_all = FALSE) #Choose 2 PCs
agg_cds <- reduceDimension(agg_cds,
                              max_components = 2,
                              norm_method = 'log',
                              num_dim = 2,
                              reduction_method = 'tSNE',
                              verbose = TRUE)
agg_cds <- clusterCells(agg_cds, verbose = FALSE)
plot_cell_clusters(agg_cds, color_by = 'as.factor(Cluster)') + theme(text = element_text(size=8))
clustering_DA_sites <- differentialGeneTest(agg_cds[1:10,], #Subset for example only
                                             fullModelFormulaStr = '~Cluster')
# Not run because using Option 1 to continue
# ordering_sites <-
#  row.names(clustering_DA_sites)[order(clustering_DA_sites$qval)][1:1000]
However you choose your ordering sites, the first step of dimensionality reduction is to use setOrderingFilter to mark the sites you want to use for dimesionality reduction. In the following figures, we are using the ordering_sites from “Choose sites by differential analysis” above.
agg_cds <- setOrderingFilter(agg_cds, ordering_sites)Next, we use DDRTree to reduce dimensionality and then order the cells along the trajectory. Importantly, we use num_genes_expressed in our residual model formula to account for overall assay efficiency.
agg_cds <- reduceDimension(agg_cds, max_components = 2,
          residualModelFormulaStr="~num_genes_expressed",
          reduction_method = 'DDRTree')
agg_cds <- suppressWarnings(orderCells(agg_cds))
plot_cell_trajectory(agg_cds, color_by = "timepoint")Once you have a cell trajectory, you need to make sure that pseudotime proceeds how you expect. In our example, we want pseudotime to start where most of the time 0 cells are located and proceed towards the later timepoints. Further information on this can be found here on the Monocle website. We first color our trajectory plot by State, which is how Monocle assigns segments of the tree.
plot_cell_trajectory(agg_cds, color_by = "State")From this plot, we can see that the beginning of pseudotime should be from state 4. We now reorder cells setting the root state to 4. We can then check that the ordering makes sense by coloring the plot by Pseudotime.
agg_cds <- suppressWarnings(orderCells(agg_cds, root_state = 4))
plot_cell_trajectory(agg_cds, color_by = "Pseudotime")Now that we have Pseudotime values for each cell (pData(agg_cds)$Pseudotime), we need to assign these values back to our original CDS object. In addition, we will assign the State information back to the original CDS.
pData(input_cds)$Pseudotime <- pData(agg_cds)[colnames(input_cds),]$Pseudotime
pData(input_cds)$State <- pData(agg_cds)[colnames(input_cds),]$StateOnce you have your cells ordered in pseudotime, you can ask where in the genome chromatin accessibility is changing across time. If you know of specific sites that are important to your system, you may want to visualize the accessibility at those sites across pseudotime.
For simplicity, we will exclude the branch in our trajectory to make our trajectory linear.
input_cds_lin <- input_cds[,row.names(subset(pData(input_cds), State  != 5))]
plot_accessibility_in_pseudotime(input_cds_lin[c("chr18_38156577_38158261", 
                                                 "chr18_48373358_48374180", 
                                                 "chr18_60457956_60459080")])In the previous section, we used aggregation of sites to discover cell-level information (cell pseudotime). In this section, we are interested in a site-level statistic (whether a site is changing in pseudotime), so we will aggregate similar cells. To do this, Cicero has a useful function called aggregate_by_cell_bin.
We use the function aggregate_by_cell_bin to aggregate our input CDS object by a column in the pData table. In this example, we will assign cells to bins by cutting the pseudotime trajectory into 10 parts.
pData(input_cds_lin)$cell_subtype <- cut(pData(input_cds_lin)$Pseudotime, 10)
binned_input_lin <- aggregate_by_cell_bin(input_cds_lin, "cell_subtype")
#> Warning in if (isSparseMatrix(counts)) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (isSparseMatrix(exprs(cds))) {: the condition has length > 1 and
#> only the first element will be used
#> Removing 177 outliersWe are now ready to run Monocle’s differentialGeneTest function to find sites that are differentially accessible across pseudotime. In this example, we include num_genes_expressed as a covariate to subtract its effect.
diff_test_res <- differentialGeneTest(binned_input_lin[1:10,], #Subset for example only
    fullModelFormulaStr="~sm.ns(Pseudotime, df=3) + sm.ns(num_genes_expressed, df=3)",
    reducedModelFormulaStr="~sm.ns(num_genes_expressed, df=3)", cores=1)
#> Warning in if (isSparseMatrix(exprs(X))) {: the condition has length > 1 and
#> only the first element will be used
head(diff_test_res)
#>                         status           family       pval      qval
#> chr18_10006196_10006822     OK negbinomial.size 0.42937624 0.8587525
#> chr18_10010479_10011360     OK negbinomial.size 0.07366305 0.2455435
#> chr18_10015203_10015819     OK negbinomial.size 0.88112536 0.9176184
#> chr18_10015940_10017274     OK negbinomial.size 0.02272776 0.2272776
#> chr18_10025_10225           OK negbinomial.size 0.04677373 0.2338687
#> chr18_10032281_10032988     OK negbinomial.size 0.83847197 0.9176184
#>                                       site_name num_cells_expressed
#> chr18_10006196_10006822 chr18_10006196_10006822                   3
#> chr18_10010479_10011360 chr18_10010479_10011360                   7
#> chr18_10015203_10015819 chr18_10015203_10015819                   3
#> chr18_10015940_10017274 chr18_10015940_10017274                   6
#> chr18_10025_10225             chr18_10025_10225                   2
#> chr18_10032281_10032988 chr18_10032281_10032988                   1
#>                         use_for_ordering
#> chr18_10006196_10006822            FALSE
#> chr18_10010479_10011360            FALSE
#> chr18_10015203_10015819            FALSE
#> chr18_10015940_10017274            FALSE
#> chr18_10025_10225                  FALSE
#> chr18_10032281_10032988            FALSEBlondel, V.D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks.
Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
Sanborn, A.L., Rao, S.S.P., Huang, S.-C., Durand, N.C., Huntley, M.H., Jewett, A.I., Bochkov, I.D., Chinnappan, D., Cutkosky, A., Li, J., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. . Proc. Natl. Acad. Sci. U. S. A. 112, E6456–E6465.
Sexton, T., Yaffe, E., Kenigsberg, E., Bantignies, F., Leblanc, B., Hoichman, M., Parrinello, H., Tanay, A., and Cavalli, G. (2012). Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. . Cell 148:3, 458-472.
citation("cicero")
#> 
#>   Hannah A. Pliner, Jay Shendure & Cole Trapnell et. al. (2018). Cicero
#>   Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin
#>   Accessibility Data. Molecular Cell, 71, 858-871.e8.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data},
#>     journal = {Molecular Cell},
#>     volume = {71},
#>     number = {5},
#>     pages = {858 - 871.e8},
#>     year = {2018},
#>     issn = {1097-2765},
#>     doi = {https://doi.org/10.1016/j.molcel.2018.06.044},
#>     author = {Hannah A. Pliner and Jonathan S. Packer and José L. McFaline-Figueroa and Darren A. Cusanovich and Riza M. Daza and Delasa Aghamirzaie and Sanjay Srivatsan and Xiaojie Qiu and Dana Jackson and Anna Minkina and Andrew C. Adey and Frank J. Steemers and Jay Shendure and Cole Trapnell},
#>   }sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#>  [1] grid      splines   stats4    stats     graphics  grDevices utils    
#>  [8] datasets  methods   base     
#> 
#> other attached packages:
#>  [1] cicero_1.12.0        Gviz_1.38.0          GenomicRanges_1.46.0
#>  [4] GenomeInfoDb_1.30.0  IRanges_2.28.0       S4Vectors_0.32.0    
#>  [7] monocle_2.22.0       DDRTree_0.1.5        irlba_2.3.3         
#> [10] VGAM_1.1-5           ggplot2_3.3.5        Biobase_2.54.0      
#> [13] BiocGenerics_0.40.0  Matrix_1.3-4        
#> 
#> loaded via a namespace (and not attached):
#>   [1] backports_1.2.1             Hmisc_4.6-0                
#>   [3] BiocFileCache_2.2.0         plyr_1.8.6                 
#>   [5] igraph_1.2.7                lazyeval_0.2.2             
#>   [7] BiocParallel_1.28.0         densityClust_0.3           
#>   [9] fastICA_1.2-3               digest_0.6.28              
#>  [11] ensembldb_2.18.0            htmltools_0.5.2            
#>  [13] viridis_0.6.2               fansi_0.5.0                
#>  [15] magrittr_2.0.1              checkmate_2.0.0            
#>  [17] memoise_2.0.0               BSgenome_1.62.0            
#>  [19] cluster_2.1.2               limma_3.50.0               
#>  [21] Biostrings_2.62.0           matrixStats_0.61.0         
#>  [23] docopt_0.7.1                prettyunits_1.1.1          
#>  [25] jpeg_0.1-9                  colorspace_2.0-2           
#>  [27] blob_1.2.2                  rappdirs_0.3.3             
#>  [29] ggrepel_0.9.1               xfun_0.27                  
#>  [31] dplyr_1.0.7                 sparsesvd_0.2              
#>  [33] crayon_1.4.1                RCurl_1.98-1.5             
#>  [35] jsonlite_1.7.2              VariantAnnotation_1.40.0   
#>  [37] survival_3.2-13             glue_1.4.2                 
#>  [39] gtable_0.3.0                zlibbioc_1.40.0            
#>  [41] XVector_0.34.0              DelayedArray_0.20.0        
#>  [43] scales_1.1.1                pheatmap_1.0.12            
#>  [45] DBI_1.1.1                   Rcpp_1.0.7                 
#>  [47] viridisLite_0.4.0           progress_1.2.2             
#>  [49] htmlTable_2.3.0             proxy_0.4-26               
#>  [51] foreign_0.8-81              bit_4.0.4                  
#>  [53] Formula_1.2-4               htmlwidgets_1.5.4          
#>  [55] httr_1.4.2                  FNN_1.1.3                  
#>  [57] RColorBrewer_1.1-2          ellipsis_0.3.2             
#>  [59] farver_2.1.0                pkgconfig_2.0.3            
#>  [61] XML_3.99-0.8                nnet_7.3-16                
#>  [63] sass_0.4.0                  dbplyr_2.1.1               
#>  [65] utf8_1.2.2                  labeling_0.4.2             
#>  [67] tidyselect_1.1.1            rlang_0.4.12               
#>  [69] reshape2_1.4.4              AnnotationDbi_1.56.0       
#>  [71] munsell_0.5.0               tools_4.1.1                
#>  [73] cachem_1.0.6                generics_0.1.1             
#>  [75] RSQLite_2.2.8               evaluate_0.14              
#>  [77] stringr_1.4.0               fastmap_1.1.0              
#>  [79] yaml_2.2.1                  knitr_1.36                 
#>  [81] bit64_4.0.5                 purrr_0.3.4                
#>  [83] RANN_2.6.1                  KEGGREST_1.34.0            
#>  [85] AnnotationFilter_1.18.0     glasso_1.11                
#>  [87] slam_0.1-48                 xml2_1.3.2                 
#>  [89] biomaRt_2.50.0              compiler_4.1.1             
#>  [91] rstudioapi_0.13             filelock_1.0.2             
#>  [93] curl_4.3.2                  png_0.1-7                  
#>  [95] tibble_3.1.5                bslib_0.3.1                
#>  [97] stringi_1.7.5               highr_0.9                  
#>  [99] GenomicFeatures_1.46.0      lattice_0.20-45            
#> [101] ProtGenerics_1.26.0         HSMMSingleCell_1.13.0      
#> [103] vctrs_0.3.8                 pillar_1.6.4               
#> [105] lifecycle_1.0.1             combinat_0.0-8             
#> [107] jquerylib_0.1.4             data.table_1.14.2          
#> [109] bitops_1.0-7                rtracklayer_1.54.0         
#> [111] R6_2.5.1                    BiocIO_1.4.0               
#> [113] latticeExtra_0.6-29         gridExtra_2.3              
#> [115] dichromat_2.0-0             assertthat_0.2.1           
#> [117] SummarizedExperiment_1.24.0 rjson_0.2.20               
#> [119] withr_2.4.2                 GenomicAlignments_1.30.0   
#> [121] qlcMatrix_0.9.7             Rsamtools_2.10.0           
#> [123] GenomeInfoDbData_1.2.7      parallel_4.1.1             
#> [125] hms_1.1.1                   rpart_4.1-15               
#> [127] tidyr_1.1.4                 rmarkdown_2.11             
#> [129] MatrixGenerics_1.6.0        Rtsne_0.15                 
#> [131] biovizBase_1.42.0           base64enc_0.1-3            
#> [133] restfulr_0.0.13