---
title: "TemporalForest: A Quick Start Guide"
output: rmarkdown::html_vignette
author:
  - name: "Sisi Shao"
    affiliation: "Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA"
    orcid: "0009-0000-9783-9205"
  - name: "Jason H. Moore"
    affiliation: |
      Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA
      Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
    orcid: "0000-0002-5015-1099"
  - name: "Christina M. Ramirez"
    affiliation: "Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA"
    corresponding: true
    email: "cr@ucla.edu"
    orcid: "0000-0002-8435-0416"
bibliography: refs.bib
link-citations: yes
vignette: >
  %\VignetteIndexEntry{A Quick Start Guide to TemporalForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
old_ops <- options()
suppressPackageStartupMessages(library(TemporalForest))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  message = FALSE,
  warning = FALSE
)
options(stringsAsFactors = FALSE)
suppressPackageStartupMessages({
  ok_wgcna <- requireNamespace("WGCNA", quietly = TRUE)
})
if (ok_wgcna && "disableWGCNAThreads" %in% getNamespaceExports("WGCNA")) {
  suppressMessages(WGCNA::disableWGCNAThreads())
}
```

## Abstract

The TemporalForest package provides a reproducible method for feature selection in high-dimensional longitudinal data. It combines network analysis, mixed-effects models, and stability selection to identify robust predictors over time. This vignette offers a quick start guide to using the package.

## 1. Introduction

Longitudinal 'omics studies, where subjects are measured repeatedly over time, present unique challenges for feature selection: high dimensionality, temporal dependence, and complex correlations.
The `TemporalForest` algorithm addresses these by creating a robust, multi-stage pipeline that identifies features which are both predictive and stable across resamples.

## 2. Installation

Since the package is not yet on CRAN, you can install the development version from GitHub:

```{r eval=FALSE}
# install.packages("remotes")
remotes::install_github("SisiShao/TemporalForest")
```

## 3. Quick Start: Primary Example

This example walks you through a complete analysis with a small, simulated dataset.

### Simulate a Longitudinal Dataset

This tiny demo is designed to always return all true signals quickly (1–3s).

We will simulate a dataset with 60 subjects, 2 time points, and 20 potential predictors. We will inject **3 true signals** into the outcome \(Y\), coming from predictors `V1`, `V2`, and `V3`. To ensure the example is fast and reliable for CRAN, we will pass a precomputed dissimilarity matrix to **skip Stage 1 (WGCNA/TOM)**.

```{r}
set.seed(11)  # For reproducibility
n_subjects <- 60; n_timepoints <- 2; p <- 20

# Build X (two time points) with matching colnames
X <- replicate(n_timepoints, matrix(rnorm(n_subjects * p), n_subjects, p), simplify = FALSE)
colnames(X[[1]]) <- colnames(X[[2]]) <- paste0("V", 1:p)

# Long view and IDs
X_long <- do.call(rbind, X)
id   <- rep(seq_len(n_subjects), each = n_timepoints)
time <- rep(seq_len(n_timepoints), times = n_subjects)

# Strong signal on V1, V2, V3 + modest subject random effect + small noise
u_subj <- rnorm(n_subjects, 0, 0.7)
eps    <- rnorm(length(id), 0, 0.08)
Y <- 4*X_long[, "V1"] + 3.5*X_long[, "V2"] + 3.2*X_long[, "V3"] +
  rep(u_subj, each = n_timepoints) + eps

# Lightweight dissimilarity to skip Stage 1 (fast on CRAN)
A <- 1 - abs(stats::cor(X_long)); diag(A) <- 0
dimnames(A) <- list(colnames(X[[1]]), colnames(X[[1]]))
```

### Run TemporalForest

We call the main function, passing our precomputed `dissimilarity_matrix = A` and asking for 3 features.

```{r}
# Run TemporalForest with minimal settings for vignette
tf_result <- temporal_forest(
  X = X, Y = Y, id = id, time = time,
  dissimilarity_matrix = A,   # skip WGCNA/TOM (Stage 1)
  n_features_to_select = 3,
  n_boot_screen = 4,          # Very low for quick demo
  n_boot_select = 8,          # Very low for quick demo
  keep_fraction_screen = 1,   # Permissive screening
  min_module_size = 2,
  alpha_screen = 0.5,         # Permissive screening
  alpha_select = 0.6
)
```

### Interpret the Results

Examine the selected features and check if the true predictors were found.

```{r}
print(tf_result)
```

```{r}
# Validate against ground truth
true_predictors <- c("V1", "V2", "V3")
cat("True predictors found:", sum(true_predictors %in% tf_result$top_features),
    "out of", length(true_predictors), "\n")
```

The algorithm successfully identified all three true predictors in this high signal-to-noise example.

## 4. How TemporalForest Works

TemporalForest operates in three stages:

1. **Time-Aware Module Construction:** Groups correlated features into modules that are stable across time points using a consensus topological overlap matrix (TOM).
2. **Within-Module Screening:** Uses mixed-effects model trees to select the most important predictor from each module while accounting for within-subject correlations.
3. **Stability Selection:** Applies bootstrapping to calculate selection probabilities, ensuring only the most reproducible features are included in the final set.

## 5. Key Parameters Guide

- `n_features_to_select`: Final number of features to return (default: 10)
- `n_boot_screen`, `n_boot_select`: Number of bootstrap samples for screening and selection stages. Increase for more stable results (defaults: 50, 100).
- `keep_fraction_screen`: Proportion of features from each module passed to final selection (default: 0.25). Increase if too few features are selected.
- `min_module_size`: Minimum size for network modules (default: 4).
- `alpha_screen`, `alpha_select`: Significance levels for splitting in screening and selection trees (defaults: 0.2, 0.05).

## 6. Troubleshooting

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| No features selected | Screening too strict | Increase `keep_fraction_screen` or `alpha_screen` |
| Too many features selected | Selection too liberal | Decrease `keep_fraction_screen` or `alpha_select` |
| Long computation time | Data too large | Reduce bootstrap numbers or pre-filter features |

## 7. Input Data Validation

The package includes checks for proper data formatting. Here's an example of the error message for inconsistent inputs:

```{r error=TRUE}
# This will produce a clear error message
mat1 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "B")))
mat2 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "C")))
bad_X <- list(mat1, mat2)
TemporalForest::check_temporal_consistency(bad_X)
```

## 8. Conclusion

TemporalForest provides an end-to-end solution for reproducible feature selection in longitudinal high-dimensional data. For detailed information on all function parameters and advanced usage, see the package documentation (`?TemporalForest`).

## 9. Citation

To cite TemporalForest in publications, please use:

```{r citation}
citation("TemporalForest")
```

## Session Info

```{r}
sessionInfo()
options(old_ops)
```