---
title: "Introduction to INCVCommunityDetection"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to INCVCommunityDetection}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

`INCVCommunityDetection` implements **Inductive Node-Splitting Cross-Validation (INCV)** for selecting the number of communities in Stochastic Block Models (SBM). The package also provides competing methods (**CROISSANT**, **Edge Cross-Validation (ECV)**, and **Node Cross-Validation (NCV)**) for comprehensive model selection in network analysis.

## Simulating a network

We start by generating a network from a planted-partition SBM with 3 communities, 150 nodes, within-community connection probability 0.5, and between-community probability 0.05.

```{r simulate}
library(INCVCommunityDetection)

set.seed(42)
net <- community.sim(k = 3, n = 150, n1 = 50, p = 0.5, q = 0.05)
table(net$membership)
```

The adjacency matrix is a 150 × 150 binary symmetric matrix:

```{r adj-dim, fig.width=5, fig.height=5}
dim(net$adjacency)

# Reorder nodes by community so the block structure is visible
ord <- order(net$membership)
image(net$adjacency[ord, ord],
      main = "Adjacency matrix (3-community SBM, reordered)",
      xlab = "Node", ylab = "Node")
```

## Selecting K with INCV (f-fold)

The main function `nscv.f.fold()` partitions nodes into `f` folds and uses spectral clustering on the training subgraph. Held-out nodes are assigned to communities based on their connections to training nodes, and the held-out negative log-likelihood and MSE are computed.

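
To make the fold logic concrete, here is a minimal base-R sketch of one way a node-splitting cross-validation could work. It is not the package's implementation: the helpers `sim_sbm()` and `holdout_nll()` are hypothetical, held-out nodes are assigned by maximum likelihood, and the loss is evaluated on pairs of held-out nodes. All of these are assumptions made for illustration.

```{r incv-sketch, eval = FALSE}
set.seed(1)

# Hypothetical helper: toy planted-partition SBM (not part of the package)
sim_sbm <- function(n, K, p, q) {
  z <- rep(seq_len(K), length.out = n)
  P <- matrix(q, K, K)
  diag(P) <- p
  probs <- P[z, z]
  A <- matrix(0, n, n)
  up <- upper.tri(A)
  A[up] <- rbinom(sum(up), 1, probs[up])
  list(A = A + t(A), z = z)
}

# Hypothetical helper: held-out negative log-likelihood for one split and one K
holdout_nll <- function(A, train, test, K) {
  A11 <- A[train, train]
  # Spectral clustering: k-means on the K leading eigenvectors
  eig <- eigen(A11, symmetric = TRUE)
  top <- order(abs(eig$values), decreasing = TRUE)[seq_len(K)]
  z_train <- kmeans(eig$vectors[, top, drop = FALSE],
                    centers = K, nstart = 20)$cluster
  # Block probabilities estimated from the training subgraph
  B <- matrix(0, K, K)
  for (a in seq_len(K)) for (b in seq_len(K)) {
    ia <- which(z_train == a)
    ib <- which(z_train == b)
    if (a == b) {
      m <- length(ia)
      B[a, b] <- if (m > 1) sum(A11[ia, ia]) / (m * (m - 1)) else 0
    } else {
      B[a, b] <- mean(A11[ia, ib])
    }
  }
  B <- pmin(pmax(B, 1e-6), 1 - 1e-6)  # keep the logs finite
  # Assumed rule: assign each held-out node to the community that
  # maximises the likelihood of its edges to the training nodes
  A21 <- A[test, train, drop = FALSE]
  Bz <- B[, z_train, drop = FALSE]    # K x n_train
  ll <- A21 %*% t(log(Bz)) + (1 - A21) %*% t(log(1 - Bz))
  z_test <- max.col(ll, ties.method = "first")
  # Assumed loss: Bernoulli negative log-likelihood on held-out node pairs
  A22 <- A[test, test]
  P22 <- B[z_test, z_test]
  up <- upper.tri(A22)
  -sum(A22[up] * log(P22[up]) + (1 - A22[up]) * log(1 - P22[up]))
}

net_toy <- sim_sbm(n = 90, K = 3, p = 0.5, q = 0.05)
f <- 3
fold <- sample(rep(seq_len(f), length.out = 90))
k_vec <- 2:5

# Sum the held-out loss over folds for each candidate K
cv_loss <- sapply(k_vec, function(K) {
  sum(sapply(seq_len(f), function(j) {
    holdout_nll(net_toy$A, which(fold != j), which(fold == j), K)
  }))
})
k_vec[which.min(cv_loss)]  # candidate K with the smallest CV loss
```

The chunk is marked `eval = FALSE` so that it does not change the random number stream used by the chunks that follow; copy it into a fresh session to run it.
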
```{r incv-ffold}
result <- nscv.f.fold(net$adjacency, k.vec = 2:6, f = 5)
result$k.loss  # K selected by neg-log-likelihood
result$k.mse   # K selected by MSE
```

We can inspect the full CV loss curve:

```{r loss-curve, fig.width=6, fig.height=4}
plot(2:6, result$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV f-fold: CV loss by K")
abline(v = result$k.loss, lty = 2, col = "red")
```

## Selecting K with INCV (random split)

An alternative is to use repeated random node splits instead of fixed folds:

```{r incv-random}
result2 <- nscv.random.split(net$adjacency, k.vec = 2:6,
                             split = 0.66, ite = 20)
result2$k.chosen
```

```{r random-curve, fig.width=6, fig.height=4}
plot(2:6, result2$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV random-split: CV loss by K")
abline(v = result2$k.chosen, lty = 2, col = "red")
```

## Comparing with ECV and NCV

### Edge Cross-Validation

ECV holds out random edges and evaluates the predictive fit of a blockmodel reconstruction. It jointly selects between SBM and DCBM.

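
As a rough illustration of what "predictive fit" can mean, the sketch below scores a set of held-out entries against predicted edge probabilities using three common criteria: binomial deviance, squared (L2) error, and AUC. The values of `y` and `p_hat` are made up, and the formulas are the textbook versions; the exact definitions used inside `ECV.for.blockmodel()` may differ.

```{r ecv-criteria-sketch}
# Made-up held-out entries (1 = edge, 0 = no edge) and predicted probabilities
y     <- c(1, 0, 1, 1, 0, 0, 1, 0)
p_hat <- c(0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.4, 0.5)

# Squared-error (L2) loss: smaller is better
l2 <- sum((y - p_hat)^2)

# Bernoulli negative log-likelihood (deviance up to a constant factor):
# smaller is better
dev <- -sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# AUC via the rank-sum (Mann-Whitney) formula: larger is better
r  <- rank(p_hat)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(l2 = l2, dev = dev, auc = auc)
```

In a cross-validation loop these scores would be averaged over repeated holdouts for each candidate model, and the model with the best average would be reported.
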
```{r ecv}
ecv <- ECV.for.blockmodel(net$adjacency, max.K = 6, B = 3)
ecv$dev.model  # best by deviance
ecv$l2.model   # best by L2
ecv$auc.model  # best by AUC
```

### Node Cross-Validation

NCV holds out random nodes and evaluates predictions on the held-out sub-network:

```{r ncv}
ncv <- NCV.for.blockmodel(net$adjacency, max.K = 6, cv = 3)
ncv$dev.model
ncv$l2.model
```

## Summary of methods

| Method | Function | Splits | Selects K | Selects model type |
|--------|----------|--------|-----------|-------------------|
| INCV f-fold | `nscv.f.fold()` | Nodes into f folds | Yes | No (SBM only) |
| INCV random | `nscv.random.split()` | Random node split | Yes | No (SBM only) |
| ECV | `ECV.for.blockmodel()` | Random edge holdout | Yes | Yes (SBM vs DCBM) |
| NCV | `NCV.for.blockmodel()` | Node folds | Yes | Yes (SBM vs DCBM) |
| CROISSANT | `croissant.blockmodel()` | Overlapping subsamples | Yes | Yes (SBM vs DCBM) |

## Spectral clustering and probability estimation

The building blocks are also available directly:

```{r spectral}
cl <- SBM.spectral.clustering(net$adjacency, k = 3)
table(cl$cluster)

prob <- SBM.prob(cl$cluster, k = 3, A = net$adjacency, restricted = TRUE)
round(prob$p.matrix, 3)
```

## Distance-decaying SBM simulation

For more realistic simulations, `community.sim.sbm()` generates networks where block probabilities decay with community distance:

```{r sbm-decay}
net2 <- community.sim.sbm(n = 120, n1 = 40, eta = 0.3, rho = 0.2, K = 4)
round(net2$conn, 4)
```

## Session info

```{r session}
sessionInfo()
```