--- title: "Introduction to mLLMCelltype" author: "Chen Yang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to mLLMCelltype} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 6 ) ``` # Introduction to mLLMCelltype ## Overview mLLMCelltype is an iterative multi-LLM consensus framework for cell type annotation in single-cell RNA sequencing data. By leveraging the complementary strengths of multiple large language models, this framework aims to improve annotation accuracy while providing transparent uncertainty quantification. The package implements a novel approach where multiple large language models (LLMs) collaborate through structured deliberation to achieve potentially more reliable annotations by cross-validating across models. ## Background Cell type annotation is a critical step in single-cell RNA sequencing (scRNA-seq) analysis. Traditional methods often rely on reference datasets or marker gene databases, which can be limited by the availability of high-quality references and the complexity of cell types across different tissues and conditions. Large language models have shown promising results in cell type annotation by leveraging their extensive knowledge of biological literature and ability to reason about gene expression patterns. However, individual LLMs can produce hallucinations or make errors due to limitations in their training data or reasoning capabilities. mLLMCelltype addresses these challenges by implementing a consensus-based approach where multiple LLMs collaborate to provide more reliable annotations. ## Key Features ### Multi-LLM Consensus Architecture mLLMCelltype combines predictions from multiple LLMs to reduce individual model biases. The package currently supports a wide range of models: - OpenAI GPT-4o/4.1 - Anthropic Claude-3.7/3.5 - Google Gemini-2.0/2.5 (including Gemini-2.0-Flash-Lite) - X.AI Grok-3 - DeepSeek-V3 - Alibaba Qwen2.5/Qwen3 - Zhipu GLM-4 - MiniMax - Stepfun - OpenRouter By integrating multiple models with different architectures and training data, mLLMCelltype can produce annotations less sensitive to individual model biases. ### Structured Deliberation Process The package enables LLMs to share reasoning, evaluate evidence, and refine annotations through multiple rounds of collaborative discussion. This structured deliberation process includes: 1. **Initial independent annotation** by each LLM 2. **Identification of controversial clusters** where models disagree 3. **Structured discussion** where models share their reasoning and evaluate others' arguments 4. **Consensus formation** through iterative refinement This process mimics how a panel of human experts might collaborate to reach a consensus on difficult cases. ### Transparent Uncertainty Quantification mLLMCelltype provides quantitative metrics to identify ambiguous cell populations that may require expert review: - **Consensus Proportion**: Measures the level of agreement among LLMs - **Shannon Entropy**: Quantifies the uncertainty in the annotations These metrics help researchers identify which cell clusters have high confidence annotations and which may require further investigation. ### Other Advanced Features - **Hallucination Reduction**: Cross-model comparison may help identify inconsistent predictions - **Robust to Input Noise**: Maintains high accuracy even with imperfect marker gene lists - **Hierarchical Annotation Support**: Optional extension for multi-resolution analysis - **No Reference Dataset Required**: Performs annotation without pre-training or reference data - **Complete Reasoning Chains**: Documents the full deliberation process - **Integration**: Works with standard Scanpy/Seurat workflows - **Modular Design**: Easily incorporate new LLMs as they become available - **Enhanced Marker Gene Visualization**: Publication-ready bubble plots and heatmaps showing marker gene expression patterns across annotated cell types ## Applicable Scenarios mLLMCelltype is designed for a wide range of single-cell RNA sequencing analysis scenarios: - **Novel tissue types** where reference datasets may be limited - **Rare or poorly characterized cell populations** - **Complex tissues** with many similar cell types - **Integrative analysis** across multiple datasets - **Quality control** to validate annotations from other methods ## Latest Updates ### v1.1.4 (2025-04-24) #### Bug Fixes - Fixed a critical issue in cluster index handling for CSV-style inputs - Fixed negative index (-1) edge cases during parsing - Improved cluster ID validation and diagnostics #### Improvements - Improved input-contract consistency for cluster IDs across core workflows - Hardened parser behavior for mixed model output formats - Updated code comments and examples for clarity For a complete list of updates, please refer to the [NEWS.md](../NEWS.md) file. ## Getting Started To get started with mLLMCelltype, please refer to the [Installation Guide](https://cafferyang.com/mLLMCelltype/articles/installation.html) and [Quick Start Guide](https://cafferyang.com/mLLMCelltype/articles/getting-started.html). ## Citation If you use mLLMCelltype in your research, please cite: ``` Yang, C., Zhang, X., & Chen, J. (2025). Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. bioRxiv. https://doi.org/10.1101/2025.04.10.647852 ``` ## Next Steps - [Installation Guide](https://cafferyang.com/mLLMCelltype/articles/installation.html): Learn how to install and configure mLLMCelltype - [Quick Start Guide](https://cafferyang.com/mLLMCelltype/articles/getting-started.html): Get started with basic usage examples - [Usage Tutorial](https://cafferyang.com/mLLMCelltype/articles/usage-tutorial.html): Explore detailed usage scenarios - [Consensus Annotation Principles](https://cafferyang.com/mLLMCelltype/articles/consensus-principles.html): Understand the technical principles