immReferent 0.99.5
The immReferent package provides a centralized and easy-to-use interface for downloading, managing, and loading immune repertoire and HLA reference sequences from the IMGT, IPD-IMGT/HLA and OGRDB databases. Its primary goal is to ensure that analyses are based on consistent, up-to-date, and correctly formatted reference data.
This vignette will walk you through the basic functionality of the package.
devtools::install_github("BorchLab/immReferent")
Or via Bioconductor (once accepted)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("immReferent")
library(immReferent)
The main function for all data retrieval is getIMGT(). It handles both downloading new data from the source and loading previously downloaded data from a local cache.
The IPD-IMGT/HLA database provides reference sequences for the Human Leukocyte Antigen (HLA) system. You can download the complete set of nucleotide or protein sequences using gene = "HLA".
# Download all available HLA protein sequences
# This will download the file to the cache on the first run
hla_prot <- getIMGT(gene = "HLA",
type = "PROT")
# Inspect the result
print(hla_prot)
## AAStringSet object of length 43796:
## width seq names
## [1] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA00001 A*01...
## [2] 200 MAVMAPRTLLLLLSGALALTQ...GGARGTGLTAGSGPGSHTIQX HLA:HLA02169 A*01...
## [3] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA14798 A*01...
## [4] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA15760 A*01...
## [5] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA16415 A*01...
## ... ... ...
## [43792] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38075 TAP2...
## [43793] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38025 TAP2...
## [43794] 704 MRLPDLRPWTSLLLADAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38029 TAP2...
## [43795] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38159 TAP2...
## [43796] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38463 TAP2...
cat("Number of sequences:", length(hla_prot), "\n")
## Number of sequences: 43796
cat("First sequence name:", names(hla_prot)[1], "\n")
## First sequence name: HLA:HLA00001 A*01:01:01:01 365 bp
For T-cell receptor (TCR) and B-cell receptor (BCR) genes, you can specify the species and the gene or gene family you are interested in.
# Download human IGHV nucleotide sequences
ighv_nuc <- getIMGT(species = "human",
gene = "IGHV",
type = "NUC")
# Inspect the result
print(ighv_nuc)
## DNAStringSet object of length 467:
## width seq names
## [1] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA M99641|IGHV1-18*0...
## [2] 300 CAGGTTCAGCTGGTGCAGTCTG...CCTAAGATCTGACGACACGGCC X60503|IGHV1-18*0...
## [3] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA HM855463|IGHV1-18...
## [4] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA KC713938|IGHV1-18...
## [5] 320 CAGGTGCAGCTGGTGCAGTCTG...TCGTGTATTACTGTGCGAGAGA X07448|IGHV1-2*01...
## ... ... ...
## [463] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA AB019438|IGHV8-51...
## [464] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA BK063799|IGHV8-51...
## [465] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IMGT000055|IGHV8-...
## [466] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA AC279961|IGHV8-51...
## [467] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA BK068299|IGHV8-51...
You can also download entire families of genes at once by specifying a group name like "IGH", "IGK", "TRB", etc.
# Download all mouse TRB genes (V, D, J, and C)
trb_mouse <- getIMGT(species = "mouse",
gene = "TRB",
type = "NUC")
# This object will contain TRBV, TRBD, TRBJ, and TRBC sequences
print(trb_mouse)
## DNAStringSet object of length 75:
## width seq names
## [1] 326 GTGACTTTGCTGGAGCAAAACCC...TGTACTGCACCTGCAGTGCAGA AE000663|TRBV1*01...
## [2] 324 GTGACTTTGCTGGAGCAAAACCC...CTTGTACTGCACCTGCAGTGCG X01642|TRBV1*02|M...
## [3] 325 GATGGTGGAATCACCCAGACACC...TTCTGGGCCAGCAGTGAACAAA X16694|TRBV10*01|...
## [4] 324 GATTCTGGGGTTGTCCAGTCTCC...GTACTTCTGTGCCAGCTCTCTC M15614|TRBV12-1*0...
## [5] 321 GATTCTGGGGTTGTCCAGTCTCC...TATGTACTTCTGTGCCAGCTCT M30881|TRBV12-1*0...
## ... ... ...
## [71] 48 CAGCCCTTGCCCTGACTGATTGGCAGCCGATTGAACAGCCTATGCGAG K02802|TRBJ2-6*01...
## [72] 47 CTCCTATGAACAGTACTTCGGTCCCGGCACCAGGCTCACGGTTTTAG K02802|TRBJ2-7*01...
## [73] 47 CTCCTATGAACAGTACTTCGGTCCCGGCACTAGGCTCACGGTTTTAG M16122|TRBJ2-7*02...
## [74] 519 NAGGATCTGAGAAATGTGACTCC...CATGGTCAAGAAAAAAAATTCC M26057+M26058+M26...
## [75] 519 NAGGATCTGAGAAATGTGACTCC...CATGGTCAAGAAAAAAAATTCC AE000665|TRBC2*03...
OGRDB provides AIRR-compliant germline sets for immunoglobulin loci (and growing coverage more broadly). You can retrieve:
DNAStringSet (and optionally translate to AAStringSet).# Human IGH nucleotide sequences (gapped FASTA)
igh_ogrdb <- getOGRDB(
species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED"
)
igh_ogrdb
## DNAStringSet object of length 236:
## width seq names
## [1] 17 GGTACAACTGGAACGAC IGHD1-1*01
## [2] 17 GGTATAACCGGAACCAC IGHD1-14*01
## [3] 17 GGTATAACTGGAACGAC IGHD1-20*01
## [4] 20 GGTATAGTGGGAGCTACTAC IGHD1-26*01
## [5] 17 GGTATAACTGGAACTAC IGHD1-7*01
## ... ... ...
## [232] 320 CAGGTGCAGCTGGTGCAGTCTG...CCATGTATTACTGTGCGAGATA IGHV7-81*01
## [233] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*02
## [234] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*03
## [235] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*04
## [236] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*05
Pulling the AIRR-formatted sequences:
# Human IGK sequences via AIRR JSON (parsed to DNAStringSet)
igk_airr <- getOGRDB(
species = "human",
locus = "IGK",
type = "NUC",
format = "AIRR"
)
igk_airr
## DNAStringSet object of length 71:
## width seq names
## [1] 764 ATGAGGCTCCCTGCTCAGCTCCT...GGTACAGCCCTGAACAGAAACC IGKV2D-40*01
## [2] 499 ATGAGGGTCCCCGCTCAGCTCCT...GTTACACACCCGAACAAAAACC IGKV1-NL1*01
## [3] 582 ATGTCGCCATCACAACTCATTGG...GTTACAACCCAGAACAAAAACT IGKV6D-21*02
## [4] 833 ATGAGGCTCCTTGCTCAGCTTCT...GGTACAGCTCTGAACAAAAACC IGKV2-24*02
## [5] 499 ATGAGGGTCCCCGCTCAGCTCCT...GTTACCAACCCGAACATAAACC IGKV1-12*01
## ... ... ...
## [67] 77 GATTTTTGTTAAGGGGAAAGTAA...GGGACACGACTGGAGATTAAAC IGKJ5*01
## [68] 77 GGTTTCTGTTCAGCAAGACAATG...GGGACCAAGGTGGAAATCAAAC IGKJ1*01
## [69] 78 AGTTTTTGTATAGGAGGGAAGTT...GGGACCAAGCTGGAGATCAAAC IGKJ2*04
## [70] 77 GGTTTTTGTTGAGGGGAAAGGGT...GGGACCAAGGTGGAGATCAAAC IGKJ4*01
## [71] 76 GGTTTTTGTAAAGGGAAAAGTTA...GGGACCAAAGTGGATATCAAAC IGKJ3*01
immReferent automatically caches all downloaded data to avoid repeated downloads and to enable offline work.
You can see all the files currently in your cache using the listIMGT() or listOGRDB() function.
# List the full paths of all cached files
listIMGT()
## [1] "/home/biocbuild/.immReferent/Human/ogrdb/Human_IGH_VDJ_published_gapped.fasta"
## [2] "/home/biocbuild/.immReferent/Human/ogrdb/Human_IGKappa_VJ_published_airr.json"
## [3] "/home/biocbuild/.immReferent/human/hla/hla_prot.fasta"
## [4] "/home/biocbuild/.immReferent/human/vdj/imgt_human_IGHV.fasta"
## [5] "/home/biocbuild/.immReferent/immReferent_log.yaml"
## [6] "/home/biocbuild/.immReferent/mouse/constant/imgt_mouse_TRBC.fasta"
## [7] "/home/biocbuild/.immReferent/mouse/vdj/imgt_mouse_TRBD.fasta"
## [8] "/home/biocbuild/.immReferent/mouse/vdj/imgt_mouse_TRBJ.fasta"
## [9] "/home/biocbuild/.immReferent/mouse/vdj/imgt_mouse_TRBV.fasta"
listOGRDB()
## character(0)
When you call getIMGT(), it will always load data from the cache if it’s available. If you want to only load from the cache and prevent any possibility of a download, you can use loadIMGT(). This function is useful in offline environments or for ensuring strict reproducibility.
# This will load from the cache if available, or download otherwise
ighv_nuc <- getIMGT(species = "human",
gene = "IGHV",
type = "NUC")
# This will load from the cache, or fail if not found and offline
ighv_nuc_from_cache <- loadIMGT(species = "human",
gene = "IGHV",
type = "NUC")
Similar to the above, we can pull and load from OGRDB using getOGRDB() and loadOGRDB().
# This will load from the cache if available, or download otherwise
igh_nuc <- getOGRDB(species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED")
# This will load from the cache, or fail if not found and offline
igh_from_cache <- loadOGRDB(species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED")
If you suspect the online data has been updated and you want to re-download it, you can use refreshIMGT() or refreshOGRDB(). This is just a convenient shortcut for getIMGT(..., refresh = TRUE).
# Force a re-download of the human IGHV sequences
ighv_nuc_fresh <- refreshIMGT(species = "human",
gene = "IGHV",
type = "NUC")
# Force a re-download of human IGK (gapped FASTA)
igk_fresh <- refreshOGRDB(species = "human",
locus = "IGK",
type = "NUC",
format = "FASTA_GAPPED")
This has been a general overview of the capabilities of immReferent for downloading and caching immune receptor and HLA sequences from IMGT and OGRDB. If you have any questions, comments, or suggestions, feel free to visit the GitHub repository.
sessionInfo()
## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] immReferent_0.99.5 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 selectr_0.5-0 compiler_4.6.0
## [4] BiocManager_1.30.27 crayon_1.5.3 promises_1.5.0
## [7] Rcpp_1.1.0 stringr_1.6.0 xml2_1.5.0
## [10] Biostrings_2.79.2 later_1.4.4 jquerylib_0.1.4
## [13] IRanges_2.45.0 Seqinfo_1.1.0 yaml_2.3.10
## [16] fastmap_1.2.0 R6_2.6.1 XVector_0.51.0
## [19] generics_0.1.4 curl_7.0.0 knitr_1.50
## [22] BiocGenerics_0.57.0 tibble_3.3.0 bookdown_0.45
## [25] pillar_1.11.1 bslib_0.9.0 rlang_1.1.6
## [28] websocket_1.4.4 stringi_1.8.7 cachem_1.1.0
## [31] xfun_0.54 sass_0.4.10 otel_0.2.0
## [34] cli_3.6.5 magrittr_2.0.4 ps_1.9.1
## [37] digest_0.6.39 rvest_1.0.5 processx_3.8.6
## [40] lifecycle_1.0.4 chromote_0.5.1 vctrs_0.6.5
## [43] S4Vectors_0.49.0 evaluate_1.0.5 glue_1.8.0
## [46] stats4_4.6.0 rmarkdown_2.30 httr_1.4.7
## [49] pkgconfig_2.0.3 tools_4.6.0 htmltools_0.5.8.1