William Lees 2024-11-02
ogrdbstats is an R package that can be used to create an analysis of gene usage in a adaptive immune receptor sequencing repertoire. The analysis consists of usage statistics and plots. The package is intended to be used in conjunction with a tool that infers a ‘personalised genotype’. Currently the following tools are supported:
Ogrdbstats requires a recent installation of Pandoc. If Rstudio is installed on your machine, pandoc will already be installed. Otherwise please follow the installation instructions on the Pandoc website.
Once Pandoc is installed, please install ogrdbstats from CRAN:
Alternatively, you can install the latest development version from Github:
The package requires R version 4.2.2 or above.
It’s easiest to use the package from the command line. To do this, download ogrdbstats.R and copy to the directory in which you wish to conduct the analysis.
Command Syntax (see following section for detailed description of file formats)
Positional Arguments:
REF_FILE
- pathname of a FASTA file containing IMGT
gap-aligned reference germline sequences. Usually this would be downloaded
from IMGT.
SPECIES
- should contain the species name used in field
3 of the IMGT REF_FILE FASTA header, with spaces removed,
e.g. Homosapiens for Human. If you are not using an IMGT REF_FILE, you
can use any single word for the species here, and the reference file
should only contain genes for that species.
READ_FILE
- pathname of a tab-separated file containing
the annotated reads used to infer the genotype, in MiAIRR, CHANGEO or
IgDiscover format
CHAIN
specifies the sequence type to be analysed. It
must be one of VH, VK, VL, D, JH, JK, JL
.
Optional Arguments:
INF_FILE
- pathname of a FASTA file containing sequences
of inferred novel alleles. This file must be provided if the read file
contains assignments to alleles that are not listed in the reference
file.
HAP_GENE
- the gene to be used for haplotyping analysis
(see haplotyping section below)
Detailed descriptions of the required input files are given in the next section, but for quick usage with a supported tool, please skip to the Usage Notes for that tool towards the end of this document.
The package provides functions to allow you to create the reports programmatically, or use the information it reads from input files for your own purposes. Please refer to the ‘ogrdbstats API’ section for a brief overview.
The format will be determined automatically by the script.
MiAIRR format files must contain at least the following columns:
sequence_id, v_call_genotyped, d_call, j_call, sequence_alignment, cdr3
.
For J or D inferences they must also contain
J_sequence_start
, J_sequence_end
,
J_germline_start
, J_germline_end
, or the
equivalent fields for D genes. IgBLAST’s
--format airr
creates compatible MiAIRR format
files.
CHANGEO files must contain at least the following columns:
SEQUENCE_ID, V_CALL_GENOTYPED, D_CALL, J_CALL, SEQUENCE_IMGT, CDR3_IMGT
,
V_MUT_NC
, D_MUT_NC
, J_MUT_NC
,
SEQUENCE
, JUNCTION_START
, V_SEQ
,
D_SEQ
, J_SEQ
. If you would like to process
files from IMGT V-Quest, please parse
them with CHANGEO to convert them to CHANGEO format.
D- related fields are only required for heavy chain records.
In both the above file formats,
v_call_genotyped/V_CALL_GENOTYPED
should contain the V
calls made after the subject’s V-gene genotype has been inferred
(including calls of the novel alleles). Sequences may be either unagpped
or IMGT gap-aligned. Determining the personalised V-gene genotype is
recommended when processing D or J gene inferences, so that V-gene usage
counts are accurate. However, this step can be omitted for D or J gene
processing by providing a V_CALL field instead of
V_CALL_GENOTYPED.
For IgDiscover, the file ‘final/filtered.tab’ should be used - see section on IgDiscover below.
<READ_FILE>_ogrdb_report.csv
- the Genotype File
ready to be uploaded to OGRDB.<READ_FILE>_ogrdb_plots.csv
- plots (see next
section for details).READ_FILE
is used as a prefix to the output file names.
They will be written to the directory containing the read file.
If you are submitting inferences to OGRDB, you will be prompted to upload the genotype file. Please also upload the plots file as an attachment to the Notes section of your submission.
The script produces the following plots:
The nucleotide usage plots are not produced from IgDiscover output, as aligned V-sequences are not available.
The script should first be run without the optional
HAP_GENE
parameter. If, having consulted the plots, you
identify a suitable gene for haplotyping, please run the script again,
with this gene specified as HAP_GENE
. The haplotyping_gene
and haplotyping_ratio columns of the genotype file will be appropriately
populated. A J-gene should be used with V- and D- gene inferences, and a
V-gene with J-gene inferences.
To download the IMGT reference file and complete an analysis using example files, run the following commands:
wget -O IMGT_REF_GAPPED.fasta https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP
Rscript ogrdbstats.R --inf_file TWO01A_naive_novel_ungapped.fasta --hap_gene IGHJ6 IMGT_REF_GAPPED.fasta Homosapiens TWO01A_naive_genotyped.tsv VH
Usage notes are indicative only and are not intended to discount other approaches. Notes for other tools will follow.
To conduct a V-gene analysis with TIgGER:
findNovelAlleles
to identify novel alleles in a
Change-O-formatted data set. Write these to a FASTA file.inferGenotype
or inferGenotypeBayesian
to infer the genotype.reassignAlleles
to correct allele calls in the data
set, based on the inferred genotypeogrdbstats
.Note that inferGenotype
will not necessarily include
every inferred allele produced by findNovelAlleles
in the
genotype that it produces. Only those alleles included in the genotype
will be considered by genotype_statistics.R
because,
leaving other considerations aside, no sequences are assigned to other
alleles.
TIgGER provides additonal information, including its own plots and statistics We encourage you to take these into consideration, and to upload them as attachments to your submission if they are informative.
Following an IgDiscover run, please copy ogrdbstats.R to IgDiscover’s
final
directory. The commands below can then be used to
download the IMGT reference file and run a VH gene analysis. All
commands should be run in the final
directory.
$ wget -O IMGT_REF_GAPPED.fasta https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP
$ unzip final.tab.gz
$ Rscript ogrdbstats.R --inf_file database/V.fasta IMGT_REF_GAPPED.fasta Homosapiens filtered.tab VH
alternatively, to produce a JH gene analysis:
$ Rscript ogrdbstats.R --inf_file database/J.fasta IMGT_REF_GAPPED.fasta Homosapiens filtered.tab JH
The information required by generate_statstics.R is split between partis’s normal yaml output and that provided by the ‘presto-output’ mode. A python script, convert_partis.py, is provided. This will combine output from partis’s yaml and presto annotations, producing CHANGEO format annotations and a FASTA file of genotype V-sequences. These files can then passed to generate_statistics.R. convert_partis.py is written in python 2.7 for compatibility with partis, and can be run from the command line in the same virtual environment.
Usage of convert_partis.py:
python convert_partis.py [-h] partis_yaml partis_tsv ogrdb_recs ogrdb_vs
positional arguments:
partis_yaml .yaml file created by partis
partis_tsv .tsv file created by partis presto-output mode
ogrdb_recs annotation output file (.tsv)
ogrdb_vs v_gene sequences (.fasta)
optional arguments:
-h, --help show this help message and exit
Although partis must be run twice - once without the presto-output option, and once with it - it will use cached information provided other parameters remain the same, so that the overall impact on run time is low. Typical processing steps are shown below. Note that –presto-output requires an IMGT-gapped V-gene germline file. This can be extracted from the full germline library downloaded from IMGT (see ‘Prerequisites’ above), but partis will report as an error any duplicated identical sequences in the library: duplicates must be removed from the file before processing will complete successfully. Note in the examples below the –extra-annotations option. The CDR3 is required by generate_statistics.R: the other fields are included for reference.
# Run partis to produce annotations in YAML format
partis annotate --extra-annotation-columns cdr3_seqs:invalid:in_frames:stops --infname TW01A.fasta --outfname TW01A.yaml --n-procs 5
# Run partis again with additional --presto-output option. This will produce TSV-formatted output from cached data
partis annotate --extra-annotation-columns cdr3_seqs:invalid:in_frames:stops --infname TW01A.fasta --outfname TW01A.tsv --presto-output \
--aligned-germline-fname IMGT_REF_GAPPED_DEDUPED.fasta --n-procs 5
# Extract and merge required information from YAML and TSV files
python convert_partis.py TW01A.yaml TW01A.tsv TW01A_OGRDB.tsv TW01A_V_OGRDB.fasta
# Process the resulting output to produce the genotye file and plots
Rscript ogrdbstats.R --inf_file TW01A_V_OGRDB.fasta IMGT_REF_GAPPED.fasta Homosapiens TW01A_OGRDB.tsv VH
IMPre does not provide a set of sequences annotated with the novel allele calls. The sequences must be annotated by a separate tool in order to provide the information needed for the OGRDB genotype. One possible approach is as follows:
generate_ogrdb_report()
takes equivalent arguments to
ogrdbstats.R and generates the genotype file and plots.
read_input_files()
reads and parses the input files. It
returns a representation of the genotype file in memory, as well as a
number of other structures containing related information.
make_barplot_grobs()
,
make_novel_base_grobs()
and make_haplo_grobs()
create the plots: each function returns a list of grobs that can be used
as you wish, or passed to write_plot_file()
to create a the
standard file of plots.
The use of the grob functions is demonstrated below with example data included in the package.
library(ogrdbstats)
reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats")
inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats")
repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats")
rd = suppressMessages(
read_input_files(reference_set, inferred_set, 'Homosapiens', repertoire, 'IGHV', NA, 'V', 'H', FALSE)
)
barplot_grobs = make_barplot_grobs(rd$input_sequences, rd$genotype_db, rd$inferred_seqs,
rd$genotype, 'V', rd$calculated_NC)
base_grobs = make_novel_base_grobs(rd$inferred_seqs, rd$input_sequences, 'V', FALSE)
gridExtra::grid.arrange(grobs=list(barplot_grobs[2][[1]], base_grobs$end[1][[1]],
base_grobs$conc[1][[1]]),ncol=1)
Please use help(package=“ogrdbstats”) or ? to find function-level documentation within R.
Some functions are adapted from TIgGER with thanks to the authors.
The example annotated reads and inferences linked from this description are taken from the data of Rubelt et al and were downloaded from VDJServer. The genotype was inferred by TIgGER. A small number of light-chain records were removed from the data set.