---
title: "User guide: `immunotation` package"
author: 
- name: Katharina Imkeller
  affiliation: European Molecular Biology Laboratory, Heidelberg
  email: imkeller@embl.de
package: immunotation
date: "`r Sys.Date()`"
output:  
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{User guide immunotation}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignettePackage{immunotation-vignette}
    %\VignetteEncoding{UTF-8}
---

```{r, message=FALSE}
library(immunotation)
```

# Abstract

MHC (major histocompatibility complex) molecules are cell surface complexes that 
present antigens to T cells.  In humans they are encoded by the highly variable 
HLA (human leukocyte antigen) genes of the hyperpolymorphic HLA locus. 
More than 28,000 different HLA alleles have 
been reported, with significant differences in allele frequencies between human 
populations worldwide. 

The package immunotation provides:

* **Conversion and nomenclature functions** for consistent annotation of HLA 
genes in typical immunoinformatics workflows such as for example the prediction 
of MHC-presented peptides in different human donors. Supported naming schemes
include HLA alleles, serotypes, G and P groups, MACs, ...

* **Automated access to the Allele Frequency Net Database** (AFND) and 
visualization of HLA allele frequencies in human populations worldwide.


# Introduction

## MHC molecules

MHC (major histocompatibility complex) molecules are a family of diverse cell 
surface complexes that present antigens to T cells. MHC molecules are divided 
into three classes (MHC class I, MHC class II, and non-classical MHC), 
which differ in their protein subunit composition and the types of receptors 
they can interact with. MHC class I molecules for example consist of one 
polymorphic $\alpha$-chain and one invariant $\beta$-chain and present peptide 
antigens to T cells that express the MHC-I specific co-receptor CD8. 
MHC class II molecules are typically composed of one $\alpha$- and one 
$\beta$-chain, which are both polymorphic. MHC class II molecules present 
peptide antigens to T cells that express the MHC-II specific co-receptor CD4. 

The repertoire of peptide antigens presented on MHC molecules depends on the 
sequence of the genes encoded in the MHC locus of an individual. Since the 
adaptive immune response to an invading pathogen relies on MHC-dependent 
antigen presentation, a high diversity of MHC genes on a population level is 
beneficial from an evolutionary point of view. Indeed, MHC molecules are 
polygenic, which means that the MHC locus contains several different genes 
encoding MHC class I and MHC class II molecules. Moreover, MHC genes are 
polymorphic, which means that on a population level, multiple 
variants (alleles) of each gene exist.

Several experimental techniques exist to identify the different MHC genes, 
alleles and protein complexes. Protein complexes for example can be classified 
into serotypes by binding of subtype-specific anti-MHC antibodies. The resulting 
information on the protein complex is called the MHC serotype. Moreover, MHC 
genes and alleles can 
be identified by hybridization with sequence-specific probes or by sequencing 
and mapping to reference databases. However, these techniques often  cover only 
specific regions of the MHC genes and thus do not allow a complete and 
unambiguous allele identification.

The [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) [1] and 
[IPD-MHC](https://www.ebi.ac.uk/ipd/mhc/) [2] databases provide a reference of 
all known MHC genes and alleles in different species. A systematic 
classification of MHC genes and proteins is provided in the 
[MHC restriction ontology (MRO)](https://github.com/IEDB/MRO) [3]. 
The annotation functions in the `immunotation` package use the classification 
scheme provided by the MRO.

## Hyperpolymorphic HLA genes in human populations

In humans, MHC molecules are encoded by highly variable HLA 
(human leukocyte antigen) genes of the hyperpolymorphic HLA locus on 
chromosome 6. To date, more than 28.000 different alleles have been 
registered in the [IPD-IMGT/HLA](http://hla.alleles.org/nomenclature/index.html) 
database [1].

### Nomenclature of HLA genes

HLA genes and alleles are named according to rules defined by the 
[WHO Nomenclature Committee for Factors of the HLA System](http://hla.alleles.org/nomenclature/naming.html). 
The following scheme depicts the components of a complete HLA allele name. 
The different components of the name are separated by ":".

**HLA-(gene)\*(group):(protein):(coding region):(non-coding region)(suffix)**

| Term        | Description          | Example  |
| ----------- |:-----------------:|:-----------------:|
| gene        | HLA gene | A, B, C, DPA1, DPB1, ... |
| group       | group of HLA alleles with similar protein sequence (protein sequence homology) |  01 |
| protein     | all HLA alleles with the same protein sequence  | 01 |
| coding region     | all HLA alleles with the same DNA sequence in the coding region | 01 |
| non-coding region     | all HLA alleles with the same DNA sequence in the non-coding region | 01 |
| suffix    | indicates changes in expression level (e.g. N - not expressed, L - low surface expression)  | N |

For example:

* *HLA-A\*01:01* and *HLA-A\*01:02* - have a similar but not identical protein 
sequence. Note in this example, that not all components listed above need to be
indicated in a given HLA name. *HLA-A\*01:01* includes all HLA alleles with
same protein sequence, but potentially different DNA sequence.
* *HLA-A\*01:01:01* and *HLA-A\*01:01:02* - have the same protein sequence but 
slightly different DNA sequences in the coding region 

Note: In a deprecated naming scheme used before 2010, the components of the 
naming scheme were not separated by ":".

### Protein and gene groups

G and P groups is another naming concept, that is frequently used to groups 
of HLA alleles encoding functionally similar proteins. The concept of *gene* 
and *protein* int the G and P groups is independent from the naming
components concerning gene and protein which were mentioned in section
1.2.1.
 
[**G groups**](http://hla.alleles.org/alleles/g_groups.html) are groups of 
HLA alleles that have identical **nucleotide** sequences across the exons 
encoding the peptide-binding domains.

[**P groups**](http://hla.alleles.org/alleles/p_groups.html) are groups of 
HLA alleles that have identical **protein** sequences in the peptide-binding 
domains.

### MAC (Multiple allele codes)

The National Marrow Donor Program (NMDP) uses, 
[multiple allele codes (MAC)](https://bioinformatics.bethematchclinical.org/hla-resources/allele-codes/allele-code-lists/) 
to facilitate the reporting and comparison of HLA alleles [4]. MACs 
consist of the *gene*:*group* component of the
classical HLA naming scheme in section 1.2.1 and a 
letter code (e.g. **A\*01:ATJNV**).
MACs represent groups of HLA alleles. They are useful when the HLA typing is 
ambiguous and does not allow to narrow down one single allele from a list of 
alleles. The `immunotation` packages provides automated access to the 
[MAC conversion tools](https://hml.nmdp.org/MacUI/) provided by NMDP.


## Variation of HLA alleles across human populations

The frequencies of individual HLA alleles varies substantially between worldwide 
human populations.
The [Allele Frequency Net Database (AFND)](http://www.allelefrequencies.net/) 
is a repository for immune gene frequencies in different populations 
worldwide [5]. In addition to a large collection of HLA allele frequency 
datasets, the database also contains datasets for allele frequencies of KIR 
(Killer Cell Immunoglobulin-like Receptor) genes, MIC (Major histocompatibility 
complex class I chain related) and cytokine genes. The current version of the 
`immunotation` package allows automated R access to the HLA related datasets 
in AFND.

The HLA frequency datasets in AFND are classified according to the following 
standards:

| Criteria        | Gold standard           | Silver standard  | Bronze standard |
| ----------- |:-----------------:|:-----------------:|:----------:|
| Allele frequency      | sum to 1 ± 0.015 | sum to 1 ± 0.015 | do not sum to 1 |
| Sample size     | >= 50 individuals      |   any | any |
| Resolution of allele frequency | four or more digits      | two or more digits | other |

# Scope of the package

The `immunotation` package provides tools for consistent annotation of HLA
alleles and protein complexes. The package currently has two main 
functional modules: 

*1. Conversion and nomenclature functions:*

* access to annotations of MHC loci, genes, protein complexes and serotypes in 
different species
* conversion between different levels of HLA annotation
* conversion between input and output of different immunoinformatics tools
* mapping between allele notation and G- or P-groups
* mapping between HLA allele groups and MAC notation

*2. Access to HLA allele frequencies:*

* automated access to datasets stored in the Allele Frequency Net Database
([AFND](http://www.allelefrequencies.net/))
* querying and visualization of HLA allele frequencies in human populations 
worldwide

## Installation

Install the `immunotation` package by using BiocManager.

```{r, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("immunotation")
```

# HLA genes, alleles, protein complexes and serotypes

## Accessing information from MRO

The MRO provides consistent annotation of MHC genes and molecules in the 
following species. Please note that the information for grass carp, clawed frog, 
cotton-top tamarin, giant panda, sheep and marmoset is limited and might lead to 
unexpected function behavior (such as return empty table).

```{r}
get_valid_organisms()
```

The `retrieve_lookup_table` function allows to build a lookup table of all 
annotated chains in a given species. The table specifies the locus, the gene and 
the chain name.

```{r}
df <- retrieve_chain_lookup_table(organism = "human")

DT::datatable(head(df, n=30))
```

The list of annotated human protein complexes is:

```{r}
DT::datatable(head(human_protein_complex_table, n=30))
```

The `get_serotype` function can be used to query the serotype of encoded protein 
complexes for a given HLA genotype. The allele lists represent the MHC class I 
and MHC class II genotype of an exemplary donor.

```{r}
allele_list1 <- c("A*01:01:01", "A*02:01:01",
                  "B*39:01:01", "B*07:02:01", 
                  "C*08:01:01", "C*01:02:01")
allele_list2 <- c("DPA1*01:03:01", "DPA1*01:04:01",
                  "DPB1*14:01:01", "DPB1*02:01:02",
                  "DQA1*02:01:01", "DQA1*05:03",
                  "DQB1*02:02:01", "DQB1*06:09:01",
                  "DRA*01:01", "DRB1*10:01:01", "DRB1*14:02:01")
```

Retrieve the serotype of MHC class I molecules: 

```{r}
get_serotypes(allele_list1, mhc_type = "MHC-I")
```
Retrieve the serotype of MHC class II molecules:
(In the current version of `immunotation` serotypes are only returned when the 
complete molecule ($\alpha$- and $\beta$- chain) is annotated in MRO.)

```{r}
get_serotypes(allele_list2, mhc_type = "MHC-II")
```

## Functions for mapping between different naming schemes

You can directly obtain a protein complex format that is suitable for input to 
NetMHCpan and NetMHCIIpan using the `get_mhc_pan` function.
```{r}
get_mhcpan_input(allele_list1, mhc_class = "MHC-I")
```

```{r}
get_mhcpan_input(allele_list2, mhc_class = "MHC-II")
```

## Retrieving G and P groups

For every allele in the list return the corresponding G group. If the allele is 
not part of a G group, the original allele name is returned.

```{r}
get_G_group(allele_list2)
```

For every allele in the list return the corresponding P group. If the allele is 
not part of a P group, the original allele name is returned.

```{r}
get_P_group(allele_list1)
```

## Encoding and decoding MAC

Encode a list of alleles into MAC using the `encode_MAC` function.

```{r}
allele_list3 <- c("A*01:01:01", "A*02:01:01", "A*03:01")
encode_MAC(allele_list3)
```
Decode a MAC into a list of alleles using the `decode_MAC` function.

```{r}
MAC1 <- "A*01:AYMG"
decode_MAC(MAC1)
```

# Functions related to worldwide population frequencies

## allele frequencies (given allele in several populations)

**Example 1:** Querying the frequency of allele A*02:01 in all populations with 
more than 10,000 individuals and fulfilling the gold standard.

```{r}
sel1 <- query_allele_frequencies(hla_selection = "A*02:01", 
                                hla_sample_size_pattern = "bigger_than", 
                                hla_sample_size = 10000, 
                                standard="g")

DT::datatable(sel1)
```

```{r}
sel1b <- query_allele_frequencies(hla_selection = "A*01:01", 
                                hla_ethnic = "Asian")

DT::datatable(sel1b)
```

```{r}
hla_selection <- build_allele_group("A*01:02")

sel1 <- query_allele_frequencies(hla_selection = hla_selection, 
                                hla_sample_size_pattern = "bigger_than", 
                                hla_sample_size = 200, 
                                standard="g")
```

**Example 2:** Querying haplotype frequency

```{r}
haplotype_alleles <- c("A*02:01", "B*", "C*")
df <- query_haplotype_frequencies(hla_selection = haplotype_alleles,
                            hla_region = "Europe")

DT::datatable(df)
```


**Example 3:** Querying the frequencies of alleles within the HLA-B locus for 
population "Peru Lamas City Lama" with population_id 1986. If you are not sure 
which population_id you should use, we recommend to search your population of 
interest in the *Population* section of the Allele Frequency Net Database and 
extract the population id from the url (last 4 digits).

```{r}
sel2 <- query_allele_frequencies(hla_locus = "B", hla_population = 1986)

DT::datatable(sel2)
```
```{r}
sel2a <- query_allele_frequencies(hla_locus = "B", hla_population = 3089)

DT::datatable(sel2a)
```

Use the `plot_allele_frequency` function to visualize the frequencies of a given 
allele on a world map. The input of `plot_allele_frequency` is a data table that 
can be produced using the `query_allele_frequencies` function.

```{r}
plot_allele_frequency(sel1)
```

# Querying population metainformation

**Example 3:** Query the metainformation concerning population "Peru Lamas City 
Lama" (population_id 1986). The webpage concerning the queried information for 
population Peru Lamas City Lama (1986) can be found here: 
http://www.allelefrequencies.net/pop6001c.asp?pop_id=1986

```{r}
sel3 <- query_population_detail(1986)

DT::datatable(sel3, options = list(scrollX = TRUE))
```

**Example 4:** Query the metainformation concerning the populations that were 
listed in the table returned by Example 1

```{r}
sel4 <- query_population_detail(as.numeric(sel1$population_id))

# only select the first 5 columns to display in table
DT::datatable(sel4[1:5], options = list(scrollX = TRUE))
```

# References 

[1] Robinson J, Barker DJ, Georgiou X et al. IPD-IMGT/HLA Database. Nucleic 
Acids Research (2020)

[2] Maccari G, Robinson J, Ballingall K et al. IPD-MHC 2.0: an improved 
inter-species database for the study of the major histocompatibility complex. 
Nucleic Acids Research (2017)

[3] Vita R, Overton JA, Seymour E et al. An ontology for major 
histocompatibility restriction. J Biomed Semant (2016).

[4] Milius RP, Mack SJ, Hollenbach JA et al. Genotype List String: a grammar for 
describing HLA and KIR genotyping results in a text string. 
Tissue Antigens (2013). 

[5] Gonzalez-Galarza FF, McCabe A, Santos EJ at al. Allele frequency net 
database (AFND) 2020 update: gold-standard data classification, open access 
genotype data and new query tools. Nucleic Acid Research (2020).

# Session Information

```{r}
sessionInfo()
```