‘geneHapR’ data

Zhang RenLiang

2022-12-07

The first thing to using a software is to know what it inputs and outputs are.

Inputs

Before using geneHapR there were several data set should be prepared by the user.

Genotypes of each individuals/accessions in variant call format (VCF) format or p.link (ped/map) format or table format or FASTA format is necessary for haplotype analysis. Annotations stored in GFF/GFF3 or BED4/BED6 format is needed when visualization of variants on gene model or filtration of variants according annotations. Phenotype data is needed when compare phenotype differences between haplotypes. Longitude and latitude information are needed when display distribution of haplotypes. Accession group information is optionally needed when plot haplonet. Those data format are details as bellow:

VCF: Introduction of this format could be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/vcf-format/

Fasta: Introduction of this format could be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/gff3-format/.

p.link (ped/map): The fields in a MAP file are: Chromosome; Marker ID; Genetic distance; Physical position. For example.:

Chr1    rs11511647  0   26765
Chr1    rs3883674   0   32380
Chr1    rs12218882  0   48172

The field in a PED file are: Family ID; Sample ID; Paternal ID; Maternal ID; Sex (1=male; 2=female; other=unknown); Affection (0=unknown; 1=unaffected; 2=affected) and Genotypes (space or tab separated, 2 for each marker. 0=missing). For example:

NA06985 NA06985 0   0   1   1   A   T   T   T   G   G
NA06991 NA06991 0   0   1   1   C   T   T   T   G   G
NA06993 NA06993 0   0   1   1   C   T   T   T   G   G
NA06994 NA06994 0   0   1   1   C   T   T   T   G   G

Table: The first five column are fix as Chrome name, position, reference nucleotide, alter nucleotide and additional information. Accession genotype should be in followed columns. “-” will be treated as Indel. “.” and “N” will be treated as missing data. Additional information should be in format “tag=value”. Heterozygote site should be in “A/G” or “A|G” format. For example:

CHROM   POS REF ALT INFO        Ac1 Ac2 Ac3
Chr1    108 A   T   aa=A23G     T   A   A
Chr1    309 A   C   aa=STOP     C   C   A
Chr1    563 GT  T   aa=SHIFT    T   GT  GT
Chr1    949 C   A   aa=S88A     A   C/A C

GFF/GFF3: Introduction of this format could be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/gff3-format/

BED4/BED6:As the definition of UCSC. The BED6 contains 6 columns, which are 1) chromosome name, 2) chromosome start, 3) chromosome end, 4) name, 5) score and 6) strand. The BED4 format contains the first 4 column. BE NOTE THAT: the fourth column was used to definition the transcripts name and types, separated by a space, like eg.: “HD1.1 CDS” or “HD1.1 URTs”.

For example:

Chr8    678 890     HD1.1 CDS   .   -
Chr8    891 989     HD1.1 UTR   .   -
Chr8    668 759     HD1.2 CDS   .   -
Chr8    908 989     HD1.2 CDS   .   -

This example indicate a small gene named as HD1 have two transcripts, named as HD1.1 and HD1.2, separately. HD1 has a CDS and a UTR region; while HD1.2 has two CDS region.

Phenotype data and accession Information: The phenotype data and accession information, eg.: group information and geo-coordinate, should be stored in tab delimited table. First column as names of accessions/individuals and phenotype and information are lies in followed columns.

Imported data in R

VCF file (variant call format file) imported into ‘R’ as vcfR object.

P.link (ped/map) imported into ‘R’ as list object.

GFF/GFF3 and BED4/BED6 file (genome annotations) imported into ‘R’ as GRanges object.

DNA sequences (fasta format) imported into ‘R’ as DNAStringSet object.

Genotype data stored in table and Phenotype data and accession group information imported into ‘R’ as data.frame objects.

Output/results

The main results are hapResult and hapSummary could be export as tab delimited tables; and visualizations could be export as figures format or PDF files.

hapResult and hapSummary

hapResult and hapSummary are effectively a matrix, which could be divided into three parts, with some additional attributes.

Part I consists of only one column, indicates contents type of each row. The first four rows are fix to additional information as CHROM, POS, INFO and ALLELE. Further annotations are stored in fields of INFO, and each field are separated by semicolons (;). Followed rows are names of each haplotype.

Part II: consists of at least one column. Each column represents a site. The first four elements in each contents information and annotations of the current sites. And followed elements represents genotype of the corresponding haplotype.

Part III: The part III of hapResult consists of one column named as Accession, while the part III of hapSummary consists two columns named as Accession and freq.

The differences between hapResult and hapSummary only lied in part III: (a) there is a freq column in hapSummary while hapResult not; (b) multi-accessions are separated by semicolons in hapSummary while one accession in each row of hapResult.

Cartoon representation of hapResult and hapSummary contents

Cartoon representation of hapResult and hapSummary contents