The first step in using any software is to know what its inputs and outputs are. Before using geneHapR, the user should prepare several data sets.
Genotypes of each individual/accession, in variant call format (VCF), PLINK (ped/map) format, table format or FASTA format, are required for haplotype analysis. Annotations stored in GFF/GFF3 or BED4/BED6 format are needed to visualize variants on a gene model or to filter variants according to annotations. Phenotype data are needed to compare phenotypes between haplotypes. Longitude and latitude information is needed to display the geographic distribution of haplotypes. Accession group information is optional and is only needed when plotting a haplonet. These data formats are detailed below:
VCF: An introduction to this format can be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/vcf-format/
FASTA: An introduction to this format can be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/gff3-format/.
PLINK (ped/map): The fields in a MAP file are: chromosome; marker ID; genetic distance; physical position. For example:
Chr1 rs11511647 0 26765
Chr1 rs3883674 0 32380
Chr1 rs12218882 0 48172
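As a rough illustration of the MAP layout described above, the following Python sketch splits each line into its four fields. The names `MapRecord` and `parse_map_line` are hypothetical helpers invented for this example; they are not part of geneHapR.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    """One marker from a MAP file (hypothetical container for illustration)."""
    chrom: str
    marker_id: str
    genetic_distance: float
    position: int


def parse_map_line(line: str) -> MapRecord:
    """Split one whitespace-separated MAP line into its four fields."""
    chrom, marker_id, distance, position = line.split()
    return MapRecord(chrom, marker_id, float(distance), int(position))


# The three example markers shown above.
map_text = """\
Chr1 rs11511647 0 26765
Chr1 rs3883674 0 32380
Chr1 rs12218882 0 48172
"""

markers = [parse_map_line(line) for line in map_text.splitlines() if line.strip()]
for marker in markers:
    print(marker.chrom, marker.marker_id, marker.position)
```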
The fields in a PED file are: family ID; sample ID; paternal ID; maternal ID; sex (1=male; 2=female; other=unknown); affection (0=unknown; 1=unaffected; 2=affected); and genotypes (space or tab separated, two alleles for each marker; 0=missing). For example:
NA06985 NA06985 0 0 1 1 A T T T G G
NA06991 NA06991 0 0 1 1 C T T T G G
NA06993 NA06993 0 0 1 1 C T T T G G
NA06994 NA06994 0 0 1 1 C T T T G G
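To make the PED layout concrete, here is a minimal Python sketch that separates the six leading fields from the genotype columns and pairs the alleles up per marker. `parse_ped_line` is a hypothetical helper written for this example, and representing a missing genotype as `None` is an assumption of the sketch, not something the package prescribes.

```python
def parse_ped_line(line: str) -> dict:
    """Parse one PED line into sample metadata and per-marker allele pairs."""
    fields = line.split()
    family_id, sample_id, paternal_id, maternal_id, sex, affection = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("PED genotypes must come in pairs (two alleles per marker)")

    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = (alleles[i], alleles[i + 1])
        # "0" marks a missing allele; store the whole genotype as None (sketch choice).
        genotypes.append(None if "0" in pair else pair)

    return {
        "family_id": family_id,
        "sample_id": sample_id,
        "paternal_id": paternal_id,
        "maternal_id": maternal_id,
        "sex": sex,
        "affection": affection,
        "genotypes": genotypes,
    }


# First example line from above: three markers, so six allele columns.
sample = parse_ped_line("NA06985 NA06985 0 0 1 1 A T T T G G")
print(sample["sample_id"], sample["genotypes"])
```

The number of genotype pairs should equal the number of lines in the matching MAP file, which is a cheap consistency check before any analysis.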
Table: The first five columns are fixed as chromosome name, position, reference nucleotide, alternative nucleotide and additional information. Accession genotypes should be in the following columns. “-” is treated as an indel; “.” and “N” are treated as missing data. Additional information should be in the format “tag=value”. Heterozygous sites should be written as “A/G” or “A|G”. For example:
CHROM POS REF ALT INFO Ac1 Ac2 Ac3
Chr1 108 A T aa=A23G T A A
Chr1 309 A C aa=STOP C C A
Chr1 563 GT T aa=SHIFT T GT GT
Chr1 949 C A aa=S88A A C/A C
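The conventions for the table format can be sketched in Python as follows. This is only an illustration of the rules stated above: `classify_genotype`, `parse_info` and `parse_table_row` are hypothetical helpers, and the assumption that several “tag=value” items would be joined by semicolons is made for this sketch.

```python
def classify_genotype(call: str) -> str:
    """Label a genotype cell according to the table-format conventions."""
    if call in {".", "N"}:
        return "missing"
    if call == "-":
        return "indel"
    if "/" in call or "|" in call:
        alleles = call.replace("|", "/").split("/")
        return "heterozygous" if len(set(alleles)) > 1 else "homozygous"
    return "homozygous"


def parse_info(info: str) -> dict:
    """Turn 'tag=value' items into a dict (semicolon separator is an assumption)."""
    return dict(item.split("=", 1) for item in info.split(";") if item)


def parse_table_row(line: str, accessions: list) -> dict:
    """Split one data row into the five fixed columns plus per-accession calls."""
    chrom, pos, ref, alt, info, *calls = line.split()
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "info": parse_info(info),
        "calls": dict(zip(accessions, calls)),
    }


header = "CHROM POS REF ALT INFO Ac1 Ac2 Ac3".split()
accession_names = header[5:]
site = parse_table_row("Chr1 949 C A aa=S88A A C/A C", accession_names)
labels = {acc: classify_genotype(call) for acc, call in site["calls"].items()}
print(site["info"], labels)
```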
GFF/GFF3: An introduction to this format can be found at https://learn.gencore.bio.nyu.edu/ngs-file-formats/gff3-format/
BED4/BED6: As defined by UCSC. BED6 contains six columns: 1) chromosome name, 2) chromosome start, 3) chromosome end, 4) name, 5) score and 6) strand. BED4 contains the first four columns. NOTE: the fourth column is used to define the transcript name and feature type, separated by a space, e.g. “HD1.1 CDS” or “HD1.1 UTR”.
For example:
Chr8 678 890 HD1.1 CDS . -
Chr8 891 989 HD1.1 UTR . -
Chr8 668 759 HD1.2 CDS . -
Chr8 908 989 HD1.2 CDS . -
This example describes a small gene named HD1 with two transcripts, HD1.1 and HD1.2. HD1.1 has one CDS and one UTR region, while HD1.2 has two CDS regions.
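The way the name column encodes both transcript and feature type can be illustrated with a short Python sketch that groups the example records by transcript. `parse_bed_line` is a hypothetical helper. Because the name column itself contains a space, splitting on any whitespace gives seven tokens for a BED6 line and five for a BED4 line, which is what this sketch relies on.

```python
from collections import defaultdict


def parse_bed_line(line: str) -> dict:
    """Parse a BED4/BED6 line whose name column is '<transcript> <type>'."""
    tokens = line.split()
    chrom, start, end, transcript, feature_type = tokens[:5]
    # Score and strand are only present in BED6; default to "." otherwise.
    score, strand = tokens[5:7] if len(tokens) >= 7 else (".", ".")
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "transcript": transcript,
        "type": feature_type,
        "score": score,
        "strand": strand,
    }


# The four example records from above.
bed_text = """\
Chr8 678 890 HD1.1 CDS . -
Chr8 891 989 HD1.1 UTR . -
Chr8 668 759 HD1.2 CDS . -
Chr8 908 989 HD1.2 CDS . -
"""

features_by_transcript = defaultdict(list)
for bed_line in bed_text.splitlines():
    if bed_line.strip():
        record = parse_bed_line(bed_line)
        features_by_transcript[record["transcript"]].append(
            (record["type"], record["start"], record["end"])
        )

for transcript, features in features_by_transcript.items():
    print(transcript, features)
```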
Phenotype data and accession information: Phenotype data and accession information, e.g. group information and geo-coordinates, should be stored in a tab-delimited table. The first column holds the names of the accessions/individuals, and the phenotypes and other information are in the following columns.
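A minimal Python sketch of such a table and how it might be read is given below. The column names (`Accession`, `Height`, `Group`, `Longitude`, `Latitude`) and all values are made up for illustration; only the layout (tab-delimited, accession names in the first column) follows the description above.

```python
import csv
import io

# Toy tab-delimited table; column names and values are hypothetical.
acc_text = (
    "Accession\tHeight\tGroup\tLongitude\tLatitude\n"
    "Ac1\t101.5\tgroupA\t116.4\t39.9\n"
    "Ac2\t95.0\tgroupB\t121.5\t31.2\n"
    "Ac3\t110.2\tgroupA\t113.3\t23.1\n"
)

reader = csv.DictReader(io.StringIO(acc_text), delimiter="\t")
accession_column = reader.fieldnames[0]  # the first column names the accessions

# Index every row by accession name; the remaining columns are kept as strings.
accession_info = {
    row[accession_column]: {
        key: value for key, value in row.items() if key != accession_column
    }
    for row in reader
}

print(accession_info["Ac2"])
```

Keeping everything keyed by accession name makes it straightforward to join phenotypes, groups and coordinates onto haplotype assignments later.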
A VCF file (variant call format file) is imported into ‘R’ as a vcfR object.
PLINK (ped/map) files are imported into ‘R’ as a list object.
GFF/GFF3 and BED4/BED6 files (genome annotations) are imported into ‘R’ as GRanges objects.
DNA sequences (FASTA format) are imported into ‘R’ as a DNAStringSet object.
Genotype data stored in a table, phenotype data and accession group information are imported into ‘R’ as data.frame objects.
The main results, hapResult and hapSummary, can be exported as tab-delimited tables, and visualizations can be exported as image files or PDF files.
hapResult and hapSummary
hapResult and hapSummary are each effectively a matrix that can be divided into three parts, with some additional attributes.
Part I: consists of a single column that indicates the content type of each row. The first four rows are fixed as CHROM, POS, INFO and ALLELE. Further annotations are stored in the fields of INFO, with fields separated by semicolons (;). The following rows hold the names of the haplotypes.
Part II: consists of at least one column. Each column represents a site. The first four elements of each column contain the information and annotations of that site, and the following elements give the genotype of the corresponding haplotype.
Part III: In hapResult, part III consists of one column named Accession, while in hapSummary it consists of two columns named Accession and freq.
hapResult and hapSummary differ only in part III: (a) hapSummary has a freq column while hapResult does not; (b) in hapSummary, multiple accessions are listed in one row separated by semicolons, while in hapResult each row holds a single accession.
Cartoon representation of hapResult and hapSummary contents
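The relationship between the two objects can be sketched in Python with a toy stand-in. This is not the package's internal representation: the tuple layout, the haplotype names (`H001` etc.), the accession `Ac4`, the helper `summarise` and the reading of freq as the number of accessions per haplotype are all assumptions made for illustration.

```python
# Header rows (parts I and II): CHROM, POS, INFO and ALLELE for two sites.
site_rows = [
    ["CHROM", "Chr1", "Chr1"],
    ["POS", "108", "309"],
    ["INFO", "aa=A23G", "aa=STOP"],
    ["ALLELE", "A/T", "A/C"],  # allele notation assumed for this sketch
]

# hapResult-like rows: haplotype name, genotype at each site, ONE accession.
hap_result_rows = [
    ("H001", ("T", "C"), "Ac1"),
    ("H002", ("A", "C"), "Ac2"),
    ("H003", ("A", "A"), "Ac3"),
    ("H002", ("A", "C"), "Ac4"),
]


def summarise(rows):
    """Collapse per-accession rows into one row per haplotype.

    Accessions sharing a haplotype are joined with semicolons and counted,
    mirroring the Accession and freq columns of part III in hapSummary.
    """
    grouped = {}
    for hap_name, genotypes, accession in rows:
        grouped.setdefault((hap_name, genotypes), []).append(accession)
    return [
        (hap_name, genotypes, ";".join(accessions), len(accessions))
        for (hap_name, genotypes), accessions in grouped.items()
    ]


hap_summary_rows = summarise(hap_result_rows)
for summary_row in hap_summary_rows:
    print(summary_row)
```

Parts I and II are untouched by the summarising step, which matches the statement that the two objects differ only in part III.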