Binary Dosage Formats

There are currently 4 formats for the binary dosage file.

The first three formats consist of three files, binary dosage, family, and map. The family and maps are data frames with information about the subjects and SNPs in the binary dosage file, respectively. These are data frames saved with the saveRDS command.

Format 1

Format 1 has a header that begins with a magic word and is followed by a number indicating whether it is in format 1.1 or 1.2. It is the followed by the genotype information. The total header length is 8 bytes.

Format 1.1

In format 1.1 the only value stored is the dosage. The dosage values stored are multiplied by \(2^{15} - 1\), 0x7ffe, and stored as short integers. If a value is missing it is stored as \(2^{16}\), 0xffff. Each subject needs to 2 bytes per SNP. The total size of the data section is 2 times the number of subjects times the number of SNPs bytes.

Format 1.2

In format 1.2 the only values stored are \(\Pr(g=1)\) and \(\Pr(g=2)\). These values are multiplied by \(2^{16} - 1\), 0xfffe, and stored as short integers. A value of \(2^16\), 0xffff indicates a value is missing. The total size of the data section is 4 times the number of subjects times the number of SNPs bytes.

Format 2

Format 2 has the same header as format 1.

Format 2.1

The format of the data section is same as format 1.1 except the dosage values are multiplied by 20,000, 0x4e20. The missing value is still \(2^{16}\), 0xffff.

Format 2.2

The format of the data section is same as format 1.1 except the dosage values are multiplied by 20,000, 0x4e20. The missing value is still \(2^{16}\), 0xffff.

Note: Format 2 was adopted when it was discovered that the values returned from the imputation programs were limited to 3 or 4 digits passed the decimal point. When results from fitting models were compared between using the binary dosage file and the original VCF or GEN file, there were slight but unimportant differences. It was considered desirable to be able to return the values exactly as they appear in the original imputation file.

Format 3

Format 3.1 and 3.2 has a header similar to formats 1 and 2 but the number of subjects and SNPs were added to the header to avoid problems associating the wrong family or map file to the binary dosage file.

Format 3.3 and 3.4 has a header similar to formats 1 and 2 but the md5 hash of the family and map data frames were added to the header to avoid problems associating the wrong family or map file to the binary dosage file.

Format 3.1 and 3.3

The data section of formats 3.1 and 3.3 are the same as format 2.1

Format 3.2

Each SNP in the data section begins with a integer value identifying how long the section is for that SNP. The data is then stored as described below in minimizing the data stored.

Format 3.4

Format 3.4 stores the data is a similar format as 3.2 but the section begins with the lengths of all the sections for the SNPs and then is followed by the genotype information.

Format 4

Format 4 takes the data that is in the family and map files and moves it into the header of the binary dosage file. The first section of the header has the magic word and the format. This is followed by information on where the family, map, and genotype data are stored in the file. After the header there is the family data, followed by the map data, and then the imputation data.

Format 4.1 and 4.3

The data section of formats 4.1 and 4.3 are the same as format 2.1

Format 4.2 and 4.4

The data section of formats 4.2 and 4.4 are the same as format 4.2 and 4.3 respectively.

Minimizing the data stored

A lot is known about the imputation data. We know the following \[\Pr(g=0) + \Pr(g=1) + \Pr(g=2) = 1 \] \[ d = \Pr(g=1) + 2\Pr(g=2)\] where \(d\) is the dosage. This means we only need to know two of the values to calculate the other two. In the BinaryDosage package, the dosage and $(g=1) are used.

It is quite often the case that either \(\Pr(g=0)\) or \(\Pr(g=2)\) is 0. In this case, knowing the dosage is enough.

\[ \Pr(g = 1) = \left\{\begin{array}{ll}% d & \; \Pr(g=2)=0, d \leq 1\\% 2 - d & \; \Pr(g=0) = 0, d > 1 % \end{array}\right. \] Once the dosage and \(\Pr(g=1)\) is know the other values can be quickly calculated.

\[\Pr(g=2) = \frac{d - \Pr(g=1)}{2}\]

\[\Pr(g=0) = 1 - \Pr(g=1) - \Pr(g=2)\]

These formulae work well but sometimes there is round off error in the imputation values. In these cases the above equations can’t be used to get the exact imputation values. In these situations all four imputations values, \(d\), \(\Pr(g=0)\),\(\Pr(g=1)\), and \(\Pr(g=2)\) have to be saved. Fortunately this is not a common occurrence.

Since the values stored are short integers of 2 bytes in length, only the last 15 bits are used. This allows the 16th bit to be used as an indicator. For each SNP for each subject the first value saved is the dosage. If the 16th bit is 0, this indicates that either \(\Pr(g=0)\) or \(\Pr(g=2)\) is 0 and the other values can be calculated as described above. If the 16th bit is set to 1, this indicates that the value of \(\Pr(g=1)\) follows. If the 16th bit is set to 0, this indicates that the above equations can be used to calculate \(\Pr(g=0)\) and \(\Pr(g=2)\). If the 16th bit is set to one, this indicates the next two values are \(\Pr(g=0)\) and \(\Pr(g=2)\) respectively.

Note: Usage of the this method general results in 2.2 to 2.4 bytes needed to store each SNP for each subject.