Read vcf.bgz file from gnomAD with HTSlib

gnomAD is very convinient. However, it can only provide the variants in a gene or a region. I am going to provide a software tool that search through the whole genome and output the variants after filtering with the criteria defined by users. The under developing code can be found at https://github.com/peng-gang/ReadGnomad.

It is interesting that I can use function ‘bcf_read1’ to read records in vcf.bgz file. The output of this function is a data struction with compressedd information. I can get the information from it very fast according to its structure.

However, when I try to read the variant is a specified region like “1:100000-200000” with index extracted from “vcf.bgz.tbi”. I can only use ‘tbx_itr_next’ only instead of ‘bcf_itr_next’. The explanation is the file is in vcf format instead of bcf format. How can I use ‘bcf_read1’ at first? the format of the file can be found use the following code.

htsFile *fp = hts_open(fname,"r");

enum htsExactFormat format = hts_get_format(fp)->format;

The output of this function is a string of a vcf record that is hard to get a specific information very fast like ‘bcf1_t’ structre get from ‘bcf_itr_next’ or ‘bcf_read1’.

An example of reading vcf or bcf file in a specific region from Rsamtools can be found here.

Avatar
Gang Peng
Associate Research Scientist in Biostatistics

My research interests include biostatistics, bioinformatics, and data mining.

comments powered by Disqus