gVCF Files

gVCF was developed to store sequencing information for both variant and nonvariant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project. These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted resequencing, gVCF is also an appropriate choice to represent and compress results.

gVCF is a text file format, stored as a gzip compressed file (*.genome.vcf.gz). Compression is further achieved by joining contiguous nonvariant regions with similar properties into single ‘block’ VCF records. To maximize the utility of gVCF, especially for high stringency applications, the properties of the compressed blocks are conservative. Block properties like depth and genotype quality reflect the minimum of any site in the block. The gVCF file can be indexed (creating a *.tbi file) and used with existing VCF tools such as tabix and IGV, making it convenient both for direct interpretation and as a starting point for further analysis.

Apps that use gVCF files locate the file automatically when they are launched on a sample. To use a gVCF file in an external tool, download the file first.

Each gVCF file contains 1 sample.

Some of the following information about the gVCF format may be outdated for our newer apps. Refer to the DRAGEN user guide for up-to-date information.

Nonvariant Blocks Using END Key

Contiguous nonvariant segments of the genome can be represented as single records in gVCF. These records use the standard 'END' INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field to reduce file size.

The following is a simplified segment of a gVCF file, describing a segment of nonvariant calls (starting with an A) on chromosome 1 from position 51845 to 51862.

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238

chr1 51845 . A . . PASS END=51862
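As a minimal sketch of how a tool might read such a record, the following Python function recovers the span of a block from the POS column and the END key. The function name `block_span` is hypothetical, and the sketch assumes whitespace-separated columns as in the simplified example (real VCF files are tab-delimited, which `split()` also handles).

```python
def block_span(record_line):
    """Return (chrom, start, end) for a gVCF data line.

    If the INFO column has no END key, the record covers a single
    position and end equals start.
    """
    fields = record_line.split()
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    end = pos
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "END":
            end = int(value)
    return chrom, pos, end


chrom, start, end = block_span("chr1 51845 . A . . PASS END=51862")
# Number of reference positions covered by the block (inclusive range).
print(chrom, start, end, end - start + 1)
```

For the example record, the block covers 18 positions (51845 through 51862), even though REF holds only the first base.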

Any field provided for a block of sites, such as read depth (using the DP key), shows the minimum value that is observed among all sites encompassed by the block. Each sample value shown for the block, such as the depth (DP), is restricted to a range where the maximum value is within 30% of the minimum or within 3 of the minimum, whichever allowance is larger. For example, for sample value range [x,y], y <= x+max(3,x*0.3). This range restriction applies to each of the sample values printed in the final block record.
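The range restriction can be illustrated with a short Python sketch. It groups a run of per-site depths into blocks and reports the minimum of each block. The greedy left-to-right grouping and the function names are assumptions made for illustration only; the source does not describe how the variant caller actually builds blocks, and the sketch ignores the other block constraints (such as matching filter values).

```python
def fits_block(block_min, block_max, value):
    """True if adding value keeps the range [x, y] within y <= x + max(3, 0.3 * x)."""
    x = min(block_min, value)
    y = max(block_max, value)
    return y <= x + max(3, 0.3 * x)


def compress_depths(depths):
    """Greedily group consecutive per-site depths into blocks.

    Returns (first_index, last_index, reported_depth) tuples, where
    reported_depth is the minimum depth seen in the block.
    """
    if not depths:
        return []
    blocks = []
    start = 0
    lo = hi = depths[0]
    for i in range(1, len(depths)):
        d = depths[i]
        if fits_block(lo, hi, d):
            lo, hi = min(lo, d), max(hi, d)
        else:
            # Close the current block and start a new one at this site.
            blocks.append((start, i - 1, lo))
            start, lo, hi = i, d, d
    blocks.append((start, len(depths) - 1, lo))
    return blocks


print(compress_depths([30, 32, 35, 38, 41, 10, 11, 13, 14]))
```

With these made-up depths, the first four sites form one block reported at depth 30, because 38 is within 30% of 30. The site with depth 41 starts a new block because it exceeds 30 + 9.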

Indel Regions

Sites that are "filled in" inside deletions have additional changes:

All deletions:

  • Sites inside any deletion are marked with the deletion filters, in addition to any filters that have already been applied to the site.

  • Sites inside deletions cannot have a genotype or alternate allele quality score higher than the corresponding value from the enclosing indel.

Heterozygous deletions:

  • Sites inside heterozygous deletions are altered to have haploid genotype entries (eg, "0" instead of "0/0", "1" instead of "1/1").

  • Heterozygous SNV calls inside heterozygous deletions are marked with the SiteConflict filter and their genotype is unchanged.

Homozygous deletions:

  • Homozygous reference and no-call sites inside homozygous deletions have genotype "."

  • Sites inside homozygous deletions that have a nonreference genotype are marked with a SiteConflict filter, and their genotype is unchanged.

  • Site and genotype quality are set to "."

The described modifications reflect the notion that the site confidence is bound within the enclosing indel confidence.
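The rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the variant caller's implementation: the dictionary layouts and the function name are hypothetical, a single quality value stands in for both the genotype and alternate allele quality scores, and the sketch assumes that the quality is set to "." (modeled as None) for every site inside a homozygous deletion.

```python
def adjust_site_in_deletion(site, deletion):
    """Return an adjusted copy of `site` for the deletion that encloses it.

    site:     {"gt": str, "gq": int or None, "filters": list of str}
    deletion: {"homozygous": bool, "gq": int, "filters": list of str}
    Both layouts are hypothetical and exist only for this illustration.
    """
    # Keep the site's own filters, then add the deletion's filters.
    filters = [f for f in site["filters"] if f != "PASS"]
    for name in deletion["filters"]:
        if name != "PASS" and name not in filters:
            filters.append(name)

    gt = site["gt"]
    gq = site["gq"]
    # Site quality cannot exceed the quality of the enclosing indel.
    if gq is not None:
        gq = min(gq, deletion["gq"])

    alleles = gt.split("/")
    conflict = False
    if deletion["homozygous"]:
        if all(a in ("0", ".") for a in alleles):
            gt = "."          # homozygous reference or no-call
        else:
            conflict = True   # nonreference genotype, left unchanged
        gq = None             # written as "." in the file
    elif len(set(alleles)) == 1:
        gt = alleles[0]       # haploid entry: "0/0" -> "0", "1/1" -> "1"
    else:
        conflict = True       # heterozygous call, left unchanged

    if conflict and "SiteConflict" not in filters:
        filters.append("SiteConflict")
    return {"gt": gt, "gq": gq, "filters": filters or ["PASS"]}


hom_ref_site = {"gt": "0/0", "gq": 60, "filters": ["PASS"]}
het_deletion = {"homozygous": False, "gq": 45, "filters": ["PASS"]}
print(adjust_site_in_deletion(hom_ref_site, het_deletion))
```

In this made-up case the site genotype becomes haploid ("0") and its quality is capped at 45, the quality of the enclosing deletion.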

On occasion, the variant caller produces multiple overlapping indel calls that cannot be resolved into 2 haplotypes. In this case, all indels and sites in the region of the overlap are marked with the IndelConflict filter.

Genotype Quality for Variant and Nonvariant Sites

The gVCF file uses an adapted version of genotype quality for variant and nonvariant site filtration. This value is associated with the key GQX. The GQX value is intended to represent the minimum of {Phred genotype quality assuming the site is variant, Phred genotype quality assuming the site is nonvariant}. The reason for using this value is to allow a single value to be used as the primary quality filter for both variant and nonvariant sites. Filtering on this value corresponds to a conservative assumption appropriate for applications where reference genotype calls must be determined at the same stringency as variant genotypes, ie:

  • An assertion that a site is homozygous reference at GQX ≥ 30 is made assuming the site is variant.

  • An assertion that a site is a nonreference genotype at GQX ≥ 30 is made assuming the site is nonvariant.
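The definition of GQX can be written as a small Python sketch. The function and parameter names are hypothetical. The threshold of 30 and the treatment of a missing value follow the LowGQX filter description in this document.

```python
def gqx(gq_if_variant, gq_if_nonvariant):
    """GQX is the smaller of the two Phred-scaled genotype qualities."""
    return min(gq_if_variant, gq_if_nonvariant)


def passes_gqx(gqx_value, threshold=30):
    """A site fails (LowGQX) when GQX is missing or below the threshold."""
    return gqx_value is not None and gqx_value >= threshold


value = gqx(48, 27)  # made-up qualities for illustration
print(value, passes_gqx(value))
```

Because the minimum is used, a site passes only when it is confident under both assumptions, which is why a single threshold can serve for variant and nonvariant sites alike.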

Section Descriptions

The gVCF file contains the following sections:

  • Metainformation lines start with ## and contain metadata, config information, and define the values that the INFO, FILTER, and FORMAT fields can have.

  • The header line starts with # and names the fields that the data lines use. These fields are #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, followed by one or more sample columns.

  • Data lines contain information about one or more positions in the genome.

  • If you extract the variant lines from a gVCF file, you produce a conventional variant VCF file.
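Extracting the variant lines can be sketched in Python as follows. The sketch assumes that a variant line is any data line whose ALT column is not the missing value "."; this is an interpretation, not a definition given in this document. The function name and the sample lines are made up for illustration.

```python
def extract_variant_lines(gvcf_lines):
    """Yield metainformation, header, and variant data lines.

    Assumption: a data line is a variant line when its ALT column
    (the fifth tab-separated column) is not ".".
    """
    for line in gvcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4 and fields[4] != ".":
            yield line


example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA19238\n",
    "chr1\t51845\t.\tA\t.\t.\tPASS\tEND=51862\tGT\t0/0\n",
    "chr1\t51863\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\n",
]
for kept in extract_variant_lines(example):
    print(kept, end="")
```

For a compressed file, the same function can be fed from `gzip.open(path, "rt")` in the Python standard library, where `path` points to the downloaded *.genome.vcf.gz file.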

Field Descriptions

The fixed fields #CHROM, POS, ID, REF, ALT, QUAL are defined in the VCF 4.1 standard provided by the 1000 Genomes Project. The fields ID, INFO, FORMAT, and sample are described in the metainformation.

  • CHROM: Chromosome: an identifier from the reference genome or an angle-bracketed ID String ("<ID>") pointing to a contig.

  • POS: Position: The reference position, with the first base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. There can be multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig.

  • ID: Semicolon separated list of unique identifiers where available. If this ID is a dbSNP variant, it is encouraged to use the rs number. No identifier is present in more than one data record. If there is no identifier available, then the missing value is used.

  • REF: Reference bases: A,C,G,T,N; there can be multiple bases. The value in the POS field refers to the position of the first base in the string. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings include the base before the event. This modification is reflected in the POS field. The exception is when the event occurs at position 1 on the contig, in which case they include the base after the event. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>"), the padding base is required. In that case, POS denotes the coordinate of the base preceding the polymorphism.

  • ALT: Comma-separated list of alternate nonreference alleles called on at least 1 of the samples. Options are:

    • Base strings made up of the bases A,C,G,T,N

    • Angle-bracketed ID String ("<ID>")

    • Break-end replacement string as described in the section on break-ends.

    • If there are no alternative alleles, then the missing value is used.

  • QUAL: Phred-scaled quality score for the assertion made in ALT, that is, -10 log10 p(call in ALT is wrong). If ALT is "." (no variant), this score is -10 log10 p(variant). If ALT is not ".", this score is -10 log10 p(no variant). High QUAL scores indicate high confidence calls. Although integer Phred scores are traditional, this field is permitted to be a floating point value to enable higher resolution for low confidence calls if desired. If unknown, the missing value is specified. (Numeric)

  • FILTER: PASS marks positions that have passed all filters. Otherwise, a semicolon-separated list of codes for filters that failed is provided. gVCF files use the following values:

    • PASS: position has passed all filters.

    • IndelConflict: Locus is in region with conflicting indel calls.

    • SiteConflict: Site genotype conflicts with proximal indel call, which is typically a heterozygous SNV call made inside a heterozygous deletion.

    • LowGQX: Locus GQX (minimum of {Genotype quality assuming variant position,Genotype quality assuming nonvariant position}) is less than 30 or not present.

    • HighDPFRatio: The fraction of base calls filtered out at a site is greater than 0.3.

    • HighSNVSB: SNV strand bias value (SNVSB) exceeds 10. High strand bias indicates a potential high false-positive rate for SNVs.

    • HighSNVHPOL: SNV contextual homopolymer length (SNVHPOL) exceeds 6.

    • HighREFREP: Indel contains an allele that occurs in a homopolymer or dinucleotide track with a reference repeat greater than 8.

    • HighDepth: Locus depth is greater than 3x the mean chromosome depth.

  • INFO: Additional information. INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. gVCF files use the following values:

    • END: End position of the region described in this record.

    • BLOCKAVG_min30p3a: nonvariant site block. All sites in a block are constrained to be nonvariant, have the same filter value, and have all sample values in range [x,y], y ≤ max(x+3,(x*1.3)). All printed site block sample values are the minimum observed in the region spanned by the block.

    • SNVSB: SNV site strand bias.

    • SNVHPOL: SNV contextual homopolymer length.

    • CIGAR: CIGAR alignment for each alternate indel allele.

    • RU: Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs longer than 20 bases are not reported.

    • REFREP: Number of times RU is repeated in reference.

    • IDREP: Number of times RU is repeated in indel allele.

  • FORMAT: Format of the sample field. FORMAT specifies the data types and order of the subfields. gVCF files use the following values:

    • GT: Genotype.

    • GQ: Genotype Quality.

    • GQX: Minimum of {Genotype quality assuming variant position, Genotype quality assuming nonvariant position}.

    • DP: Filtered base call depth used for site genotyping.

    • DPF: Base calls filtered from input before site genotyping.

    • AD: Allelic depths for the ref and alt alleles in the order listed. For indels, this value only includes reads that confidently support each allele (posterior probability 0.999 or higher that read contains indicated allele vs all other intersecting indel alleles).

    • DPI: Read depth associated with indel, taken from the site preceding the indel.

  • SAMPLE: Sample fields as defined by the header.
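To show how the fields fit together, here is a minimal Python sketch that splits one tab-delimited data line into the fields described above. The function names and the example line are hypothetical, the values in the example are made up, and the sketch leaves all INFO and sample values as strings rather than converting them according to the metainformation.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled score Q to a probability: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def parse_record(line, sample_names):
    """Split one gVCF data line into a dictionary of fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, id_, ref, alt, qual, filt, info = cols[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": [] if id_ == "." else id_.split(";"),
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt.split(";"),
        "INFO": {},
        "samples": {},
    }
    # INFO: semicolon-separated keys, each with optional comma-separated data.
    for entry in info.split(";"):
        if entry == ".":
            continue
        key, sep, value = entry.partition("=")
        record["INFO"][key] = value.split(",") if sep else True
    # FORMAT names the colon-separated subfields of every sample column.
    if len(cols) > 9:
        keys = cols[8].split(":")
        for name, column in zip(sample_names, cols[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


line = "chr1\t51845\t.\tA\t.\t.\tPASS\tEND=51862;BLOCKAVG_min30p3a\tGT:GQX:DP:DPF\t0/0:33:12:0"
rec = parse_record(line, ["NA19238"])
print(rec["INFO"]["END"], rec["samples"]["NA19238"]["DP"])
print(phred_to_error_prob(30))
```

Keys without a value, such as BLOCKAVG_min30p3a, are stored as True so that a caller can test for their presence.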
