VCF Files

VCF is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and then data lines. Each data line contains information about a single variant.

Use VCF files for direct interpretation or as a starting point for further analysis with downstream tools that are compatible with VCF, such as IGV or the UCSC Genome Browser. Do not open VCF files in tools that do not support the variant call format, such as Outlook, which treats the .vcf extension as a vCard contact file.

If you use a BaseSpace Sequence Hub app that uses VCF files as input, the app locates the file when launched. If using VCF files in other tools, download the file to use it in the external tool.

File Format

Some of the following information about the VCF format may be outdated for our newer apps. Refer to the DRAGEN user guide for up-to-date information.

The file naming convention for VCF files is as follows: SampleName_S#.vcf (where # is the sample number determined by ordering in the sample sheet).
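The naming convention above can be checked programmatically. The following Python sketch is illustrative only: the function name parse_vcf_filename and the example file names are hypothetical, and the pattern assumes that the sample number is written as plain digits.

```python
import re

# Pattern for SampleName_S#.vcf, where # is the sample number.
# Assumption: the sample number consists of digits only.
VCF_NAME = re.compile(r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)\.vcf$")


def parse_vcf_filename(filename):
    """Return (sample_name, sample_number), or None if the name does not match."""
    match = VCF_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample_name"), int(match.group("sample_number"))


# Hypothetical file names, for illustration only.
print(parse_vcf_filename("MySample_S3.vcf"))
print(parse_vcf_filename("notes.txt"))
```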

The header of the VCF file describes the tags used in the remainder of the file and has the column header:

##fileformat=VCFv4.1
##fileDate=20120317
##source=SequenceAnalysisReport.vshost.exe
##reference=
##phasing=none
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=TI,Number=.,Type=String,Description="Transcript ID">
##INFO=<ID=GI,Number=.,Type=String,Description="Gene ID">
##INFO=<ID=CD,Number=0,Type=Flag,Description="Coding Region">
##FILTER=<ID=q20,Description="Quality below 20">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
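As a rough illustration of how the meta-information lines can be read, the Python sketch below splits one "##" line into a key and a value. The function name parse_meta_line is hypothetical, and the parsing is a simplification that assumes double quotes only appear around whole values, as in the header shown above.

```python
import re


def parse_meta_line(line):
    """Split one '##key=value' meta-information line into (key, value).

    Structured values such as <ID=DP,Number=1,...> are returned as a dict;
    simple values such as VCFv4.1 are returned as a string.
    """
    line = line.rstrip("\r\n")
    if not line.startswith("##"):
        raise ValueError("not a meta-information line: " + line)
    key, _, value = line[2:].partition("=")
    if value.startswith("<") and value.endswith(">"):
        fields = {}
        # Each field is name=value; a value may be wrapped in double quotes,
        # in which case it is allowed to contain commas.
        for match in re.finditer(r'(\w+)=("[^"]*"|[^,]*)', value[1:-1]):
            fields[match.group(1)] = match.group(2).strip('"')
        return key, fields
    return key, value


print(parse_meta_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'))
print(parse_meta_line("##fileformat=VCFv4.1"))
```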

A sample line of the VCF file, with the data that is used to populate each column described:

chr22 16285888 rs76548004 T C 17 d15;q20 DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17
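The sample line can be split into named columns with a few lines of Python. This is a minimal sketch, not a full VCF parser: the names FIXED_COLUMNS and split_record are hypothetical, and the code relies on the VCF convention that columns in the file are separated by tabs (the sample line is displayed here with spaces).

```python
# The nine fixed columns; any columns after FORMAT hold per-sample values.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def split_record(line):
    """Split one tab-separated data line into a dict keyed by column name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) < len(FIXED_COLUMNS) + 1:
        raise ValueError("expected at least 10 tab-separated columns")
    record = dict(zip(FIXED_COLUMNS, values))
    record["POS"] = int(record["POS"])  # 1-based position
    record["SAMPLES"] = values[len(FIXED_COLUMNS):]  # one entry per sample column
    return record


# The sample line from this page, rebuilt with tab separators.
sample_line = "\t".join([
    "chr22", "16285888", "rs76548004", "T", "C", "17", "d15;q20",
    "DP=11;TI=NM_001136213;GI=POTEH;CD", "GT:GQ", "1/0:17",
])
record = split_record(sample_line)
print(record["CHROM"], record["POS"], record["FILTER"], record["SAMPLES"])
```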

ALT: The alleles that differ from the reference read. For example, an insertion of a single T could show reference A and alternate AT.

CHROM: The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file (generally karyotype order).

FILTER: If all filters are passed, 'PASS' is written. The possible filters are as follows:

  • q20 – The variant score is less than 20. (Configurable using the VariantFilterQualityCutoff setting in the config file)

  • r8 – For an Indel, the number of repeats in the reference (of a 1- or 2-base repeat) is greater than 8. (Configurable using the IndelRepeatFilterCutoff setting in the config file)
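A short Python sketch of how the FILTER column can be interpreted is shown below. The function name failed_filters is hypothetical. Treating the missing value '.' as "no filters applied" follows the general VCF convention and is an assumption about the files described here.

```python
def failed_filters(filter_field):
    """Return the list of filter IDs that a variant failed.

    An empty list means the variant passed all filters ('PASS') or that
    no filters were applied ('.').
    """
    if filter_field in ("PASS", "."):
        return []
    # Multiple failed filters are separated by semicolons, e.g. 'd15;q20'.
    return filter_field.split(";")


print(failed_filters("PASS"))
print(failed_filters("d15;q20"))
```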

FORMAT: The format column lists fields (separated by colons), for example, "GT:GQ". The list of fields provided depends on the variant caller used. The available fields are as follows:

  • AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls

  • GQ – Genotype quality

  • GT – Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, 2 corresponds to the second entry in the ALT column, etc. The '/' indicates that there is no phasing information.

  • NL – Noise level; an estimate of base calling noise at this position

  • SB – Strand bias at this position. Larger negative values indicate less bias; values near zero indicate more strand bias.

  • VF – Variant frequency. The percentage of reads supporting the alternate allele.
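To illustrate how the FORMAT column pairs with a sample column, the Python sketch below zips the two colon-separated lists and translates a GT value into alleles. The function names parse_sample and genotype_alleles are hypothetical. Handling of the '|' separator (phased genotypes) and of comma-separated ALT entries follows the general VCF convention rather than anything specific to this page.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def genotype_alleles(gt, ref, alt):
    """Translate a GT value such as '1/0' into the alleles it refers to.

    Index 0 is the REF allele; index 1, 2, ... are entries of the ALT column.
    A '.' index (missing call) is returned as None.
    """
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"  # '/' means no phasing information
    return [None if index == "." else alleles[int(index)]
            for index in gt.split(separator)]


sample = parse_sample("GT:GQ", "1/0:17")
print(sample)
print(genotype_alleles(sample["GT"], ref="T", alt="C"))
```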

ID: The rs number for the SNP obtained from dbSNP. If there are multiple rs numbers at this location, the list is semi-colon delimited. If no dbSNP entry exists at this position, the missing value ('.') is used.

INFO: The possible entries in the INFO column:

  • AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls.

  • CD – A flag indicating that the SNP occurs within the coding region of at least one RefGene entry

  • DP – The depth (number of base calls aligned to this position)

  • GI – A comma-separated list of gene IDs read from RefGene

  • NL – Noise level; an estimate of base calling noise at this position.

  • TI – A comma-separated list of transcript IDs read from RefGene

  • SB – Strand bias at this position.

  • VF – Variant frequency. The percentage of reads supporting the alternate allele.
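The INFO column is a semicolon-separated list of key=value entries and bare flags. The Python sketch below shows one way to turn it into a dictionary. The function name parse_info is hypothetical, and values are left as strings, so list-valued entries such as GI and TI would still need to be split on commas.

```python
def parse_info(info_field):
    """Parse an INFO column into a dict.

    key=value entries keep their value as a string; flags such as CD,
    which have no value, are stored as True.
    """
    info = {}
    if info_field == ".":  # missing value: no INFO entries
        return info
    for entry in info_field.split(";"):
        key, separator, value = entry.partition("=")
        info[key] = value if separator else True
    return info


info = parse_info("DP=11;TI=NM_001136213;GI=POTEH;CD")
print(info)
print(int(info["DP"]), info.get("CD", False))
```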

POS: The 1-based position of this variant in the reference chromosome. The convention for VCF files is that, for SNPs, this base is the reference base with the variant. For insertions or deletions, this base is the reference base immediately before the variant. Variants are in order of position.
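Because the base at POS is included for insertions and deletions, the variant type can usually be inferred by comparing the REF and ALT strings. The Python sketch below is a simplified heuristic, not the method used by any particular variant caller. The function name classify_variant is hypothetical, and the code assumes a single ALT allele.

```python
def classify_variant(ref, alt):
    """Classify a variant from its REF and ALT strings (single ALT allele).

    For insertions and deletions, the first base of REF and ALT is the
    reference base immediately before the variant.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "other"


print(classify_variant("T", "C"))   # single-base substitution
print(classify_variant("A", "AT"))  # insertion of a single T
print(classify_variant("TT", "T"))  # deletion of a single T
```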

QUAL: A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores (based on their statistical models) which are high relative to the error rate observed in practice.
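The relationship between a Phred-scaled score and the estimated error probability can be written as a short Python sketch. The function names are hypothetical; the arithmetic follows the formula given for QUAL, where a score of Q corresponds to an error probability of 10 raised to the power of -Q/10.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability of an error for a Phred-scaled score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Q30 corresponds to an error rate of about 0.1%.
print(f"{phred_to_error_probability(30):.4%}")
# The sample line on this page has QUAL 17.
print(f"{phred_to_error_probability(17):.4%}")
```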

REF: The reference genotype. For example, a deletion of a single T can read reference TT and alternate T.

SAMPLE: The sample column gives the values specified in the FORMAT column. One MAXGT sample column is provided for the normal genotyping (assuming the reference). For reference, a second column is provided for genotyping assuming the site is polymorphic. See the Starling documentation for more details.

Variant files for Isaac also contain off-target variant calls, with filter.
