FASTQ Files
BaseSpace Sequence Hub converts *.bcl files into FASTQ files, which contain base call and quality information for all reads that pass filtering.
BaseSpace Sequence Hub automatically generates FASTQ files in sample sheet-driven workflow apps. Other apps that perform alignment and variant calling also automatically use FASTQ files.
FASTQ files can be used as sequence input for alignment and other secondary analysis software. Do not use them with tools that are not compatible with the FASTQ format.
Detailed Description
Naming
FASTQ files are named with the sample name and the sample number, which is a numeric assignment based on the order in which the sample is listed in the sample sheet. Example: Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz
samplename - The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID, which is a required field in the sample sheet and must be unique.
S1 - The sample number, based on the order in which samples are listed in the sample sheet, starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.
Reads that cannot be assigned to any sample are written to a FASTQ file for sample number 0, and excluded from downstream analysis.
L001 - The lane number.
R1 - The read. In this example, R1 means Read 1. For a paired-end run, there is at least one file with R2 in the file name for Read 2.
001 - The last segment is always 001.
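The naming scheme above can be split back into its parts with a regular expression. The following is a minimal sketch, not an official parser: the pattern is inferred from the single example file name, and the names FASTQ_NAME, FastqName, and parse_fastq_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the example name: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
# (assumption: three-digit lane, single-digit read, as in the example).
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_R(?P<read>\d)_001\.fastq\.gz$"
)


class FastqName(NamedTuple):
    sample_name: str    # sample name (or sample ID) from the sample sheet
    sample_number: int  # 0 is used for reads not assigned to any sample
    lane: int
    read: int


def parse_fastq_name(path: str) -> Optional[FastqName]:
    """Return the name components, or None if the name does not match."""
    # Accept both Windows and POSIX separators and keep only the file name.
    file_name = re.split(r"[\\/]", path)[-1]
    match = FASTQ_NAME.match(file_name)
    if match is None:
        return None
    return FastqName(
        sample_name=match["sample"],
        sample_number=int(match["number"]),
        lane=int(match["lane"]),
        read=int(match["read"]),
    )
```

For the example path, parse_fastq_name returns sample name samplename, sample number 1, lane 1, and read 1. Sample names that themselves contain underscores are handled because the fixed suffix is matched from the right.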
Compression
FASTQ files are saved compressed in the GNU zip (gzip) format, an open source file compression format, indicated by the .gz file extension.
Format
Each entry in a FASTQ file consists of four lines:
Sequence identifier
Sequence
Quality score identifier line (consisting only of a +)
Quality score
For the Undetermined FASTQ files only, the sequence observed in the index read is written to the FASTQ header in place of the sample number. This information can be useful for troubleshooting demultiplexing.
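Because the files are gzip-compressed and every entry spans exactly four lines, they can be read with the Python standard library alone. The sketch below assumes well-formed input with one entry per four lines; FastqRecord and read_fastq are hypothetical names used for illustration.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, including the leading "@"
    sequence: str    # line 2
    quality: str     # line 4 (line 3 is the "+" separator and is not stored)


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield one record per four-line entry of a .fastq.gz file."""
    # Text mode ("rt") makes gzip decompress transparently while reading.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        while True:
            identifier = handle.readline().rstrip("\r\n")
            if not identifier:
                return  # end of file
            sequence = handle.readline().rstrip("\r\n")
            separator = handle.readline().rstrip("\r\n")
            quality = handle.readline().rstrip("\r\n")
            if not identifier.startswith("@") or separator != "+":
                raise ValueError(f"malformed FASTQ entry near {identifier!r}")
            yield FastqRecord(identifier, sequence, quality)
```

The generator reads one entry at a time, so large files do not need to fit in memory.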
The following table describes the elements:
| Element | Requirements | Description |
| --- | --- | --- |
| @ | @ | Each sequence identifier line starts with @ |
| <instrument> | Characters allowed: a–z, A–Z, 0–9 and underscore | Instrument ID |
| <run number> | Numerical | Run number on instrument |
| <flowcell ID> | Characters allowed: a–z, A–Z, 0–9 | Flowcell ID |
| <lane> | Numerical | Lane number |
| <tile> | Numerical | Tile number |
| <x_pos> | Numerical | X coordinate of cluster |
| <y_pos> | Numerical | Y coordinate of cluster |
| <read> | Numerical | Read number. 1 can be a single read or Read 1 of a paired-end run; 2 is Read 2 of a paired-end run |
| <is filtered> | Y or N | Y if the read is filtered (did not pass), N otherwise |
| <control number> | Numerical | 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X systems, control specification is not performed and this number is always 0. |
| <sample number> | Numerical | Sample number from sample sheet |
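In a valid entry, the elements of the sequence identifier line are joined by colons, and a single space precedes the read number element:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x_pos>:<y_pos> <read>:<is filtered>:<control number>:<sample number>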
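The elements in the table can be pulled out of a sequence identifier line with plain string splitting. This is a minimal sketch under two assumptions: the elements appear in the table's order, and they are separated by colons with one space before the read number. FastqHeader and parse_header are hypothetical names, and the header used in the usage note is made up for illustration, not taken from a real run.

```python
from typing import NamedTuple


class FastqHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool    # True when the flag is Y (read did not pass filter)
    control_number: int
    sample_number: str   # kept as text: Undetermined files hold the index sequence here


def parse_header(line: str) -> FastqHeader:
    """Split a sequence identifier line into its elements."""
    if not line.startswith("@"):
        raise ValueError("sequence identifier line must start with @")
    # The single space separates the cluster elements from the read elements.
    cluster_part, read_part = line[1:].rstrip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = cluster_part.split(":")
    read, filtered, control, sample = read_part.split(":")
    if filtered not in ("Y", "N"):
        raise ValueError(f"unexpected filter flag {filtered!r}")
    return FastqHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y", int(control), sample,
    )
```

For the made-up header @INSTR_1:42:FLOWCELL1:1:1101:1000:2000 1:N:0:3, parse_header reports read 1, a read that passed filtering, control number 0, and sample number 3. The last element is left as a string so that headers from Undetermined files, which carry the index sequence in that position, parse as well.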
Control values
If the read is not identified as a control, then the 10th column (<control number>) is zero. If the read is identified as a control, the number is greater than zero, and the value specifies what type of control it is. The value is the decimal representation of a bit-wise encoding scheme. In that scheme bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on.
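The bit-wise encoding can be unpacked by testing each bit of the control number in turn. The sketch below only reports which bit positions are set; the meaning of each bit is not described here, so none is assumed. The function name control_bits is hypothetical.

```python
from typing import List


def control_bits(control_number: int) -> List[int]:
    """Return the positions of the bits that are set in a control number."""
    if control_number < 0:
        raise ValueError("control number must not be negative")
    # Bit n contributes 2**n to the decimal value: bit 0 -> 1, bit 1 -> 2, bit 2 -> 4.
    return [
        bit
        for bit in range(control_number.bit_length())
        if control_number & (1 << bit)
    ]
```

A control number of 0 yields an empty list (not a control), while a value of 6 yields bits 1 and 2, because 6 = 2 + 4.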