Result File Description



◦ Deliverables List


File Type            File Name                   Description
FASTQ file           Sample_1.fastq.gz           Raw read1 sequence data
FASTQ file           Sample_2.fastq.gz           Raw read2 sequence data
BAM file             Sample.recal.bam            BWA alignment file
BAM file             Sample.recal.bam.bai        BWA alignment index file
Variant Call Result  Sample.final.vcf            SNP/INDEL file (vcf format)
Variant Call Result  Sample.g.vcf                Genomic VCF
Variant Call Result  Sample_SNP_Indel_ANNO.xlsx  Annotated variant list file (excel file)
Summary              All_samples_stats.xlsx      Analysis stats report of all samples (excel file)



◦ Deliverables File Format


 - FASTQ File


  Each read in a FASTQ file is stored as a record of four lines.

  • Line 1 : Sequence identifier
  • Line 2 : Nucleotide sequence
  • Line 3 : Quality score identifier line, beginning with the character '+'
  • Line 4 : Quality scores, one character per base
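
The four-line layout above can be read with a few lines of Python. This is only a minimal sketch, not part of the delivered pipeline: `FastqRecord` and `read_fastq` are hypothetical names, the record shown is made up, and the code assumes that each sequence and quality string sits on a single line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # Line 1, without the leading '@'
    sequence: str    # Line 2
    quality: str     # Line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
records = list(read_fastq(example))
print(records[0])
```

For a delivered file such as Sample_1.fastq.gz, the same function should work on a handle opened in text mode with `gzip.open(path, "rt")` from the standard library.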

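   Each quality score is a Phred score Q, defined from the probability that the base call is incorrect (the error rate):

   Q = -10 log10(error rate)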

Phred Quality Score  Probability of Incorrect Base Call  Base Call Accuracy
10                   1 in 10                             90%
20                   1 in 100                            99%
30                   1 in 1000                           99.9%
40                   1 in 10000                          99.99%
50                   1 in 100000                         99.999%
60                   1 in 1000000                        99.9999%
  • Encoding: ASCII Character Code=Phred Quality Value + 33
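
As a rough illustration of the formula and the +33 encoding, the sketch below decodes a quality string into Phred scores and converts between scores and error rates. The function names and the example string are hypothetical and chosen only for this illustration.

```python
import math
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string: Phred value = ASCII code - 33."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 log10(error rate)."""
    return 10 ** (-q / 10)


def phred_from_error(error_rate: float) -> float:
    """Apply Q = -10 log10(error rate)."""
    return -10 * math.log10(error_rate)


# '+' is ASCII 43, '5' is 53, '?' is 63 and 'I' is 73.
print(phred_scores("+5?I"))  # [10, 20, 30, 40]
print(error_probability(30))
print(phred_from_error(0.01))
```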

  The HiSeq 4000 and NovaSeq 6000 group quality scores into specific ranges, or bins, and assign a value to each range.

  For example, the original quality scores 20-24 may form one bin and all be mapped to a new value of 22. Q-score binning significantly reduces storage space requirements without affecting the accuracy or performance of downstream applications. Q-scores for the HiSeq 4000 are binned according to the criteria in the table below.

Q-Score Bins  Example of Empirically Mapped Q-Scores
N (no call)   N (no call)
2-9           7
10-19         11
20-24         22
25-29         27
30-34         32
35-39         37
40-45         42
  • The quality score table above is typically updated when significant characteristics of the sequencing platform change, such as new hardware, software, or chemistry versions.
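
The binning table above can be expressed as a simple lookup. This is a sketch for illustration, not the sequencer's own implementation: `Q_SCORE_BINS` and `bin_q_score` are hypothetical names, the no-call row is not modelled, and raising an error for scores outside the listed bins is an assumption.

```python
from typing import List, Tuple

# (lowest score in bin, highest score in bin, mapped score), from the table above.
Q_SCORE_BINS: List[Tuple[int, int, int]] = [
    (2, 9, 7),
    (10, 19, 11),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 32),
    (35, 39, 37),
    (40, 45, 42),
]


def bin_q_score(q: int) -> int:
    """Return the mapped score of the bin that contains q."""
    for low, high, mapped in Q_SCORE_BINS:
        if low <= q <= high:
            return mapped
    # Assumption: scores outside the table are treated as an error here.
    raise ValueError(f"Q-score {q} is outside the listed bins")


print([bin_q_score(q) for q in (8, 15, 23, 31, 40)])  # [7, 11, 22, 32, 42]
```

Because only eight distinct values remain after binning, the quality strings compress much better, which is where the storage saving comes from.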


 - VCF File


  The Variant Call Format (VCF) is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and data lines. Each data line contains information about a single variant.

  The columns of the header line, which also define the fields of each data line, are described below.

Header  Description
#CHROM  Chromosome
POS     Position (with the 1st base having position 1)
ID      The dbSNP rs identifier of the SNP
REF     Reference base(s)
ALT     Comma-separated list of alternate non-reference alleles called on at least one of the samples
QUAL    A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors).
FILTER  Filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for the filters that failed. See the FILTER tag table for possible entries.
INFO    Additional information: INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. The exact format of each INFO sub-field should be specified in the meta-information. See the INFO tag table for possible entries.
FORMAT  See the FORMAT tag table for possible entries.
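
To make the column layout concrete, the sketch below splits one tab-separated data line into the columns described above. `VcfRecord` and `parse_data_line` are hypothetical names, and the example line (including its rs identifier) is made up for illustration; it is not taken from a delivered file.

```python
from typing import List, NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                 # 1-based position
    id: Optional[str]        # None when the column holds '.'
    ref: str
    alt: List[str]           # one entry per ALT allele
    qual: Optional[float]
    filters: List[str]       # ['PASS'] or the codes of the failed filters
    info: str                # raw INFO string
    format: Optional[str]    # raw FORMAT string, if present
    samples: List[str]       # one raw string per sample column


def parse_data_line(line: str) -> VcfRecord:
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, ident, ref, alt, qual, flt, info = fields[:8]
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        id=None if ident == "." else ident,
        ref=ref,
        alt=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=[] if flt == "." else flt.split(";"),
        info=info,
        format=fields[8] if len(fields) > 8 else None,
        samples=fields[9:],
    )


# Made-up data line, for illustration only.
line = "chr1\t12345\trs0000\tA\tG\t50.0\tPASS\tAC=1;AN=2;DB\tGT:DP\t0/1:30\n"
record = parse_data_line(line)
print(record.chrom, record.pos, record.alt, record.filters)
```

For real analyses an established VCF library is usually a safer choice than hand-written parsing; the sketch is only meant to show how the columns relate to each other.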

FILTER tag table

Tag              Description
LowQual          Low quality
MG_INDEL_Filter  QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0
MG_SNP_Filter    QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0
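
The two filter expressions above can be read as plain boolean conditions on the INFO annotations. The following sketch evaluates them on a dictionary of annotation values. The function names and the example values are hypothetical, and treating a missing annotation as "does not trigger the filter" is an assumption of this sketch rather than documented behaviour of the pipeline.

```python
from typing import Mapping


def _below(info: Mapping[str, float], key: str, threshold: float) -> bool:
    # Assumption: a missing annotation never triggers its clause.
    return key in info and info[key] < threshold


def _above(info: Mapping[str, float], key: str, threshold: float) -> bool:
    return key in info and info[key] > threshold


def fails_snp_filter(info: Mapping[str, float]) -> bool:
    """MG_SNP_Filter: any one clause being true filters the site."""
    return (
        _below(info, "QD", 2.0)
        or _above(info, "FS", 60.0)
        or _below(info, "MQ", 40.0)
        or _below(info, "MQRankSum", -12.5)
        or _below(info, "ReadPosRankSum", -8.0)
    )


def fails_indel_filter(info: Mapping[str, float]) -> bool:
    """MG_INDEL_Filter: any one clause being true filters the site."""
    return (
        _below(info, "QD", 2.0)
        or _above(info, "FS", 200.0)
        or _below(info, "ReadPosRankSum", -20.0)
    )


# Made-up annotation values, for illustration only.
annotations = {"QD": 15.2, "FS": 75.0, "MQ": 60.0}
print(fails_snp_filter(annotations))    # True, because FS > 60.0
print(fails_indel_filter(annotations))  # False, because FS <= 200.0
```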

INFO tag table

Tag              Description
AC               Allele count in genotypes, for each ALT allele, in the same order as listed
AF               Allele frequency, for each ALT allele, in the same order as listed
AN               Total number of alleles in called genotypes
BaseQRankSum     Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities
ClippingRankSum  Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases
DB               dbSNP membership
DP               Approximate read depth; some reads may have been filtered
FS               Phred-scaled p-value using Fisher’s exact test to detect strand bias
HaplotypeScore   Consistency of the site with at most two segregating haplotypes
InbreedingCoeff  Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
MLEAC            Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
MLEAF            Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
MQ               RMS mapping quality
MQ0              Total number of reads with mapping quality zero
MQRankSum        Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities
QD               Variant confidence/quality by depth
ReadPosRankSum   Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias
SOR              Symmetric odds ratio of 2x2 contingency table to detect strand bias
set              Source VCF for the merged record in CombineVariants
SNP              Variant is a SNP
MNP              Variant is an MNP
INS              Variant is an insertion
DEL              Variant is a deletion
MIXED            Variant is a mixture of INS/DEL/SNP/MNP
HOM              Variant is homozygous
HET              Variant is heterozygous
VARTYPE          Comma-separated list of variant types, one per allele
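
Since the INFO column is a semicolon-separated series of keys with optional comma-separated values, it can be unpacked into a dictionary. The sketch below is one possible way to do this. `parse_info` is a hypothetical name, the example string is made up, and values are kept as strings because their types are declared in the meta-information lines, which this sketch does not read.

```python
from typing import Dict, List, Union

InfoValue = Union[bool, List[str]]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Split an INFO string into keys and lists of values.

    Keys without '=' (flags such as DB) are stored as True.
    """
    result: Dict[str, InfoValue] = {}
    if info in ("", "."):
        return result
    for entry in info.split(";"):
        key, separator, value = entry.partition("=")
        result[key] = value.split(",") if separator else True
    return result


# Made-up INFO string for a site with two ALT alleles.
parsed = parse_info("AC=1,1;AF=0.5,0.5;AN=2;DB;DP=30")
print(parsed["AC"])  # ['1', '1'], one count per ALT allele
print(parsed["DB"])  # True
```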

FORMAT tag table

Tag  Description
GT   Genotype
     0/0 - the sample is homozygous reference
     0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
     1/1 - the sample is homozygous alternate
AD   Allelic depths for the ref and alt alleles in the order listed
DP   Read depth at this position for this sample
GQ   Conditional genotype quality, encoded as a Phred quality
PL   The normalized, Phred-scaled likelihoods for each of the genotypes 0/0, 0/1, and 1/1, without priors. The most likely genotype (given in the GT field) is scaled so that its P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.
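
The FORMAT column names the colon-separated values in each sample column, in the same order. The sketch below pairs the two, interprets GT for the diploid cases listed above, and converts PL values back to likelihoods relative to the most likely genotype. All function names and the sample values are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def parse_sample(format_field: str, sample_field: str) -> Dict[str, str]:
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def describe_genotype(gt: str) -> str:
    """Interpret a diploid GT value such as 0/0, 0/1 or 1/1."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no call"
    if all(allele == "0" for allele in alleles):
        return "homozygous reference"
    if len(set(alleles)) == 1:
        return "homozygous alternate"
    return "heterozygous"


def pl_to_relative_likelihoods(pl: str) -> List[float]:
    """Undo the Phred scaling: likelihood = 10 ** (-PL / 10)."""
    return [10 ** (-int(value) / 10) for value in pl.split(",")]


# Made-up FORMAT and sample columns, for illustration only.
sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:12,10:22:99:255,0,310")
print(sample["GT"], describe_genotype(sample["GT"]))
print(pl_to_relative_likelihoods(sample["PL"]))
```

In this example the PL of 0 belongs to 0/1, so its relative likelihood is 1.0 and the other two genotypes are many orders of magnitude less likely.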