Result File Description



◦ Deliverables List

File Type           File Name                   Description
Raw Data            Sample1_R1.fastq.gz         Raw read1 sequence data
                    Sample1_R2.fastq.gz         Raw read2 sequence data
BWA Alignment File  Sample1_sorted.bam          iSAAC alignment file
                    Sample1_sorted.bam.bai      iSAAC alignment index file
SNP/INDEL Result    Sample1_SNP_INDEL.vcf       SNP/INDEL file (VCF format)
                    Sample1_[chr*].xlsx         Converted SNP/INDEL result (Excel file)
                    Sample1.genome.vcf.gz       Genomic VCF
                    Sample1.genome.vcf.gz.tbi   Genomic VCF index file
CNV Result          Sample1_CNVs.xlsx           Control-FREEC CNV result
SV Result           Sample1_SV.vcf              Manta SV result
Md5sum              Order_#samples_md5sum.xlsx  An MD5 checksum is a string of 32 hexadecimal
                                                characters that acts as a 'fingerprint' of a file.
                                                By comparing the supplied MD5 value with the value
                                                computed by the MD5sums utility, you can confirm
                                                that the file you downloaded has not been tampered
                                                with or modified from the original file stored on
                                                our server.
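The checksum comparison described above can be sketched in Python with the standard `hashlib` module. The function names and the chunk size are illustrative assumptions, and the supplied value is assumed to have been copied by hand from the md5sum spreadsheet.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest (32 hexadecimal characters) of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large FASTQ/BAM files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_supplied_md5(path, supplied_md5):
    """Compare the computed digest with the value supplied for the file."""
    return md5_of_file(path) == supplied_md5.strip().lower()
```

A result of `False` suggests the download is incomplete or the file was altered, and the file should be downloaded again.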



◦ Deliverables File Format


- FASTQ File


  Each record in a FASTQ file consists of four lines.

  • Line1 : Sequence identifier
  • Line2 : Nucleotide sequences
  • Line3 : Quality score identifier line - character '+'
  • Line4 : Quality score
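The four-line record layout above can be read with a few lines of Python. This is a minimal sketch, assuming gzip-compressed input with exactly four lines per record (no wrapped sequences); the function name `read_fastq` is hypothetical.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (identifier, sequence, quality) tuples from a .fastq.gz file."""
    with gzip.open(path, "rt") as handle:
        while True:
            # Take the next four lines: identifier, bases, '+', qualities.
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if not record:
                break
            if len(record) < 4:
                raise ValueError("truncated FASTQ record")
            header, sequence, separator, quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            yield header[1:], sequence, quality
```

For example, `for name, seq, qual in read_fastq("Sample1_R1.fastq.gz"): ...` iterates over the reads of the first deliverable.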

  Each quality score is represented by a single character. Each character corresponds to the base at the same position in Line 2 and is encoded as Phred+33.

   Q = -10 log10(error rate)

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%
60                    1 in 1000000                         99.9999%
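The relationship Q = -10 log10(error rate) and the rows of the table can be reproduced with two small helper functions. The function names are illustrative assumptions.

```python
import math


def error_probability(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def phred_score(error_rate):
    """Phred score for a given probability of an incorrect base call."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50, 60):
        p = error_probability(q)
        # Accuracy is the complement of the error probability.
        print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.4f}%")
```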

  • Encoding: ASCII character code=Phred quality value + 33
  • Q scores have been calibrated specifically for the Illumina system and its consumables. The data use Q score binning, which is necessary for Illumina runs because of the quantity of data generated; it cannot be turned off.


      More information can be found here:

    Illumina_technote_understanding_quality_scores.pdf



  • The quality score table above is typically updated when significant characteristics of the sequencing platform change, such as new hardware, software, or chemistry versions.
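The Phred+33 encoding noted above (ASCII character code = Phred quality value + 33) can be decoded directly from the quality line of a FASTQ record. This is a minimal sketch; the function names are hypothetical.

```python
PHRED_OFFSET = 33  # Phred+33 encoding, as stated above


def decode_phred33(quality_string):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_string]


def encode_phred33(scores):
    """Convert a list of Phred scores back into a quality string."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)
```

For instance, the character `I` has ASCII code 73 and therefore decodes to Q40, while `!` (ASCII 33) decodes to Q0.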

- BAM File

      BAM is the compressed binary version of the SAM (Sequence Alignment/Map) format. A BAM file contains the alignments of the sequencing reads against a reference sequence.

      Example :

    Tag Description
    @HD The header line
    @PG Program and command line
    @RG Read group. platform, sample name information
    @SQ Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.

    Field Description
    QNAME Query template name (read ID)
    FLAG Bitwise flag
    RNAME Reference sequence name (chromosome id)
    POS 1-based leftmost mapping position
    MAPQ Mapping quality
    CIGAR CIGAR string
    RNEXT Ref. name of the mate/next read
    PNEXT Position of the mate/next read
    TLEN Observed template length
    SEQ Segment sequence
    QUAL ASCII of Phred-scaled base QUALity+33
    Optional Optional fields, each in TAG:TYPE:VALUE format
      More information can be found here: https://samtools.github.io/hts-specs/SAMv1.pdf
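The alignment fields listed above can be illustrated on a single alignment line in SAM text form. BAM itself is binary, so this sketch assumes the record has already been converted to tab-separated text (for example with `samtools view`). The class name, function names, and the short labels given to the FLAG bits are illustrative assumptions; the bit values follow the SAM specification.

```python
from typing import NamedTuple, Tuple

# FLAG bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: Tuple[str, ...]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    f = fields
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def decode_flag(flag):
    """Return the labels of all bits set in the bitwise FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]
```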

- VCF File

      The Variant Call Format (VCF) is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and data lines. Each data line contains information about a single variant.

      Example :

    Header Description
    #CHROM Chromosome
    POS Position (with the 1st base having position 1)
    ID The dbSNP rs identifier of the SNP
    REF Reference base(s)
    ALT Comma separated list of alternate non-reference alleles called on at least one of the samples
    QUAL A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors).
    FILTER See FILTER tag table for possible entries.
    INFO See INFO tag table for possible entries.
    FORMAT See FORMAT tag table for possible entries.
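The column layout above can be sketched as a small parser for tab-separated VCF data lines. This is a minimal illustration, not a replacement for a full VCF library; the function names and the `SAMPLES` key are hypothetical.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line):
    """Turn one VCF data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # comma-separated alternate alleles
    record["SAMPLES"] = fields[9:]            # one column per sample
    return record


def iter_vcf_records(lines):
    """Skip meta-information ('##') and header ('#CHROM') lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

Since `Sample1.genome.vcf.gz` is compressed, the lines could be supplied by `gzip.open(path, "rt")`.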

      Filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of the codes for the filters that failed, as listed below.

    Tag Description
    IndelConflict Indel genotypes from two or more loci conflict in at least one sample
    SiteConflict Site is filtered due to an overlapping indel call filter
    LowGQX Locus GQX is below threshold or not present
    HighDPFRatio The fraction of basecalls filtered out at a site is greater than 0.4
    HighSNVSB Sample SNV strand bias value (SB) exceeds 10
    HighDepth Locus depth is greater than 3x the mean chromosome depth
    LowDepth Locus depth is below
    NotGenotyped Locus contains forcedGT input alleles which could not be genotyped
    PloidyConflict Genotype call from variant caller not consistent with chromosome ploidy
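The FILTER column can be split into its individual codes as follows. This is a minimal sketch with hypothetical function names; it treats `.` (filters not applied, per the VCF specification) the same as an empty list of failed filters.

```python
def failed_filters(filter_field):
    """Return the list of filter codes that a site failed."""
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")


def is_pass(filter_field):
    """True only when the site passed all filters."""
    return filter_field == "PASS"
```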

      Additional information: INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>. The exact format of each INFO sub-field should be specified in the meta-information.

    Tag Description
    END End position of the region described in this record
    BLOCKAVG_min30p3a Non-variant multi-site block. Non-variant blocks are defined independently for each sample. All sites in such a block are constrained to be non-variant, have the same filter value, and have sample values {GQX,DP,DPF} in range [x,y], y <= max(x+3,(x*1.3)).
    SNVHPOL SNV contextual homopolymer length
    CIGAR CIGAR alignment for each alternate indel allele
    RU Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs are not reported if longer than 20 bases
    REFREP Number of times RU is repeated in reference
    IDREP Number of times RU is repeated in indel allele
    MQ RMS of mapping quality
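The `<key>=<data>` encoding of the INFO column can be unpacked into a dictionary. This is a minimal sketch; the function name is hypothetical, values are kept as strings, and keys without a value (flags) are stored as `True`.

```python
def parse_info(info_field):
    """Parse a semicolon-separated INFO string into a dict."""
    info = {}
    if info_field in ("", "."):
        return info
    for entry in info_field.split(";"):
        key, separator, value = entry.partition("=")
        # A key without '=' is a flag, e.g. a non-variant block marker.
        info[key] = value if separator else True
    return info
```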

      FORMAT tags:

    Tag Description
    GT Genotype
       0/0 - the sample is homozygous reference
       0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
       1/1 - the sample is homozygous alternate
    GQ Genotype quality
    GQX Empirically calibrated genotype quality score for variant sites, otherwise minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}
    DP Filtered basecall depth used for site genotyping. In a non-variant multi-site block this value represents the average of all sites in the block.
    DPF Basecalls filtered from input prior to site genotyping. In a non-variant multi-site block this value represents the average of all sites in the block.
    MIN_DP Minimum filtered basecall depth used for site genotyping within a non-variant multi-site block
    AD Allelic depths for the ref and alt alleles in the order listed. For indels this value only includes reads which confidently support each allele (posterior prob 0.51 or higher that read contains indicated allele vs all other intersecting indel alleles)
    ADF Allelic depths on the forward strand
    ADR Allelic depths on the reverse strand
    FT Sample filter, 'PASS' indicates that all filters have passed for this sample
    DPI Read depth associated with indel, taken from the site preceding the indel
    PL Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
    PS Phase set identifier
    SB Sample site strand bias
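The per-sample tags are stored as colon-separated values whose order is given by the FORMAT column. The sketch below pairs the two columns and labels the GT values listed in the table. The function names are hypothetical, and the genotype labels assume diploid calls with a single ALT allele.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def describe_genotype(gt):
    """Describe a GT value such as 0/0, 0/1 or 1/1."""
    alleles = gt.replace("|", "/").split("/")  # accept phased genotypes too
    if "." in alleles:
        return "no call"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def allelic_depths(sample):
    """Return the AD values (ref first, then alt alleles) as integers."""
    return [int(depth) for depth in sample["AD"].split(",")]
```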

- CNV File

      The Sample1_CNVs.xlsx file lists the coordinates of the predicted copy number alterations.

    Header Description
    Chromosome Chromosome
    Start Start position
    End End position
    Predicted copy number The number of copies
    Type of alteration Types of CNV (gain, loss)
    Gene Gene annotation in the CNV regions

- SV File

      Example :

    Header Description
    #CHROM Chromosome
    POS Position (with the 1st base having position 1)
    ID Annotation, in the case of BND ('breakend') records for translocations, the ID value is used to link breakend mates or partners.
    REF, ALT All variants are reported in the VCF using symbolic alleles unless they are classified as a small indel, in which case full sequences are provided for the VCF REF and ALT allele fields. A variant is classified as a small indel if all of these criteria are met:
      - The variant can be entirely expressed as a combination of inserted and deleted sequence.
      - The deletion or insertion length is not 1000 or greater.
      - The variant breakends and/or the inserted sequence are not imprecise.
    QUAL A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors).
    FILTER See FILTER tag table for possible entries.
    INFO See INFO tag table for possible entries.
    FORMAT See FORMAT tag table for possible entries.
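Because the ALT column of the SV file can hold a symbolic allele, a breakend (BND) string, or a full sequence for small indels, it can help to tell the three apart before further processing. This is a minimal sketch based on the notation in the VCF specification (`<ID>` for symbolic alleles, square brackets for breakends); the function name and the returned labels are hypothetical.

```python
def classify_alt(alt):
    """Classify one ALT allele by its notation."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"   # e.g. <DEL>, <DUP:TANDEM>, <INS>
    if "[" in alt or "]" in alt:
        return "breakend"   # BND record, e.g. G]chr2:321681]
    return "sequence"       # explicit bases, as used for small indels
```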

      FILTER tags:

    Tag Description
    Ploidy For DEL & DUP variants, the genotypes of overlapping variants (with similar size) are inconsistent with diploid expectation
    MaxDepth Depth is greater than 3x the median chromosome depth near one or both variant breakends
    MaxMQ0Frac For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4
    NoPairSupport For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample.
    MinQUAL QUAL score is less than 20
    MinGQ GQ score is less than 15 (filter applied at sample level and record level if all samples are filtered)
    MinSomaticScore SOMATICSCORE is less than 30
    SampleFT No sample passes all the sample-level filters
    HomRef Homozygous reference call

      INFO tags:

    Tag Description
    IMPRECISE Imprecise structural variation
    SVTYPE Type of structural variant
    SVLEN Difference in length between REF and ALT alleles
    END End position of the variant described in this record
    CIPOS Confidence interval around POS
    CIEND Confidence interval around END
    CIGAR CIGAR alignment for each alternate indel allele
    MATEID ID of mate breakend
    EVENT ID of event associated to breakend
    HOMLEN Length of base pair identical homology at event breakpoints
    HOMSEQ Sequence of base pair identical homology at event breakpoints
    SVINSLEN Length of insertion
    SVINSSEQ Sequence of insertion
    LEFT_SVINSSEQ Known left side of insertion for an insertion of unknown length
    RIGHT_SVINSSEQ Known right side of insertion for an insertion of unknown length
    INV3 Inversion breakends open 3' of reported location
    INV5 Inversion breakends open 5' of reported location
    BND_DEPTH Read depth at local translocation breakend
    MATE_BND_DEPTH Read depth at remote translocation mate breakend
    JUNCTION_QUAL If the SV junction is part of an EVENT (i.e. a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only
    SOMATIC Flag indicating a somatic variant
    SOMATICSCORE Somatic variant quality score
    JUNCTION_SOMATICSCORE If the SV junction is part of an EVENT (i.e. a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only
    CONTIG Assembled contig sequence, if the variant is not imprecise (with --outputContig)

      FORMAT tags:

    Tag Description
    GT Genotype
    FT Sample filter, 'PASS' indicates that all filters have passed for this sample
    GQ Genotype Quality
    PL Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
    PR Spanning paired-read support for the ref and alt alleles in the order listed
    SR Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read)>0.999
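The PR and SR tags each hold a "ref,alt" pair of read counts. The sketch below sums them for one sample and reports the share of reads supporting the alternate allele. The function names are hypothetical, and the fraction is only an illustrative summary under that assumption, not a metric defined by the SV caller.

```python
def parse_ref_alt(value):
    """Split a 'ref,alt' count string such as '10,5' into two integers."""
    ref, alt = (int(count) for count in value.split(","))
    return ref, alt


def alt_support_fraction(sample):
    """Fraction of paired and split reads that support the ALT allele.

    `sample` maps FORMAT tags to string values, e.g. {"PR": "10,5", "SR": "5,5"}.
    Returns None when no supporting reads are recorded at all.
    """
    ref_total = alt_total = 0
    for tag in ("PR", "SR"):
        value = sample.get(tag)
        if value in (None, "."):
            continue  # SR may be absent, for example for imprecise variants
        ref, alt = parse_ref_alt(value)
        ref_total += ref
        alt_total += alt
    total = ref_total + alt_total
    return alt_total / total if total else None
```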