Analysis Workflow



  [Exome/Target analysis flow chart]




   The Illumina platform generates raw images and performs base calling with its integrated primary analysis software, RTA. The base call files, which are stored in binary format, are converted to FASTQ with the Illumina package bcl2fastq v2.20.0. The demultiplexing option --barcode-mismatches is set to 0, so a read is assigned to a sample only when its index sequence matches that sample's barcode exactly. FastQC is then used to check the sequencing quality.
  [ LINK ]
  https://www.bioinformatics.babraham.ac.uk/projects/fastqc
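
   As a rough illustration of what a --barcode-mismatches value of 0 implies, the sketch below assigns an index read to a sample only on an exact barcode match. It is not how bcl2fastq is implemented; the function names and the sample barcodes are hypothetical.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, barcodes: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Return the sample whose barcode matches, or None if unassigned.

    A read is assigned only when exactly one barcode lies within
    max_mismatches of the index read.
    """
    hits = [sample for sample, barcode in barcodes.items()
            if hamming(index_read, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: sample name -> index sequence.
SAMPLE_BARCODES = {"sample_A": "ACGTACGT", "sample_B": "TGCATGCA"}

print(assign_sample("ACGTACGT", SAMPLE_BARCODES))  # exact match
print(assign_sample("ACGTACGA", SAMPLE_BARCODES))  # one mismatch
```

   In this example the read ACGTACGA stays unassigned with zero allowed mismatches, and would go to sample_A only if max_mismatches were raised to 1.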

   Paired-end sequences produced by the HiSeq instrument are first mapped to the human reference genome with the mapping program BWA. Of the three algorithms that BWA provides, BWA-MEM is used. The mapping result is written in BAM format and excludes unordered sequences and alternate haplotypes.
  [ LINK ]
  http://hgdownload.cse.ucsc.edu/goldenPath
  http://bio-bwa.sourceforge.net/bwa.shtml
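
   The mapping step can be sketched as a small helper that builds a paired-end bwa mem command line. The file names, read group fields, and thread count are placeholders, not the settings used for this analysis. Note that bwa mem writes SAM to standard output, so the conversion to BAM would be done by a separate tool that is not named here.

```python
from typing import List


def bwa_mem_command(reference: str, fastq_r1: str, fastq_r2: str,
                    sample: str, threads: int = 4) -> List[str]:
    """Build the argument list for a paired-end `bwa mem` run."""
    # bwa converts the literal two-character sequence "\t" in -R to a tab,
    # so a raw string is used to keep the backslashes intact.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"
    return ["bwa", "mem",
            "-t", str(threads),   # number of threads
            "-R", read_group,     # read group header line
            reference, fastq_r1, fastq_r2]


# Placeholder inputs for illustration only.
cmd = bwa_mem_command("ref.fa", "sample_1.fastq.gz", "sample_2.fastq.gz",
                      sample="sample")
print(" ".join(cmd))
```

   Returning a list rather than a single string lets the command be passed to subprocess.run without shell quoting problems.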

   PCR duplicates are removed with MarkDuplicates.jar from the Picard-tools package, which requires the reads to be sorted. Reads with identical starting positions are considered duplicates and are reduced to a single read.
  [ LINK ]
  http://broadinstitute.github.io/picard/
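
   The idea of collapsing reads that share a starting position can be sketched as follows. This is a simplified single-end model, not Picard's algorithm: the Read tuple, the grouping key, and the choice to keep the read with the highest summed base quality are assumptions made for the example.

```python
from collections import namedtuple
from typing import Dict, List, Tuple

# Minimal stand-in for an aligned read (hypothetical structure).
Read = namedtuple("Read", "name chrom start strand base_quals")


def remove_duplicates(reads: List[Read]) -> List[Read]:
    """Keep one read per (chrom, start, strand) group.

    Within a group, the read with the highest summed base quality is kept.
    """
    best: Dict[Tuple[str, int, str], Read] = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        kept = best.get(key)
        if kept is None or sum(read.base_quals) > sum(kept.base_quals):
            best[key] = read
    # Return the survivors in coordinate order.
    return sorted(best.values(), key=lambda r: (r.chrom, r.start))


reads = [
    Read("r1", "chr1", 100, "+", [30, 30, 30]),
    Read("r2", "chr1", 100, "+", [35, 35, 35]),  # duplicate of r1, better quality
    Read("r3", "chr1", 250, "+", [30, 30, 30]),
]
deduped = remove_duplicates(reads)
print([r.name for r in deduped])
```

   Here r1 and r2 start at the same position on the same strand, so only r2 is retained alongside r3.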

   The BAM files are then recalibrated with Base Quality Score Recalibration (BQSR). BQSR uses machine learning to build an empirical model of the sequencing errors and adjusts the base quality scores accordingly.
  [ LINK ]
  http://www.broadinstitute.org/gatk
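
   The principle behind recalibration can be shown with a minimal sketch that compares reported quality scores with observed mismatch rates. It bins by reported quality only and uses add-one smoothing, both of which are simplifying assumptions; GATK's model uses additional covariates and is not reproduced here.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def empirical_qualities(observations: Iterable[Tuple[int, bool]]) -> Dict[int, int]:
    """Map each reported Phred score to an empirically estimated one.

    Each observation is (reported_quality, is_mismatch), where is_mismatch
    says whether the base disagrees with the reference at a site that is
    not a known variant.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)

    recalibrated = {}
    for q, n in totals.items():
        # Add-one smoothing avoids log10(0); this is an assumption of the sketch.
        error_rate = (errors[q] + 1) / (n + 2)
        recalibrated[q] = round(-10 * math.log10(error_rate))
    return recalibrated


# 1000 bases reported as Q30, of which 9 are mismatches.
observations = [(30, True)] * 9 + [(30, False)] * 991
print(empirical_qualities(observations))
```

   In this example the bases reported as Q30 show an error rate close to 1 in 100, so they are re-estimated at about Q20.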

   Using the BAM file generated in the previous step, variants are genotyped for each sample with GATK HaplotypeCaller. At this stage, candidate SNPs and short indels are detected at nucleotide resolution.
  [ LINK ]
  http://www.broadinstitute.org/gatk
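
   To give a feel for per-sample genotyping, the sketch below picks the most likely diploid genotype at a single site from reference and alternate allele counts. It is a deliberately simplified binomial model and not HaplotypeCaller's method; the function names and the 1% error rate are assumptions.

```python
import math
from typing import Dict


def genotype_log_likelihoods(ref_count: int, alt_count: int,
                             error_rate: float = 0.01) -> Dict[str, float]:
    """Log-likelihood of the observed allele counts under each genotype."""
    # Probability of observing an ALT base given the genotype.
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        genotype: alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        for genotype, p in alt_prob.items()
    }


def call_genotype(ref_count: int, alt_count: int,
                  error_rate: float = 0.01) -> str:
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(ref_count=10, alt_count=9))   # balanced counts
print(call_genotype(ref_count=20, alt_count=0))   # reference only
print(call_genotype(ref_count=0, alt_count=15))   # alternate only
```

   Balanced counts favour the heterozygous genotype 0/1, while counts dominated by one allele favour the corresponding homozygous genotype.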

   Variants are filtered with the GATK VariantFiltration tool, which is designed for hard-filtering variant calls against specified criteria. A record is hard-filtered by changing the value of its FILTER field to something other than PASS. Filtered records are preserved in the output unless their removal is requested on the command line.
  [ LINK ]
  http://www.broadinstitute.org/gatk
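
   The hard-filtering behaviour described above can be sketched as follows: failing records get the names of the failed filters in FILTER, passing records get PASS, and failing records are dropped only on request. The filter names and thresholds are hypothetical examples, since the criteria used in this analysis are not stated.

```python
from typing import Callable, Dict, List

# Hypothetical filter criteria for illustration only.
FILTERS: Dict[str, Callable[[dict], bool]] = {
    "LowQD": lambda info: info["QD"] < 2.0,
    "HighFS": lambda info: info["FS"] > 60.0,
    "LowMQ": lambda info: info["MQ"] < 40.0,
}


def apply_hard_filters(records: List[dict],
                       exclude_filtered: bool = False) -> List[dict]:
    """Set FILTER on each record; optionally drop the records that fail."""
    output = []
    for record in records:
        failed = [name for name, test in FILTERS.items() if test(record["INFO"])]
        if failed and exclude_filtered:
            continue
        # Multiple failed filters are joined with ';' as in the VCF format.
        output.append(dict(record, FILTER=";".join(failed) if failed else "PASS"))
    return output


records = [
    {"CHROM": "chr1", "POS": 1000, "INFO": {"QD": 12.5, "FS": 3.1, "MQ": 60.0}},
    {"CHROM": "chr1", "POS": 2000, "INFO": {"QD": 1.5, "FS": 75.0, "MQ": 58.0}},
]
for rec in apply_hard_filters(records):
    print(rec["POS"], rec["FILTER"])
```

   By default both records remain in the output and differ only in their FILTER value; passing exclude_filtered=True removes the failing record.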

   Filtered variants are annotated with SnpEff and filtered against dbSNP and SNPs from the 1000 Genomes Project. The final product is in VCF format. An in-house program and SnpEff are then used to add annotations from further databases, including ESP6500, ClinVar, dbNSFP, and ACMG information.
  [ LINK ]
  http://snpeff.sourceforge.net/SnpEff.html
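
   For readers who want to work with the annotated VCF, the sketch below reads one data line and extracts the first four subfields of a SnpEff annotation. It assumes the annotations are stored in an INFO field named ANN with '|' separated subfields (older SnpEff versions use a different field), and it treats an ID starting with "rs" as a dbSNP entry. The example line is made up for illustration.

```python
from typing import Dict, List

# First four subfields of an ANN entry (assumed layout).
ANN_FIELDS = ["allele", "effect", "impact", "gene_name"]


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the fixed columns and ANN entries of one VCF data line."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, _qual, flt, info = columns[:8]

    # INFO is a ';' separated list of KEY=VALUE items (flags have no '=').
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)

    annotations: List[Dict[str, str]] = []
    for entry in info_map.get("ANN", "").split(","):
        if entry:
            annotations.append(dict(zip(ANN_FIELDS, entry.split("|"))))

    return {
        "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
        "filter": flt,
        "in_dbsnp": var_id.startswith("rs"),
        "annotations": annotations,
    }


# Made-up example record, not real data.
example = ("chr1\t12345\trs0000001\tA\tG\t50\tPASS\t"
           "DP=30;ANN=G|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|"
           "protein_coding|2/5|c.10A>G|p.Thr4Ala|10/1000|10/900|4/300||")
parsed = parse_vcf_line(example)
print(parsed["in_dbsnp"], parsed["annotations"][0])
```

   Annotations from the other databases listed above would appear as further INFO keys and could be read from info_map in the same way.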