Analysis Workflow

  Whole genome resequencing analysis process

  First, paired-end sequences generated by the HiSeq instrument are mapped to the human genome with the Isaac aligner (iSAAC-), using the UCSC assembly hg38 (originally GRCh38 from NCBI, Dec. 2013) as the reference sequence. The Isaac aligner identifies and selects the best mapping candidates using a 32-mer seed-based search. Low-quality 3’ ends and adapter sequences are trimmed from the alignments. The Isaac aligner generates a binary alignment output file (.bam) that contains sorted, duplicate-marked data.
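
  The idea of a seed-based candidate search can be illustrated with a toy sketch. This is not the Isaac aligner's implementation: the function names build_seed_index and candidate_positions are hypothetical, only exact seed matches are considered, and the example uses a short seed length (k = 4) on a made-up reference instead of 32-mers on hg38.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Return candidate alignment starts, most seed-supported first."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        # Each exact seed hit votes for the read start it implies.
        for hit in index.get(seed, ()):
            votes[hit - offset] += 1
    return sorted(votes, key=lambda start: (-votes[start], start))


# Toy data (assumption): a made-up reference and a read copied from it.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
read = reference[8:20]
index = build_seed_index(reference, k=4)
best_start = candidate_positions(read, index, k=4)[0]
```

  In this toy case the top candidate is the position the read was copied from. A real aligner would then score and extend each candidate instead of relying on seed counts alone.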

  Strelka (2.9.10) is run to identify single-nucleotide variants (SNVs) and short insertions and deletions (indels). During read processing, low-quality reads and PCR duplicates are filtered out. Strelka realigns reads to increase accuracy and uses its germline probability model for variant genotyping. The result is a block-compressed genomic variant call format (gVCF) file that contains all information on these variants. The extract_variants utility from gvcftools then generates a variant-only VCF by removing all non-variant blocks from the gVCF file and filtering out low-quality and high-depth variants. Variants in the variant-only VCF file are annotated with SnpEff (v4.3t). Then an in-house program and SnpEff are applied to annotate the VCF file with additional databases, including ESP6500, ClinVar, and dbNSFP3.5.
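
  The step from gVCF to variant-only VCF can be sketched as follows. This is a minimal illustration, not the gvcftools code: it assumes that non-variant blocks carry "." in the ALT column and that low-quality or high-depth records carry a FILTER value other than PASS. The function name extract_variant_records and the sample records are hypothetical.

```python
def extract_variant_records(gvcf_lines):
    """Yield header lines and PASS variant records; skip non-variant blocks."""
    for line in gvcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        alt, filt = fields[4], fields[6]
        if alt == ".":
            continue  # assumed marker of a non-variant (reference) block
        if filt != "PASS":
            continue  # assumed marker of a filtered (e.g. low-quality) call
        yield line


# Made-up records for illustration only.
sample = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "chr1\t100\t.\tA\t.\t.\tPASS\tEND=200\tGT\t0/0\n",
    "chr1\t201\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\n",
    "chr1\t300\t.\tG\tA\t5\tLowGQX\t.\tGT\t0/1\n",
]
variant_only = list(extract_variant_records(sample))
```

  Only the two header lines and the record at position 201 remain in variant_only. A compressed gVCF could be streamed through the same function with gzip.open(path, "rt") from the standard library.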

  To identify structural variants and large indels, Manta (1.5.0) is used. Its configuration step registers the input data and options before the workflow is executed on a single node. Structural variant calling for whole-genome sequencing is performed with Manta's default options. Control-FREEC (11.5) is run to identify copy number variants (CNVs) with a window size of 10,000 bp and no additional options. Control-FREEC also normalizes read counts for GC-content bias, with sex set to XY. CNV types are classified relative to a genome ploidy of 2: a copy number below 2 is a loss and above 2 is a gain.
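
  The gain/loss rule can be written out as a short sketch. The function name classify_cnv, the "neutral" label for segments at the ploidy value, and the example segments are assumptions for illustration, not Control-FREEC output.

```python
def classify_cnv(copy_number, ploidy=2):
    """Classify a segment by comparing its copy number with the genome ploidy."""
    if copy_number < ploidy:
        return "loss"
    if copy_number > ploidy:
        return "gain"
    return "neutral"  # label assumed here; such segments are not CNVs


# Made-up segments: (chromosome, start, end, predicted copy number).
segments = [
    ("chr1", 10_000, 50_000, 1),
    ("chr2", 120_000, 180_000, 3),
    ("chr3", 40_000, 90_000, 2),
]
calls = [(chrom, start, end, classify_cnv(cn)) for chrom, start, end, cn in segments]
```

  With these inputs the three segments are labelled loss, gain, and neutral, respectively.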