Analysis Tools

  BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment. However, BWA-MEM, the most recent of the three, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack for 70-100bp Illumina reads.

  For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). The alignment algorithms are then invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for BWA-MEM.
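  The two steps above (indexing, then alignment with the mem sub-command) can be sketched as command lines assembled in Python. This is a minimal sketch rather than the exact invocation used for this report: the file names are hypothetical, paired-end input is assumed, and the helper functions are illustrative only. The -M option is the one listed under Tool Parameters.

```python
from typing import List


def bwa_index_cmd(reference: str) -> List[str]:
    """Argument list that builds the FM-index for a reference FASTA."""
    return ["bwa", "index", reference]


def bwa_mem_cmd(reference: str, reads1: str, reads2: str,
                mark_secondary: bool = True) -> List[str]:
    """Argument list for a paired-end BWA-MEM alignment (SAM goes to stdout)."""
    cmd = ["bwa", "mem"]
    if mark_secondary:
        # -M: mark shorter split hits as secondary (for Picard compatibility)
        cmd.append("-M")
    cmd += [reference, reads1, reads2]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    for cmd in (bwa_index_cmd("ref.fa"),
                bwa_mem_cmd("ref.fa", "sample_R1.fq", "sample_R2.fq")):
        print(" ".join(cmd))
```

  Building the commands as lists keeps them easy to inspect before handing them to a process launcher such as subprocess.run.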

  More information can be found here:

  Picard is a collection of Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. Picard MarkDuplicates examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged.
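  Flagged duplicates can be recognised in the output through the SAM FLAG field, where bit 0x400 marks a PCR or optical duplicate. The sketch below counts flagged records in SAM text; the two records are made up for illustration, and the helper names are hypothetical rather than part of Picard.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_duplicate(sam_line: str) -> bool:
    """Return True if the alignment record has the duplicate bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & DUPLICATE_FLAG)


def count_duplicates(sam_lines):
    """Return (duplicates, total) over alignment records, skipping headers."""
    total = dups = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header line or blank line
        total += 1
        if is_duplicate(line):
            dups += 1
    return dups, total


if __name__ == "__main__":
    # Made-up records: FLAG 99 is not a duplicate, FLAG 1123 (99 + 1024) is.
    example = [
        "@HD\tVN:1.6\tSO:coordinate",
        "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
        "read2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    ]
    print(count_duplicates(example))
```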

  More information can be found here:

  The Genome Analysis Toolkit, or GATK, is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as a strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

  HaplotypeCaller calls SNPs and indels simultaneously via local re-assembly of haplotypes in an active region.

  More information can be found here:

  SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of variants on genes (such as amino acid changes).

  SnpEff can generate the following results:

    - Genes and transcripts affected by the variant

    - Location of the variants

    - How the variant affects protein synthesis (e.g. by introducing a stop codon)

    - Comparison with other databases to find matching known variants
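  As an illustration of how these results are typically read back, the sketch below parses a SnpEff annotation from the INFO column of a VCF record. It assumes the pipe-separated ANN field used by recent SnpEff versions (older versions wrote an EFF field instead); the SnpEff version used for this report is not stated, and the example INFO string and gene names are made up.

```python
# Leading sub-fields of an ANN entry, in order (later sub-fields are ignored here).
ANN_KEYS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p",
]


def parse_ann(info: str):
    """Return one dict per annotation found in the ANN entry of a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            records = []
            # Multiple annotations for one variant are comma-separated.
            for ann in entry[len("ANN="):].split(","):
                records.append(dict(zip(ANN_KEYS, ann.split("|"))))
            return records
    return []  # no ANN entry present


if __name__ == "__main__":
    # Made-up INFO string with two annotations for one variant.
    info = ("DP=30;ANN=A|stop_gained|HIGH|GENE1|G1|transcript|TX1|"
            "protein_coding|2/5|c.100C>T|p.Gln34*|||||,"
            "A|upstream_gene_variant|MODIFIER|GENE2|G2|transcript|TX2|"
            "protein_coding||c.-50C>T||||||")
    for record in parse_ann(info):
        print(record["gene_name"], record["annotation"], record["impact"])
```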

  More information can be found here:

◦ Tool Version

Software Version

◦ Tool Parameters

Software | Parameter             | Value      | Remark
BWA-MEM  | -M                    |            | Mark shorter split hits as secondary (for Picard compatibility)
Picard   | VALIDATION_STRINGENCY | LENIENT    | Lenient validation stringency (improves performance)
Picard   | SO                    | coordinate | Sort order
Picard   | AS                    | true       | Assume sorted
Picard   | CREATE_INDEX          | true       | Create index files
GATK     | BaseRecalibrator      |            | Generate the first-pass recalibration table
GATK     | HaplotypeCaller       |            | Call SNPs and indels simultaneously via local re-assembly of haplotypes in an active region
GATK     | SelectVariants        |            | Select variants from a VCF source
GATK     | VariantFiltration     |            | Filter variant calls using a number of user-selectable, parameterizable criteria
GATK     | CombineVariants       |            | Combine VCF records from different sources
GATK     | -knownSites           |            | Database of known polymorphic sites
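
  To show how the GATK entries above fit together, here is a hedged sketch that assembles BaseRecalibrator and HaplotypeCaller command lines. It assumes GATK 3.x-style syntax (java -jar GenomeAnalysisTK.jar -T <tool>), which the presence of CombineVariants and -knownSites suggests but the report does not confirm. The jar path, file names and helper functions are hypothetical.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed GATK 3.x jar name; adjust to the local install


def base_recalibrator_cmd(reference: str, bam: str, known_sites: str,
                          out_table: str) -> List[str]:
    """First-pass recalibration table from a BAM and a known-sites VCF."""
    return ["java", "-jar", GATK_JAR, "-T", "BaseRecalibrator",
            "-R", reference, "-I", bam,
            "-knownSites", known_sites,  # database of known polymorphic sites
            "-o", out_table]


def haplotype_caller_cmd(reference: str, bam: str, out_vcf: str) -> List[str]:
    """SNP and indel calling via local re-assembly of haplotypes."""
    return ["java", "-jar", GATK_JAR, "-T", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-o", out_vcf]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(base_recalibrator_cmd("ref.fa", "dedup.bam",
                                         "known_sites.vcf", "recal.table")))
    print(" ".join(haplotype_caller_cmd("ref.fa", "recal.bam", "raw.vcf")))
```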

◦ Database Version

Software Version