Department of Bioinformatics and Computational Biology

MuSE

From MD Anderson Bioinformatics

Overview
Description: Somatic point mutation caller for tumor-normal paired samples in next-generation sequencing data.

Development Information
Language: C/C++
Current Version: 1.0rc
Platforms: Platform independent
License: GNU GPL Version 2
Status: Active

References
Citations: Fan, Y., Xi, L., Hughes, D. S. T., Zhang, J., Zhang, J., Futreal, P. A., Wheeler, D. A., and Wang, W. Accounting for inter-tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling for sequencing data. Genome Biology 17, 178 (2016). DOI: 10.1186/s13059-016-1029-6.

Help and Support
Contact: Wenyi Wang
Discussion: Project Forum



The detection of somatic point mutations is a key component of cancer genomic research, which has been rapidly developing since next-generation sequencing (NGS) technology revealed its potential for describing genetic alterations in cancer. We present MuSE, a novel approach to mutation calling based on the F81 Markov substitution model for molecular evolution [1], which models the evolution of the reference allele to the allelic composition of the matched tumor and normal tissue at each genomic locus. To improve overall accuracy, we further adopt a sample-specific error model to identify cutoffs, reflecting the variation in tumor heterogeneity among samples.

Download

Source File: https://github.com/danielfan/MuSE

Linux Binary File: MuSEv1.0rc_b MuSEv1.0rc_c

Installation

After downloading the source file on a Unix-like operating system, run the following commands in order to build the executable:

unzip MuSEv1.0rc.zip
cd MuSEv1.0rc
make

For Windows, please install Cygwin (http://www.cygwin.com) first, which provides functionality similar to a Linux distribution on Windows. The remaining steps are the same as above.

Input Data

MuSE runs in two steps, which together require

  • (1) the indexed reference genome FASTA file,
  • (2) the binary sequence alignment/map formatted (BAM) sequence data from the pair of tumor and normal DNA samples, and
  • (3) the dbSNP variant call format (VCF) file that should be bgzip compressed, tabix indexed and based on the same reference genome as (1).
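Before running MuSE, it can help to confirm that the index files for (1) and (3) are in place. The sketch below is only an illustration: the function name `check_muse_inputs` is hypothetical, and it assumes the indexes were created with `samtools faidx` and `tabix` using their default output names (`.fai` and `.tbi`), which the text above does not specify.

```shell
# Report which index files are missing for the MuSE inputs.
# $1 = reference FASTA, $2 = bgzip-compressed dbSNP VCF.
# Assumed ways to create the indexes (not prescribed by MuSE itself):
#   samtools faidx Reference.Genome
#   bgzip dbsnp.vcf && tabix -p vcf dbsnp.vcf.gz
check_muse_inputs() {
    status=0
    # samtools faidx writes <fasta>.fai next to the FASTA file.
    [ -f "$1.fai" ] || { echo "missing FASTA index: $1.fai" >&2; status=1; }
    # tabix writes <vcf.gz>.tbi by default.
    [ -f "$2.tbi" ] || { echo "missing tabix index: $2.tbi" >&2; status=1; }
    return $status
}
```

For example, `check_muse_inputs Reference.Genome dbsnp.vcf.gz` returns a non-zero status and names each missing index on standard error.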

The first step, ‘MuSE call’, takes as input (1) and (2). The BAM files must be produced by aligning all sequence reads to the reference genome with the Burrows-Wheeler alignment tool (BWA), using either the backtrack or the maximal exact matches (MEM) algorithm [2]. In addition, the BAM files need to be processed following the Genome Analysis Toolkit (GATK) Best Practices [3-5], which include marking duplicates, realigning the paired tumor-normal BAMs jointly, and recalibrating base quality scores.

To speed up ‘MuSE call’, we recommend splitting whole-genome sequencing (WGS) data into small blocks (<50Mb) with either the ‘-r’ or the ‘-l’ option, and then concatenating all the output files with the Linux command cat.
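One way to generate such blocks is to cut each sequence listed in the FASTA index into windows of at most 50 Mb. The following is a minimal sketch, not part of MuSE: the function names and the `block_NNNN` output prefixes are hypothetical, and the `chr:start-end` region syntax passed to ‘-r’ is an assumption that should be checked against the usage message of ‘MuSE call’. The second function only prints the commands so they can be reviewed or handed to a job scheduler.

```shell
# Emit regions of at most BLOCK bases from a FASTA index (.fai):
# column 1 is the sequence name, column 2 is its length.
# $1 = .fai file, $2 = block size in bases (default 50000000).
make_regions() {
    fai=$1
    block=${2:-50000000}
    awk -v block="$block" '{
        for (start = 1; start <= $2; start += block) {
            end = start + block - 1
            if (end > $2) end = $2
            printf "%s:%d-%d\n", $1, start, end
        }
    }' "$fai"
}

# Print (rather than run) one "MuSE call" command per region.
# $1 = .fai file, $2 = reference FASTA, $3 = tumor BAM,
# $4 = normal BAM, $5 = optional block size.
print_call_commands() {
    i=0
    make_regions "$1" "${5:-50000000}" | while read -r region; do
        i=$((i + 1))
        # Zero-padded prefixes keep shell glob order equal to region order.
        printf './MuSE call -O block_%04d -f %s -r %s %s %s\n' \
            "$i" "$2" "$region" "$3" "$4"
    done
}
```

After all blocks finish, `cat block_*.MuSE.txt > Output.Prefix.MuSE.txt` joins the per-block results in region order, because the zero-padded names sort in the order the regions were generated.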

The second step, ‘MuSE sump’, takes as input the output file from ‘MuSE call’ and (3). We provide two options for building the sample-specific error model: one is applicable to whole-exome sequencing (WES) data (option ‘-E’), and the other to WGS data (option ‘-G’).

Example Commands

The following commands briefly illustrate how to use MuSE. For the preparation of BAM files, please refer to the first part, PRE-PROCESSING, of the Genome Analysis Toolkit (GATK) Best Practices (http://www.broadinstitute.org/gatk/guide/best-practices).

./MuSE call -O Output.Prefix -f Reference.Genome Tumor.bam Matched.Normal.bam
./MuSE sump -I Output.Prefix.MuSE.txt -G -O Output.Prefix.vcf -D dbsnp.vcf.gz
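The two steps can also be assembled by a small helper that selects the error-model option from the data type. This is a sketch under stated assumptions: the function name `muse_commands` and its argument order are hypothetical, only the options shown on this page (‘-O’, ‘-f’, ‘-I’, ‘-E’, ‘-G’, ‘-D’) are used, and the commands are printed rather than executed. File names containing spaces are not handled.

```shell
# Print the two MuSE commands for one tumor-normal pair.
# $1 = data type (WES or WGS), $2 = output prefix, $3 = reference FASTA,
# $4 = tumor BAM, $5 = matched normal BAM, $6 = dbSNP VCF (bgzip + tabix).
muse_commands() {
    case "$1" in
        WES) model=-E ;;   # error model for whole-exome data
        WGS) model=-G ;;   # error model for whole-genome data
        *) echo "data type must be WES or WGS, got: $1" >&2; return 1 ;;
    esac
    # Step 1 writes <prefix>.MuSE.txt, which step 2 reads via -I.
    echo "./MuSE call -O $2 -f $3 $4 $5"
    echo "./MuSE sump -I $2.MuSE.txt $model -O $2.vcf -D $6"
}
```

Calling `muse_commands WES Output.Prefix Reference.Genome Tumor.bam Matched.Normal.bam dbsnp.vcf.gz` prints the same pair of commands as above with ‘-E’ in place of ‘-G’.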

Output

The final output of MuSE is a VCF file that lists the identified somatic variants.
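A quick way to inspect the result is to count the variant records and tabulate them by the FILTER column. The sketch below relies only on the general VCF layout (header lines start with ‘#’, and FILTER is the seventh tab-separated column); the function names are hypothetical, and which FILTER labels MuSE writes is not described here.

```shell
# Count variant records: every line that does not start with '#'.
count_vcf_records() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -vc '^#' "$1" || true
}

# Tabulate records by the FILTER column (column 7 of a VCF record).
filter_counts() {
    awk -F '\t' '!/^#/ { n[$7]++ }
        END { for (f in n) printf "%s\t%d\n", f, n[f] }' "$1" | LC_ALL=C sort
}
```

For example, `count_vcf_records Output.Prefix.vcf` prints the total number of calls, and `filter_counts Output.Prefix.vcf` prints one line per FILTER label with its count.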

References

[1] Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17, 368–376 (1981).
[2] Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 25, 1754–1760 (2009).
[3] McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 1297–1303 (2010).
[4] DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491–498 (2011).
[5] Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics 11, 11.10.1–11.10.33 (2013).