BM-Map

A Powerful Bioinformatics Software Tool for Refining Next-Generation Sequencing (NGS) Read Mapping

Latest Release

BM-Map 2.0.1

12/01/2011

For citation information, please go to Credits.

Publications

Ji, Y, Xu, Y, Zhang, Q, Tsui, K-W, Yuan, Y, Liang, S, Norris, C, Liang, H. BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data. Biometrics. 2011 Dec;67(4):1215-24.

Authors

Yuan Yuan
Clift Norris
Yanxun Xu
Kam-Wah Tsui
Yuan Ji*
Han Liang*

Manual: Table of Contents

What is BM-Map?
What isn't BM-Map?
Why BM-Map?
How can BM-Map improve the gene expression quantification of next-generation sequencing data?
Obtaining BM-Map executable
Preparation before running BM-Map

Input format
Run BM-Map GUI on non-Windows platforms

The parameters in "InputFile.txt"
- Basic parameters
- Advanced parameters
Example input file (test.sam)
Run BM-Map from command line
Output

What is BM-Map?

BM-Map is a refiner for NGS read mapping to genomic loci. It improves the mapping of multireads (reads that map to more than one genomic location with similar fidelity) as a refinement step after general read alignment is complete. It is a multi-platform software tool built on the Bayesian mapping of multireads (BM-Map) algorithm, which computes the posterior probability of mapping each multiread to a genomic location.

What isn't BM-Map?

BM-Map is NOT a general mapping tool. Rather, it refines the results produced by a prevailing mapping tool or aligner (Bowtie, etc.). You therefore need to use the output of an aligner as the input to BM-Map. See the following figure for where BM-Map currently fits in the NGS pipeline.

Why BM-Map?

Currently, the industry standard practice is to discard the multireads in subsequent analyses such as gene expression quantification. This practice generates a large bias in estimating the expression levels of duplicated genes. As an initial attempt, Mortazavi et al. (2008) proposed a proportional alignment method in which unique reads are first mapped, and then multireads are aligned to equally similar loci in proportion to the number of corresponding mapped unique reads. The key idea of the proportional method is that the individual numbers of unique reads are used to infer the probabilities of mapping the multireads. While the proportional method provides a simple and valuable solution to the mapping of the multireads, it fails to account for the mismatch profiles between the unique reads and the genomic locations.
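
As an illustration of the proportional idea described above, here is a minimal Python sketch. It is not BM-Map code and not the implementation of Mortazavi et al.; the function name, the locus labels, the read counts, and the even-split fallback for candidates with no unique reads are all assumptions made for this example.

```python
def proportional_probabilities(unique_counts):
    """Split one multiread across its equally good candidate loci.

    unique_counts maps a (hypothetical) locus label to the number of
    unique reads already mapped there.
    """
    total = sum(unique_counts.values())
    if total == 0:
        # Assumption for this sketch: with no unique reads at any
        # candidate, fall back to an even split.
        n = len(unique_counts)
        return {locus: 1.0 / n for locus in unique_counts}
    return {locus: count / total for locus, count in unique_counts.items()}


# Hypothetical multiread with two equally good candidate loci.
probs = proportional_probabilities({"geneA": 30, "geneB": 10})
print(probs)  # {'geneA': 0.75, 'geneB': 0.25}
```

Note that only the unique-read counts enter the calculation; the mismatches between the reads and each location play no role, which is the limitation discussed above.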

Unlike the proportional method, which only considers the equally best aligned genomic locations, BM-Map evaluates genomic locations with unequal numbers of mismatches to a multiread. More importantly, BM-Map uses three sources of information when mapping the multireads: the sequencing error profiles, the likelihood of hidden nucleotide variations, and the expression levels of competing genomic locations. In contrast, the proportional method uses only the last of these. The key idea of BM-Map is to use the base-level error rates and the observed mismatch profiles from unique reads to estimate the error rate due to hidden nucleotide variations in a hierarchical model. In the end, BM-Map assigns multireads to competing genomic locations based on posterior probabilities.

Because of the extra information BM-Map incorporates during the calculation, it handles multiread allocation particularly well, especially for organisms with a relatively high polymorphism rate.

How can BM-Map improve the gene expression quantification of next-generation sequencing data?

Compared with other mapping methods, gene expression estimates based on BM-Map show a better correlation with an experimental approach, qRT-PCR measurement (see the following figure). The results demonstrate the feasibility of BM-Map and highlight the importance of accurately allocating multireads when quantifying the expression of young human duplicates based on next-generation sequencing. This is an essential step for studying the expression and evolution of young duplicated genes in the human genome.

Obtaining BM-Map executable

Building BM-Map from source (For Linux/MacOS)

Open a terminal, cd to the "BM_Map" directory, and type:
make
This generates the BM_Map executable in the same directory.
The build process was tested on Ubuntu 10 with gcc 4.6.3 and on Mac OS X 10.6 with gcc 4.2.1.

Windows binary (For Windows 7/Vista 64-bit)

For Windows users, the 64-bit Windows binary executable BM-Map.exe is already built and ready for use.

Mac binary (For Mac OS X 10.6)

For Mac OS X 10.6 (Snow Leopard) users, the binary executable BM-Map is already built and ready for use.

Linux binary (For Ubuntu 10.04)

For Ubuntu 10.04 users, the binary executable BM-Map is already built and ready for use.

Preparation before running BM-Map

Input format
BM-Map has strict requirements for its input file format. Currently, the supported file format is SAM for BM-Map 2.0.0. Many prevailing mapping tools can output alignments in SAM format. However, we strongly recommend using Bowtie-produced SAM/Map files as the input for BM-Map, since these have been extensively tested with current versions. Bowtie produces SAM and/or Map output when the corresponding options are specified on its command line (see the Bowtie website for more details).
Run BM-Map GUI on non-Windows platforms
Because the BM-Map GUI is built on the .NET framework under Windows, non-Windows platforms need the open-source software Mono installed beforehand so that the interface displays properly. Mono is available for various platforms and is easy to install. Currently, the GUI has been tested on Mac OS X 10.6 (with Mono 2.10.8 installed) and Ubuntu 10.04 (with mono-xsp installed). However, other Mac/Linux distributions with Mono components supporting .NET 2.0 installed should also be able to run the GUI.

The parameters in "InputFile.txt"

The "InputFile.txt" provides a straightforward way to configure the parameters for BM-Map. Default values are provided; however, most parameters should be customized to the user's needs. Updating the values in "InputFile.txt" appropriately is an important step before running BM-Map.

Starting from 2.0.1, BM-Map also comes with a graphical user interface (GUI), whose parameters are equivalent to, and appear in exactly the same order as, those in "InputFile.txt".

Basic parameters
The parameters in this category are changed often. Some of them, such as MAP_FILENAME, must be updated each time before BM-Map is run.
- MAP_FILENAME
  Specify the path and name of the SAM/Map input file which will be read by BM-Map.
  The default filename is test.sam.
- LOG_FILENAME
  Specify the name of the log (output) file, to which the program progress and summary will be written during execution.
  The default value is BM_Map.log.
- MAX_NUM_MISMATCHES_FOR_TOP_HITS
  A read is considered mappable if it has a top hit with no more than this number of mismatches. In other words, reads whose top hit has more mismatches than this number are discarded immediately and excluded from all downstream calculations.
  The default value is 2.
- EXTRA_NUM_MISMATCHES_FOR_ADDITIONAL_HITS
  Given the top hit of a mappable read, other hits with no more than this number of extra mismatches and no more than three total mismatches are defined as additional competing genomic locations. (Currently, hits with more than three total mismatches are not included because Bowtie only outputs hits with up to three mismatches.)
  The default value is 2.
- SEQ_FLANKING_VALUE
  The number of bases to 'expand' the MultiRead 'SequenceLength' to be used when determining the number of Unique Reads that 'overlay' the MultiRead. This provides more flexibility for calculating prior probability by considering the expression level in flanking regions of competing loci.
  The default value is 0.
- PROB_OUTPUT
  Users can choose one of the following three options (case sensitive):
  PROP: output the less accurate Proportional Probability (less time-consuming)
  BAYES: output the more accurate Bayesian Probability (10-fold more time-consuming than calculating Proportional Probability alone)
  BOTH: output both proportional probability and the Bayesian Probability (takes same amount of time as BAYES)
  The default value is BOTH.
- STRANDNESS_OPTION
  Whether strand (+/-) should be considered when mapping Unique Reads to Multireads. Whether to choose 'ON' or 'OFF' depends on the sequencing platform.
  The default value is OFF.
- NUM_THREADS
  This parameter sets the number of threads of execution (how many jobs will run concurrently on your computer). Usually, this number is an integer between 1 and 6 inclusive. Caution: this MUST be less than the number of processors or logical cores in your computer. For example, the recommended setting is 3 for an Intel i5 CPU and 6 for an Intel i7 CPU.
  The default value is 3.
- POLYMORPHISM_RATE
  This parameter indicates the extent of polymorphic events (SNPs, etc.). BM-Map is expected to have a larger effect in species with high polymorphism frequencies, or in cross-reference situations where RNA-Seq reads from one species without an available genome sequence are mapped to the genome of a closely related species as a surrogate reference. This value should be carefully specified for each species. For example, for humans, POLYMORPHISM_RATE= 0.001.
  The default value is 0.001.
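
The interaction between MAX_NUM_MISMATCHES_FOR_TOP_HITS and EXTRA_NUM_MISMATCHES_FOR_ADDITIONAL_HITS can be sketched in a few lines of Python. This is one reading of the parameter descriptions above, not BM-Map's actual source; the function name, the dictionary input, and the locus labels are hypothetical.

```python
# Hits with more than three total mismatches are never considered (see the
# description of EXTRA_NUM_MISMATCHES_FOR_ADDITIONAL_HITS).
MAX_TOTAL_MISMATCHES = 3


def select_competing_hits(hit_mismatches, max_top=2, extra=2):
    """Return the loci kept as competing locations for one read.

    hit_mismatches maps a (hypothetical) locus label to its mismatch count.
    An empty list means the read is treated as unmappable.
    """
    if not hit_mismatches:
        return []
    top = min(hit_mismatches.values())
    if top > max_top:
        # The best hit already exceeds MAX_NUM_MISMATCHES_FOR_TOP_HITS.
        return []
    # Allow up to `extra` more mismatches than the top hit, capped at three.
    limit = min(top + extra, MAX_TOTAL_MISMATCHES)
    return sorted(locus for locus, mm in hit_mismatches.items() if mm <= limit)


hits = {"chr1:100": 0, "chr5:2000": 2, "chr9:50": 3}
print(select_competing_hits(hits))           # ['chr1:100', 'chr5:2000']
print(select_competing_hits({"chr2:7": 3}))  # []
```

With the default values, a read whose best hit has zero mismatches keeps competitors with up to two mismatches, while a read whose best hit has three mismatches is discarded.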
Advanced parameters
- TOTAL_ITERATIONS
  This is the total number of iterations in the Markov Chain Monte Carlo (MCMC) calculation. Caution: increasing this number will result in a drastic increase in program execution time and memory consumption.
  The default value is 1000.
- BURNIN_ITERATIONS
  "Burn-in" is a colloquial term for the practice of throwing away some iterations at the beginning of the MCMC calculation.
  This value must be less than TOTAL_ITERATIONS. The default value is 200.
- NUM_PARTS
  Specify the number of parts to split large/long-running jobs into. Please note that each of the split sub-jobs requires the same amount of memory as the unsplit job. Therefore, do NOT change the default value of this parameter unless you are running BM-Map on clustered computers. If you simply want to run a few jobs (<= 6) concurrently, change NUM_THREADS instead.
  The default value is 1.
- PART_TO_RUN
  This parameter is used in combination with NUM_PARTS and specifies which of the parts the current program should run. This value must be greater than zero and less than or equal to NUM_PARTS.
  The default value is 1.
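
Several of the constraints stated above (the allowed PROB_OUTPUT values, NUM_THREADS versus the number of cores, BURNIN_ITERATIONS versus TOTAL_ITERATIONS, PART_TO_RUN versus NUM_PARTS) can be checked before a run. The following Python sketch does so on a plain dictionary. It is a hypothetical helper, not part of BM-Map, and it deliberately does not parse "InputFile.txt", since the exact file syntax is not described here. The default values are copied from the lists above; the lower bound of 0 for BURNIN_ITERATIONS is an assumption.

```python
import os

# Default values as documented in the parameter lists above.
DEFAULTS = {
    "MAP_FILENAME": "test.sam",
    "LOG_FILENAME": "BM_Map.log",
    "MAX_NUM_MISMATCHES_FOR_TOP_HITS": 2,
    "EXTRA_NUM_MISMATCHES_FOR_ADDITIONAL_HITS": 2,
    "SEQ_FLANKING_VALUE": 0,
    "PROB_OUTPUT": "BOTH",
    "STRANDNESS_OPTION": "OFF",
    "NUM_THREADS": 3,
    "POLYMORPHISM_RATE": 0.001,
    "TOTAL_ITERATIONS": 1000,
    "BURNIN_ITERATIONS": 200,
    "NUM_PARTS": 1,
    "PART_TO_RUN": 1,
}


def check_parameters(params, logical_cores=None):
    """Return a list of problems; an empty list means no conflict was found."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    problems = []
    if params["PROB_OUTPUT"] not in ("PROP", "BAYES", "BOTH"):
        problems.append("PROB_OUTPUT must be PROP, BAYES, or BOTH (case sensitive)")
    if params["STRANDNESS_OPTION"] not in ("ON", "OFF"):
        problems.append("STRANDNESS_OPTION must be ON or OFF")
    if not 1 <= params["NUM_THREADS"] < logical_cores:
        problems.append("NUM_THREADS must be >= 1 and less than the number of logical cores")
    # Assumption: zero burn-in iterations is treated as allowed here.
    if not 0 <= params["BURNIN_ITERATIONS"] < params["TOTAL_ITERATIONS"]:
        problems.append("BURNIN_ITERATIONS must be less than TOTAL_ITERATIONS")
    if not 1 <= params["PART_TO_RUN"] <= params["NUM_PARTS"]:
        problems.append("PART_TO_RUN must be between 1 and NUM_PARTS")
    return problems


print(check_parameters(DEFAULTS, logical_cores=8))  # []
bad = dict(DEFAULTS, BURNIN_ITERATIONS=1000, PART_TO_RUN=2)
for problem in check_parameters(bad, logical_cores=8):
    print(problem)
```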

Example input file

The BM-Map package comes with an example input file (test.sam) that is ready to run. We recommend trying the test file first to get an overall picture of how BM-Map operates, such as what output to expect with different parameter values and how long the program takes to run. Typically, the execution time is about 2 seconds for the test file with the default settings.

Run BM-Map from command line

Open a Command Prompt window in Windows (or a terminal in Linux/Mac) and cd to the directory where the BM-Map executable and "InputFile.txt" are located.
On the command line (or terminal), for the Mac/Linux version, type:
./BM_Map.exe InputFile.txt
or, for the Windows version, type:
.\BM_Map.exe .\InputFile.txt
Then press "return".
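
For scripted runs, the same invocation can be wrapped in Python. This is a minimal sketch, assuming the executable is named BM_Map.exe as in the commands above and sits in the same directory as the parameter file; the function name, the `dry_run` flag, and the example directory are hypothetical.

```python
import os
import subprocess


def run_bm_map(install_dir, input_file="InputFile.txt", dry_run=False):
    """Run BM-Map from install_dir, mirroring the manual steps above.

    With dry_run=True the command is returned instead of executed, which
    is handy for checking paths before a long job.
    """
    exe = os.path.join(install_dir, "BM_Map.exe")  # name taken from the commands above
    cmd = [exe, input_file]
    if dry_run:
        return cmd
    # cwd=install_dir plays the role of the `cd` step, so the relative
    # parameter file name resolves inside that directory.
    return subprocess.run(cmd, cwd=install_dir, check=True)


# Hypothetical install location; nothing is executed in a dry run.
print(run_bm_map("bm_map_dir", dry_run=True))
```

`check=True` makes Python raise an exception if the program exits with a non-zero status, so a failed run does not go unnoticed in a batch script.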

Output

The output format of BM-Map 2.0.1 is a SAM-like format, named sam+. It retains all the required fields and information in the original input SAM file, and the program appends the newly-calculated probabilities to the end of each line. Note that if the read is unmappable or not qualified to pass the standards defined by the program parameters, 'NA' is given to the probability value(s). The probabilities for the unique reads, because of the certainty of their mapping, are always ONE.
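
Reading the appended probabilities back can be sketched as follows. The exact sam+ column layout is not specified here, so this Python example rests on assumptions: fields are tab-separated as in SAM, the probabilities are the last columns of each alignment line, and the caller knows how many there are (for instance two when PROB_OUTPUT is BOTH). The function name and the sample line are hypothetical.

```python
def parse_sam_plus_line(line, num_prob_columns):
    """Split one sam+ line into its SAM fields and appended probabilities.

    Returns None for SAM header lines (starting with '@'). 'NA' values,
    which mark unmappable or filtered reads, become None.
    """
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    sam_fields = fields[:-num_prob_columns]
    probabilities = [
        None if value == "NA" else float(value)
        for value in fields[-num_prob_columns:]
    ]
    return sam_fields, probabilities


# Hypothetical line: 11 mandatory SAM fields followed by two probabilities.
example = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\t0.25\t0.9\n"
sam_fields, probabilities = parse_sam_plus_line(example, num_prob_columns=2)
print(len(sam_fields), probabilities)  # 11 [0.25, 0.9]
```

If the real files place the probabilities differently (for example as tagged optional fields), only the slicing in this function needs to change.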

In addition to the sam+ file, BM-Map also produces a log file (the default name is "BM_Map.log"), which records the output shown in the command-line window so that users can review it later.