Department of Bioinformatics and Computational Biology

Cromwell:Overview

From MD Anderson Bioinformatics
Jump to: navigation, search

Cromwell: The New Model Processor for Mass Spectrometry Data

Cromwell
Overview
Description Cromwell is a set of Matlab scripts that performs low-level processing of mass spectrometry proteomics data.
Development Information
Language Matlab
Current Version 1.0
Platforms Platforms supporting Matlab
Status Inactive
Last Updated 2004-03-26
References
Citations Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM. (2005).
Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform.
Proteomics, 5(16), 4107-4117.
doi:10.1002/pmic.200401261
Help and Support
Contact Kevin R. Coombes


Cromwell is an implementation of our algorithms for low-level processing of mass spectrometry proteomics data. Cromwell represents our third attempt (at least) to find a method to process spectra that both works quickly and achieves reasonably good results. The name is thus derived from an historical pun: just as Oliver Cromwell became ruler of England by fashioning and leading the New Model Army, we intend to lead the way toward better proteomics data processing with our "new model processor".

Contents


Packages

The download package contains the following matlab scripts:

basecorr2.m
Wrapper function to perform wavelet denoising, baseline correction, and peak finding on a collection of spectra.
collateSpectra.m
This function collects an entire set of spectra into a single matrix.
massTrans.m
Transform an input vector or matrix X (thought of as one or more mass spectra collected at equal time points) to an output vector MX equally spaced on the mass scale.
monotoneMinimum.m
This function estimates the baseline of a smooth spectrum as a monotone minimum.
pad.m
This function pads an input vector representing a spectrum to get its length to equal the next larger multiple of 2^L by reflecting the end of the spectrum.
peakSet.m
This function coalesces (aligns) peaks found in multiple spectra to determine which ones represent the same molecular stuff.
signal2noise.m
This function computes the signal to noise ratio
trivialPeakFinder.m
This function Locates local maxima from the intensity levels of a wavelet smoothed and baseline corrected spectrum.
truncateName.m
This function takes as its input a string representing a file name. It strips off the extension (everything after the final '.') to create a a variable name.
waveletSmoothAndBaselineCorrect.m
This function computes a denoised spectrum using a wavelet transform, and a baseline corrected denoised spectrum using a monotone minimum from a raw spectrum.

Cromwell Thresholding

Cromwell uses the undecimated discrete wavelet transform to denoise spectra; a critical parameter is the threshold to use for denoising. We have developed graphical MATLAB tools to aid users in making this decision. This is available as a separate downloadable package.

Download

The actual code for Cromwell is a set of MATLAB files packaged in a zip file.

System Requirements

Cromwell is not self-contained. Since it is implemented as a set of MATLAB scripts, it requires access to a commercial software package available from The MathWorks.

Cromwell also uses the undecimated discrete wavelet transform to denoise spectra, as implemented in the Rice Wavelet Toolbox.

Installation

Please set the Matlab path to include the directory containing the scripts. You can find instructions on how to do so here.

Documentation

Documentation regarding the functions can be obtained using the help commmand. We have included the documentation comments here for your convenience.

The functions are arranged in alphabetical order. Each function corresponds to the eponymous Matlab .m script file in the download package.

basecorr2 function

function bc = basecorr2(raw, rawMass, M, L, thld)
%
% bc = basecorr2(raw, rawMass, M, L, thld)
% wrapper to perform wavelet denoising, baseline correction, and
% peak finding on a collection of spectra.
%
% inputs are:
%	raw		% matrix of intensities
%	rawMass	% vector of masses
%	M			% number of data points to ignore at beginning
%	L			% wavelet level
%	thld		% multiple of MAD at which to set the threshold
% output is:
%	bc			% matrix of baseline corrected intensities
%
% side effects:
% creates the collowing binary MATLAB (*.mat) files in the current directory
%	baselineCorrected	% the matrix bc
%	noises				% matrix of noise estimates
%	normalized			% matrix of normalized baselin corrected intensities
%	meanBase				% vector containing mean baseline corrected spectrum
%	meanNorm				% vector containing mean normalized spectrum
%	peakLocations		% matrix of peaks (uses trivialPeakFinder)
%	numpeaks				% vector of the number of peaks pe spectrum

collateSpectra function

function [spec, mass] = collateSpectra(dirName, rawOrCorrected, oneOrMany, skipLines)
% collateSpectra(DIRECTORY, RAWORCORRECTED, ONEORMANY, SKIPLINES)
%
% This function collects an entire set of spectra into a single matrix.
% The DIRECTORY argument pointws the function to the top-level direcctory
% containing the data. The argument RAWORCORRECTED determines whether we
% are collating the raw spectra or the baseline-corrected and normalized
% spectra. If RAWORCORRECTED takes the value "raw", then the data is found
% in the subdirectory "RawSpectra" and is described by the file "RawFiles.txt";
% the format of this file is defined in the online help for batchProcessRawFiles.
% If RAWORCORRECTED taskes on the value "corrected", then the data is found in
% the subdirectory "CorrectedSpectra" and is described by the file named
% "CorrectedFiles.txt"; this has the same format as the corresponding raw file.
% The argument ONEORMANY, which can only take on the values "one" or "many",
% tells us if the data files contiaing a single spectrum or multiple spectra.
% Finally, the SKIPLINES argument tells us how many header lines there are at
% the top of each data file.
%
% The function returns two items
%	SPEC = a matrix, with one row per spectrum and one column for each time point.
%	MASS = a vector with (some estimate) of the common masses.
%
% Note: this routine does not do any fine tuning of the alignment across spectra.
% It implicitly assumesd all the masses are the same, and that the time points in
% different spectra coincide with the same masses. Further, everything is truncated
% to the length of the shortest spectrum in the set.

massTrans function

function mx = massTrans(X, mass, resolution)
%
% MX = massTrans(X, MASS, RESOLUTION)
% Transform an input vector or matrix X (thought of as one or more mass
% spectra collected at equal time points) to an output vector MX equally
% spaced on the mass scale. This uses the calibration vector MASS to give
% the mapping from times to masses and the scalar RESOLUTION to determine
% the spacing between mass measurements. The main application is to allow
% the plotting of images or heat maps on the mass axis instead of the time
% axis. For example,
%	mx = massTrans(X, mass, 1);
%	imagesc(mx)


monotoneMinimim function

function base = monotoneMinimim(slider, saturate)
% baseline = monotoneMinimum(smoothSpectrum, saturationEnd)
%
% This function estimates the baseline of a smooth spectrum as a monotone
% minimum. It starts at the maximum level reached during the saturation
% period. Everything before this maximum is reached is removed from the
% spectrum. Everything after this is computed as a decreasing function
% that always takes on the minimum value seen so far. The baseline curve
% is returned.

pad function

function w = pad(spectrum, L)
%
% W = pad(spectrum, L)
%
% Pad an input vector representing a spectrum to get its length to equal
% the next larger multiple of 2^L by reflecting the end of the spectrum.

peakSet function

function [x, y] = peakSet(peaksByMass, minimumSN, secondSN, tickSeparation, massSeparation)
%
% [X, Y] = peakSet(peaksByMass, minSN, secondSN, massSeparation)
% Coalesces (aligns) peaks found in multiple spectra to determine which ones 
% represent the same molecular stuff.
%
% INPUTS:
%	peaksByMass is a matrix with one row for each peak and five columns
%		that represent
%			spectrum ID
%			peak location in clock ticks
%			peak location in mass units
%			signal-to-noise ratio of the peak
%			normalized baseline-corrected intensity of the peak
%	minimumSN = signal-to-noise ratio required for a single peak to count; usually 10
%	secondSN  = signal-to-noise needed to be included on second pass; usually 2
%	tickSeparation = minimum distance between resolvable peaks, in clock
%		ticks, usually 7
%	massSeparation = minimum distance between resolvable peaks as a relative
%		fraction of mass; usually 0.003.
%
% OUTPUTS:
%	X = peak class information, a matrix with one row per peak class and
%	thirteen columns, representing
%		unique class ID
%		min tick
%		max tick
%		min mass
%		max mass
%		number of unique spectra having this peak
%		number of peaks collected in this class
%		min S/N
%		max S/N
%		mean S/N
%		median tick
%		mean of normalized intensity
%		standard deviation of normalized intensity
%	Y = mapping from peaks to classes, a matrix with one row per peak and
%	five columns representing
%		peak class ID (as in X)
%		spectrum ID
%		tick location
%		mass location
%		s/N


signal2noise function

function sn = signal2noise(raw, base, smooth)
% Help not available, but function is simple

trivialPeakFinder function

function [Peak, leftMin, rightMin] = trivialPeakFinder(w)
% [PEAKS, LEFTMIN, RIGHTMIN] = mixedup(spectrum)
%
% Input is the intensity levels of a spectrum. We assume the spectrum
% has already been wavelet smoothed and baseline corrected. We identify
% peaks as local maxima.
%
% The return values are:
%	PEAKS: a list of peak locations, in ticks
%	LEFTMIN:	a list of the closest minimum to the left of each peak
%	RIGHTMIN: a list of the closest minimum to the right of each peak

truncateName function

function dataname = truncateName(filename)
% dataname = truncateName(filename)
%
% This function takes as its input a string representing a file name.
% It strips off the extension (everything after the final '.') to
% create a a variable name. In order to ensure that the variable name
% is valid in MATLAB, it replaces everything that is not a normal
% letter or digit with an underscore.

waveletSmoothAndBaselineCorrect function

function [baselineCorrected, smoothSpec] = waveletSmoothAndBaselineCorrect(spectrum, t, saturate, L)
% [B S] = waveletSmoothAndBaseleinCorrect(spectrum, threshold, saturation, L)
% computes a denoised spectrum using a wavelet transform, and a baseline corrected
% denoised spectrum using a monotone minimum from a raw spectrum. The wavelet 
% transform is performed using code supplied as part of the Rice Wavelet Toolbox
% (http://www-dsp.rice.edu/software/rwt.shtml). The wavelet transform will only
% work on input vectors whose length is a multiple of a power of 2. To get the
% lengths right, we reverse the end of the spectrum and pad to the next multiple,
% and then truncate the processed spectrum to the original length.
%
% Inputs:
%	spectrum = vector of intensities, assumed to be collected at consecutive time points
%	threshold = multiple of 0.67*MAD at which to threshold the wavelet coefficients
%		we typically use thresholds in the range of 4-10.
%	saturation = index at which saturation ends and baseline correction should start
%	L = power of 2 to use to compute wavelets (default value is L = 10, which says
%		to use blocks of size 1024)
% Outputs:
%	B = vector of denoised and baseline corrected intensities
%	S = vector of denoised intensities
%
% Example:
%	[B S] = waveletSmoothAndBaselineCorrect(spectrum, 6, 2500, 10);
%	noise = spectrum - S;
%	plot(spectrum)
%	hold on
%	plot(S, 'r')

License

Cromwell is copyrighted by the MD Anderson Cancer Center; the licensing details are straightforward, considering that Cromwell is free software from an academic institution.


This is the "cromwell" mass spec analysis package.
Released - <1 March 2004>

CONDITIONS FOR USE:
Copyright (c) 2003, 2004 UT MD Anderson Cancer Center. All rights reserved.

This software is distributed and licensed to you on a non-exclusive 
basis, free-of-charge. Redistribution and use in source and binary forms, 
with or without modification, are permitted provided that the following 
conditions are met:

1. Redistribution of source code must retain the above copyright notice, 
   this list of conditions and the following disclaimer.
2. Redistribution in binary form must reproduce the above copyright notice, 
   this list of conditions and the following disclaimer in the 
   documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software 
   must display the following acknowledgment: This product includes 
   software developed by the University of Texas MD Anderson Cancer
   Center, Houston, Texas and its contributors.
4. Neither the name of the University nor the names of its contributors 
   may be used to endorse or promote products derived from this software 
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY UNIVERSITY OF TEXAS MD ANDERSON CANCER
CENTER, HOUSTON, TEXAS, AND CONTRIBUTORS AS IS AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL MD ANDERSON OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTIONS) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE), PRODUCT LIABILITY, OR OTHERWISE ARISING IN ANY WAY OUT
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.

Reference

The algorithms and performance of Cromwell are describe in our technical report
Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM.
Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. UTMDABTR-001-04.

Future Plans

We are continuing to develop Cromwell. One of the tools we are using is a mass spectrometry simulation engine (version 2.1) implemented in the S-Plus statistical software package from Insightful Corp. Details of the model underlying the simulation and preliminary applications are described in another technical report: Coombes KR, Koomen JM, Baggerly KA, Morris JS, Kobayashi R. Understanding the characteristics of mass spectrometry data through the use of simulation. UTMDABTR-002-04.

The simnulation engine has been used extensively in the publication by Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics. 2005;21:1764-75. The data sets that were simulated for this publication are available from our web site.