From MD Anderson Bioinformatics

Cromwell | ||

Overview | ||
---|---|---|

Description | Cromwell is a set of Matlab scripts that performs low-level processing of mass spectrometry proteomics data. | |

Development Information | ||

Language | Matlab | |

Current Version | 1.0 | |

Platforms | Platforms supporting Matlab | |

Status | Inactive | |

Last Updated | 2004-03-26 | |

References | ||

Citations | Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM. (2005).Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform.Proteomics, 5(16), 4107-4117.doi:10.1002/pmic.200401261 | |

Help and Support | ||

Contact | Kevin R. Coombes |

Cromwell is an implementation of our algorithms for low-level processing of mass spectrometry proteomics data. Cromwell represents our third attempt (at least) to find a method to process spectra that both works quickly and achieves reasonably good results. The name is thus derived from an historical pun: just as Oliver Cromwell became ruler of England by fashioning and leading the New Model Army, we intend to lead the way toward better proteomics data processing with our "new model processor".

## Contents |

The download package contains the following matlab scripts:

- basecorr2.m
- Wrapper function to perform wavelet denoising, baseline correction, and peak finding on a collection of spectra.
- collateSpectra.m
- This function collects an entire set of spectra into a single matrix.
- massTrans.m
- Transform an input vector or matrix X (thought of as one or more mass spectra collected at equal time points) to an output vector MX equally spaced on the mass scale.
- monotoneMinimum.m
- This function estimates the baseline of a smooth spectrum as a monotone minimum.
- pad.m
- This function pads an input vector representing a spectrum to get its length to equal the next larger multiple of 2^L by reflecting the end of the spectrum.
- peakSet.m
- This function coalesces (aligns) peaks found in multiple spectra to determine which ones represent the same molecular stuff.
- signal2noise.m
- This function computes the signal to noise ratio
- trivialPeakFinder.m
- This function Locates local maxima from the intensity levels of a wavelet smoothed and baseline corrected spectrum.
- truncateName.m
- This function takes as its input a string representing a file name. It strips off the extension (everything after the final '.') to create a a variable name.
- waveletSmoothAndBaselineCorrect.m
- This function computes a denoised spectrum using a wavelet transform, and a baseline corrected denoised spectrum using a monotone minimum from a raw spectrum.

Cromwell uses the undecimated discrete wavelet transform to denoise spectra; a critical parameter is the threshold to use for denoising. We have developed graphical MATLAB tools to aid users in making this decision. This is available as a separate downloadable package.

The actual code for Cromwell is a set of MATLAB files packaged in a zip file.

Cromwell is not self-contained. Since it is implemented as a set of MATLAB scripts, it requires access to a commercial software package available from The MathWorks.

Cromwell also uses the undecimated discrete wavelet transform to denoise spectra, as implemented in the Rice Wavelet Toolbox.

Please set the Matlab path to include the directory containing the scripts. You can find instructions on how to do so here.

Documentation regarding the functions can be obtained using the help commmand. We have included the documentation comments here for your convenience.

The functions are arranged in alphabetical order. Each function corresponds to the eponymous Matlab .m script file in the download package.

function bc = basecorr2(raw, rawMass, M, L, thld) % % bc = basecorr2(raw, rawMass, M, L, thld) % wrapper to perform wavelet denoising, baseline correction, and % peak finding on a collection of spectra. % % inputs are: % raw % matrix of intensities % rawMass % vector of masses % M % number of data points to ignore at beginning % L % wavelet level % thld % multiple of MAD at which to set the threshold % output is: % bc % matrix of baseline corrected intensities % % side effects: % creates the collowing binary MATLAB (*.mat) files in the current directory % baselineCorrected % the matrix bc % noises % matrix of noise estimates % normalized % matrix of normalized baselin corrected intensities % meanBase % vector containing mean baseline corrected spectrum % meanNorm % vector containing mean normalized spectrum % peakLocations % matrix of peaks (uses trivialPeakFinder) % numpeaks % vector of the number of peaks pe spectrum

function [spec, mass] = collateSpectra(dirName, rawOrCorrected, oneOrMany, skipLines) % collateSpectra(DIRECTORY, RAWORCORRECTED, ONEORMANY, SKIPLINES) % % This function collects an entire set of spectra into a single matrix. % The DIRECTORY argument pointws the function to the top-level direcctory % containing the data. The argument RAWORCORRECTED determines whether we % are collating the raw spectra or the baseline-corrected and normalized % spectra. If RAWORCORRECTED takes the value "raw", then the data is found % in the subdirectory "RawSpectra" and is described by the file "RawFiles.txt"; % the format of this file is defined in the online help for batchProcessRawFiles. % If RAWORCORRECTED taskes on the value "corrected", then the data is found in % the subdirectory "CorrectedSpectra" and is described by the file named % "CorrectedFiles.txt"; this has the same format as the corresponding raw file. % The argument ONEORMANY, which can only take on the values "one" or "many", % tells us if the data files contiaing a single spectrum or multiple spectra. % Finally, the SKIPLINES argument tells us how many header lines there are at % the top of each data file. % % The function returns two items % SPEC = a matrix, with one row per spectrum and one column for each time point. % MASS = a vector with (some estimate) of the common masses. % % Note: this routine does not do any fine tuning of the alignment across spectra. % It implicitly assumesd all the masses are the same, and that the time points in % different spectra coincide with the same masses. Further, everything is truncated % to the length of the shortest spectrum in the set.

function mx = massTrans(X, mass, resolution) % % MX = massTrans(X, MASS, RESOLUTION) % Transform an input vector or matrix X (thought of as one or more mass % spectra collected at equal time points) to an output vector MX equally % spaced on the mass scale. This uses the calibration vector MASS to give % the mapping from times to masses and the scalar RESOLUTION to determine % the spacing between mass measurements. The main application is to allow % the plotting of images or heat maps on the mass axis instead of the time % axis. For example, % mx = massTrans(X, mass, 1); % imagesc(mx)

function base = monotoneMinimim(slider, saturate) % baseline = monotoneMinimum(smoothSpectrum, saturationEnd) % % This function estimates the baseline of a smooth spectrum as a monotone % minimum. It starts at the maximum level reached during the saturation % period. Everything before this maximum is reached is removed from the % spectrum. Everything after this is computed as a decreasing function % that always takes on the minimum value seen so far. The baseline curve % is returned.

function w = pad(spectrum, L) % % W = pad(spectrum, L) % % Pad an input vector representing a spectrum to get its length to equal % the next larger multiple of 2^L by reflecting the end of the spectrum.

function [x, y] = peakSet(peaksByMass, minimumSN, secondSN, tickSeparation, massSeparation) % % [X, Y] = peakSet(peaksByMass, minSN, secondSN, massSeparation) % Coalesces (aligns) peaks found in multiple spectra to determine which ones % represent the same molecular stuff. % % INPUTS: % peaksByMass is a matrix with one row for each peak and five columns % that represent % spectrum ID % peak location in clock ticks % peak location in mass units % signal-to-noise ratio of the peak % normalized baseline-corrected intensity of the peak % minimumSN = signal-to-noise ratio required for a single peak to count; usually 10 % secondSN = signal-to-noise needed to be included on second pass; usually 2 % tickSeparation = minimum distance between resolvable peaks, in clock % ticks, usually 7 % massSeparation = minimum distance between resolvable peaks as a relative % fraction of mass; usually 0.003. % % OUTPUTS: % X = peak class information, a matrix with one row per peak class and % thirteen columns, representing % unique class ID % min tick % max tick % min mass % max mass % number of unique spectra having this peak % number of peaks collected in this class % min S/N % max S/N % mean S/N % median tick % mean of normalized intensity % standard deviation of normalized intensity % Y = mapping from peaks to classes, a matrix with one row per peak and % five columns representing % peak class ID (as in X) % spectrum ID % tick location % mass location % s/N

function sn = signal2noise(raw, base, smooth) % Help not available, but function is simple

function [Peak, leftMin, rightMin] = trivialPeakFinder(w) % [PEAKS, LEFTMIN, RIGHTMIN] = mixedup(spectrum) % % Input is the intensity levels of a spectrum. We assume the spectrum % has already been wavelet smoothed and baseline corrected. We identify % peaks as local maxima. % % The return values are: % PEAKS: a list of peak locations, in ticks % LEFTMIN: a list of the closest minimum to the left of each peak % RIGHTMIN: a list of the closest minimum to the right of each peak

function dataname = truncateName(filename) % dataname = truncateName(filename) % % This function takes as its input a string representing a file name. % It strips off the extension (everything after the final '.') to % create a a variable name. In order to ensure that the variable name % is valid in MATLAB, it replaces everything that is not a normal % letter or digit with an underscore.

function [baselineCorrected, smoothSpec] = waveletSmoothAndBaselineCorrect(spectrum, t, saturate, L) % [B S] = waveletSmoothAndBaseleinCorrect(spectrum, threshold, saturation, L) % computes a denoised spectrum using a wavelet transform, and a baseline corrected % denoised spectrum using a monotone minimum from a raw spectrum. The wavelet % transform is performed using code supplied as part of the Rice Wavelet Toolbox % (http://www-dsp.rice.edu/software/rwt.shtml). The wavelet transform will only % work on input vectors whose length is a multiple of a power of 2. To get the % lengths right, we reverse the end of the spectrum and pad to the next multiple, % and then truncate the processed spectrum to the original length. % % Inputs: % spectrum = vector of intensities, assumed to be collected at consecutive time points % threshold = multiple of 0.67*MAD at which to threshold the wavelet coefficients % we typically use thresholds in the range of 4-10. % saturation = index at which saturation ends and baseline correction should start % L = power of 2 to use to compute wavelets (default value is L = 10, which says % to use blocks of size 1024) % Outputs: % B = vector of denoised and baseline corrected intensities % S = vector of denoised intensities % % Example: % [B S] = waveletSmoothAndBaselineCorrect(spectrum, 6, 2500, 10); % noise = spectrum - S; % plot(spectrum) % hold on % plot(S, 'r')

Cromwell is copyrighted by the MD Anderson Cancer Center; the licensing details are straightforward, considering that Cromwell is free software from an academic institution.

This is the "cromwell" mass spec analysis package. Released - <1 March 2004> CONDITIONS FOR USE: Copyright (c) 2003, 2004 UT MD Anderson Cancer Center. All rights reserved. This software is distributed and licensed to you on a non-exclusive basis, free-of-charge. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistribution of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistribution in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the University of Texas MD Anderson Cancer Center, Houston, Texas and its contributors. 4. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY UNIVERSITY OF TEXAS MD ANDERSON CANCER CENTER, HOUSTON, TEXAS, AND CONTRIBUTORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL MD ANDERSON OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTIONS) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), PRODUCT LIABILITY, OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

- The algorithms and performance of Cromwell are describe in our technical report

**Coombes KR, Tsavachidis S, Morris JS, Baggerly KA, Hung MC, Kuerer HM**.

*Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform*. UTMDABTR-001-04.

We are continuing to develop Cromwell. One of the tools we are using is a mass spectrometry simulation engine (version 2.1) implemented in the S-Plus statistical software package from Insightful Corp. Details of the model underlying the simulation and preliminary applications are described in another technical report: Coombes KR, Koomen JM, Baggerly KA, Morris JS, Kobayashi R. Understanding the characteristics of mass spectrometry data through the use of simulation. UTMDABTR-002-04.

The simnulation engine has been used extensively in the publication by Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics. 2005;21:1764-75. The data sets that were simulated for this publication are available from our web site.