Department of Bioinformatics and Computational Biology



For analyzing proteomic data, we currently use

An Example Analysis Using Cromwell, Coombes et al 2005

To show how to use Cromwell, one of our current analysis packages, we’ve created an example using serum quality control (QC) data derived from the Pusztai et al 2004 dataset. The Cromwell package is described in Coombes et al, Proteomics 2005.

This example is not completely self-contained: it requires MATLAB and the Rice Wavelet Toolbox.

At this point, the MATLAB scripts we have provided work on text files containing only the sample data from the XML files. We have provided a simple Perl script to strip the required data from the XML files. Both the Perl script and the processing scripts assume a fairly specific directory structure, so the hardcoded names must be changed if you adapt the scripts to a different layout.

Files available for download:

Quality Control Study for Proteomics of Nipple Aspirate Fluid

The development of the Cromwell package in Coombes et al, Proteomics 2005 used a set of 24 SELDI spectra that were collected from a pooled (quality control) sample of nipple aspirate fluid from breast cancer patients and healthy controls.

In the paper by Coombes et al., we described the Cromwell package that we use to process mass spectrometry data. In that paper, we illustrated the method using a set of 24 quality control spectra from a breast cancer study. These raw spectra and additional scripts used in the processing are available here:

More details on the breast cancer study from which these QC spectra were derived can be found in:

Significant differences in nipple aspirate fluid protein expression between healthy women and those with breast cancer demonstrated by time-of-flight mass spectrometry.
Pawlik TM, Fritsche H, Coombes KR, Xiao L, Krishnamurthy S, Hunt KK, Pusztai L, Chen JN, Clarke CH, Arun B, Hung MC, Kuerer HM.
Breast Cancer Res Treat. 2005 Jan;89(2):149-57.

Association between ductal fluid proteomic expression profiles and the presence of lymph node metastases in women with breast cancer.
Kuerer HM, Coombes KR, Chen JN, Xiao L, Clarke C, Fritsche H, Krishnamurthy S, Marcy S, Hung MC, Hunt KK.
Surgery. 2004 Nov;136(5):1061-9.

Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization.
Coombes KR, Fritsche HA Jr, Clarke C, Chen JN, Baggerly KA, Morris JS, Xiao LC, Hung MC, Kuerer HM.
Clin Chem. 2003 Oct;49(10):1615-23.

Simulated Proteomics Spectra for Method Development and Comparison, Morris et al.

In our paper on using the mean spectrum for peak finding and quantification, we simulated hundreds of proteomics data sets. We used the simulated data to compare the results of two different processing algorithms. The data sets are available here so other people can compare their algorithms to ours on a standard data set where the truth is known about what peaks are in each spectrum.
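The idea of finding peaks on the mean spectrum can be illustrated with a very small sketch. This is not the algorithm from the paper and not part of Cromwell: it simply averages spectra point by point and reports local maxima. It assumes that all spectra share the same mass grid, and the function names and the `min_height` threshold are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def mean_spectrum(spectra: Sequence[Sequence[float]]) -> List[float]:
    """Average several intensity vectors point by point.

    Assumption: every spectrum was recorded on the same mass grid,
    so position i refers to the same mass in each spectrum.
    """
    if not spectra:
        raise ValueError("need at least one spectrum")
    n = len(spectra[0])
    if any(len(s) != n for s in spectra):
        raise ValueError("all spectra must share the same mass grid")
    return [sum(s[i] for s in spectra) / len(spectra) for i in range(n)]


def local_maxima(values: Sequence[float], min_height: float = 0.0) -> List[int]:
    """Return indices of interior points that rise above the left neighbour,
    are not below the right neighbour, and exceed min_height."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1]
        and values[i] >= values[i + 1]
        and values[i] > min_height
    ]
```

A peak that is weak in any single spectrum but present in many of them can stand out more clearly after averaging, which is the motivation for working on the mean. A real analysis would also need denoising and baseline correction, which this sketch omits.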


Each zip file contains 25 data sets. Each data set unzips into its own directory, Dataset_X, where X is a number from 1 to 100.

Each data set directory contains two subdirectories: “RawSpectra” and “truePeaks”.

The “RawSpectra” subdirectory contains 100 text files. Each text file represents a single spectrum with two columns of data, one for the mass and one for the intensity.
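A two-column spectrum file of this kind could be read with a few lines of Python. This is a minimal sketch, not one of the provided scripts. The column delimiter is not stated above, so the parser assumes commas and/or whitespace and skips any line that does not begin with two numbers (for example, a header). The function names are hypothetical.

```python
import re
from typing import List, Tuple


def parse_two_column(text: str) -> Tuple[List[float], List[float]]:
    """Parse (mass, intensity) pairs from the text of a spectrum file.

    Assumption: columns are separated by commas and/or whitespace.
    Lines without two leading numeric fields are ignored.
    """
    masses: List[float] = []
    intensities: List[float] = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue
        try:
            mass, intensity = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # non-numeric line, e.g. a header row
        masses.append(mass)
        intensities.append(intensity)
    return masses, intensities


def load_spectrum(path: str) -> Tuple[List[float], List[float]]:
    """Read one spectrum file from disk and parse it."""
    with open(path) as handle:
        return parse_two_column(handle.read())
```

The same parser would also work for the two-column files in the “truePeaks” subdirectory, since they follow the same mass-first layout.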

The “truePeaks” subdirectory also contains 100 text files, representing the list of true peaks in the data. The truth is also given in two columns, with the first column containing the mass and the second column containing the number of ions of that mass in the simulation.
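Because the true peak masses are known, a peak list produced by any algorithm can be scored against them. The sketch below shows one simple way to do this. It is an illustration only and not the comparison procedure used in our paper: the greedy matching rule, the function name `score_peaks`, and the default relative mass tolerance of 0.3% are all assumptions.

```python
from typing import List, Tuple


def score_peaks(
    found: List[float], truth: List[float], rel_tol: float = 0.003
) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    Each found mass is matched to the nearest still-unmatched true mass
    that lies within rel_tol * true_mass; every true peak can be matched
    at most once. rel_tol is a hypothetical default, not a published value.
    """
    unmatched = sorted(truth)
    true_positives = 0
    for mass in sorted(found):
        best = None
        for idx, true_mass in enumerate(unmatched):
            distance = abs(mass - true_mass)
            if distance <= rel_tol * true_mass:
                if best is None or distance < abs(mass - unmatched[best]):
                    best = idx
        if best is not None:
            unmatched.pop(best)
            true_positives += 1
    false_positives = len(found) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives
```

The ion counts in the second column of the truth files could be used to restrict the comparison to peaks that are large enough to be detectable, a choice this sketch leaves to the user.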

Finally, the Dataset_X directory contains a file called “true_peaks.txt”. This is a comma-separated values (CSV) UNIX text file with 4 columns, containing a description of the virtual population from which the 100 virtual spectra were generated. There is one row for each peak, which is described by its mass, its prevalence (the probability that it appears in an individual spectrum), its mean log intensity, and the standard deviation of the log intensity.
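To make the four columns concrete, the sketch below parses such a file and draws the peaks for one virtual sample. It is an illustration under stated assumptions, not the simulation code used to generate the data: it assumes the column order given above, skips any non-numeric header row, and assumes the log intensity is normally distributed. The names `PopulationPeak`, `parse_population`, and `draw_peaks` are hypothetical.

```python
import csv
import io
import random
from typing import List, NamedTuple, Tuple


class PopulationPeak(NamedTuple):
    mass: float
    prevalence: float          # probability the peak appears in a spectrum
    mean_log_intensity: float
    sd_log_intensity: float


def parse_population(text: str) -> List[PopulationPeak]:
    """Parse the text of a true_peaks.txt file (4 comma-separated columns)."""
    peaks: List[PopulationPeak] = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4:
            continue
        try:
            values = [float(v) for v in row[:4]]
        except ValueError:
            continue  # assumed header or other non-numeric row
        peaks.append(PopulationPeak(*values))
    return peaks


def draw_peaks(
    population: List[PopulationPeak], rng: random.Random
) -> List[Tuple[float, float]]:
    """Draw (mass, log_intensity) pairs for one virtual sample.

    Assumption: each peak is present independently with probability
    `prevalence`, and its log intensity is normally distributed.
    """
    drawn: List[Tuple[float, float]] = []
    for peak in population:
        if rng.random() < peak.prevalence:
            log_intensity = rng.gauss(
                peak.mean_log_intensity, peak.sd_log_intensity
            )
            drawn.append((peak.mass, log_intensity))
    return drawn
```

Passing an explicit `random.Random` instance with a fixed seed keeps the draws reproducible.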

Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum
Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R.
Bioinformatics. 2005; 21:1764-75.