Simulated Proteomics Spectra

In our paper on using the mean spectrum for peak finding and quantification, we simulated hundreds of proteomics data sets and used them to compare the results of two different processing algorithms. The data sets are available here so that others can compare their algorithms with ours on a standard data set in which the true peaks in each spectrum are known.

Each zip file contains 25 data sets. Each data set unzips into its own directory, Dataset_X, where X is a number from 1 to 100.

Each data set directory contains two subdirectories: RawSpectra and truePeaks.

The RawSpectra subdirectory contains 100 text files. Each text file represents a single spectrum with two columns of data, one for the mass and one for the intensity.
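As a minimal sketch of how one of these spectrum files might be loaded, the Python below reads two numeric columns into lists. The column delimiter and the presence of a header line are not specified here, so the parser is an assumption: it accepts commas or whitespace and skips any line that does not start with two numbers. The function names parse_two_columns and read_spectrum are hypothetical.

```python
import re
from pathlib import Path


def parse_two_columns(text):
    """Parse two numeric columns from text and return them as two lists of floats.

    Assumption: columns are separated by commas and/or whitespace.
    Lines that do not begin with two numbers (for example a header) are skipped.
    """
    first, second = [], []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            a, b = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # header or malformed line
        first.append(a)
        second.append(b)
    return first, second


def read_spectrum(path):
    """Read one spectrum file and return (masses, intensities)."""
    return parse_two_columns(Path(path).read_text())
```

The same parser should also work for two-column files with other contents, provided they use one of the assumed delimiters.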

The truePeaks subdirectory also contains 100 text files, each listing the true peaks in one spectrum. The truth is also given in two columns: the first contains the mass, and the second contains the number of ions of that mass in the simulation.
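One possible way to score a peak-finding algorithm against these truth files is to count how many true masses have a detected peak nearby. The sketch below is only an illustration: the function name match_peaks and the relative mass tolerance rel_tol are hypothetical choices, not the matching criterion used in the paper.

```python
import bisect


def match_peaks(true_masses, found_masses, rel_tol=0.003):
    """Count true peaks that have at least one found peak within rel_tol * mass.

    rel_tol is a hypothetical relative mass tolerance chosen for illustration.
    """
    found = sorted(found_masses)
    matched = 0
    for m in true_masses:
        window = rel_tol * m
        # Locate the first found mass that is not below the lower edge of the window.
        i = bisect.bisect_left(found, m - window)
        if i < len(found) and found[i] <= m + window:
            matched += 1
    return matched
```

Dividing the returned count by the number of true peaks gives the fraction of true peaks recovered in that spectrum under the assumed tolerance.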

Finally, the Dataset_X directory contains a file called true_peaks.txt. This is a comma-separated-values UNIX text file with four columns that describes the virtual population from which the 100 virtual spectra were generated. There is one row for each peak, which is described by its mass, its prevalence (the probability that it appears in an individual spectrum), its mean log intensity, and the standard deviation of its log intensity.
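The population file could be read along the following lines. This is a sketch under two assumptions: that the four columns appear in the order in which they are described above, and that any header row can be recognized because it is not numeric. The names PopulationPeak and parse_population_peaks are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class PopulationPeak(NamedTuple):
    mass: float
    prevalence: float          # probability that the peak appears in a spectrum
    mean_log_intensity: float
    sd_log_intensity: float


def parse_population_peaks(text):
    """Parse comma-separated text into a list of PopulationPeak records.

    Assumption: columns are in the order mass, prevalence, mean log
    intensity, standard deviation of log intensity. Rows whose first four
    fields are not all numeric (for example a header) are skipped.
    """
    peaks = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4:
            continue
        try:
            values = [float(v) for v in row[:4]]
        except ValueError:
            continue
        peaks.append(PopulationPeak(*values))
    return peaks
```

For a file on disk, the text can be obtained with open(path, newline="").read(). Because prevalence is a per-spectrum probability, 100 times the prevalence gives the expected number of the 100 spectra in which a peak appears.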


Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics. 2005; 21:1764-75.