In our paper on using the mean spectrum for peak finding and quantification, we simulated hundreds of proteomics data sets and used them to compare the results of two different processing algorithms. The data sets are available here so that others can compare their algorithms to ours on a standard data set in which the true peaks in each spectrum are known.
Each zip file contains 25 data sets. Each data set unzips into its own directory, Dataset_X, where X is a number from 1 to 100.
Each data set directory contains two subdirectories: RawSpectra and truePeaks.
The RawSpectra subdirectory contains 100 text files. Each text file represents a single spectrum with two columns of data, one for the mass and one for the intensity.
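As a rough illustration of reading one of these two-column files, here is a minimal Python sketch. The function names parse_two_column and load_spectrum are hypothetical, and the parsing rules are assumptions: the description above does not state the column delimiter or whether a header row is present, so the sketch accepts whitespace or commas and skips any line that does not hold two numbers.

```python
import re
from pathlib import Path


def parse_two_column(text):
    """Parse two numeric columns (e.g. mass, intensity) from file text.

    Assumptions (not confirmed by the data description): columns are
    separated by whitespace and/or commas, and any line that does not
    start with two numbers (such as a header row) can be skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            rows.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric line, most likely a header
    return rows


def load_spectrum(path):
    """Read one spectrum file and return a list of (mass, intensity) pairs."""
    return parse_two_column(Path(path).read_text())
```

Under the same assumptions, the function could be applied to every file found by iterating over Path("Dataset_1/RawSpectra").iterdir().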
The truePeaks subdirectory also contains 100 text files, representing the list of true peaks in the data. The truth is also given in two columns, with the first column containing the mass and the second column containing the number of ions of that mass in the simulation.
Finally, the Dataset_X directory contains a file called true_peaks.txt. This is a comma-separated values (CSV) UNIX text file with 4 columns, containing a description of the virtual population from which the 100 virtual spectra were generated. There is one row for each peak, which is described by its mass, its prevalence (the probability that it appears in an individual spectrum), its mean log intensity, and the standard deviation of the log intensity.
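To make the population description concrete, the following Python sketch parses the four columns and draws the set of peaks for one virtual spectrum. It is an illustration under stated assumptions, not the simulation code used in the paper. The names PopulationPeak, parse_population, and draw_peaks are hypothetical. The sketch assumes the columns appear in the order listed above, that each peak is present independently with probability equal to its prevalence, and that the log intensity is a natural logarithm drawn from a normal distribution.

```python
import csv
import io
import math
import random
from typing import NamedTuple


class PopulationPeak(NamedTuple):
    mass: float
    prevalence: float  # probability the peak appears in a spectrum
    mean_log: float    # mean of the log intensity
    sd_log: float      # standard deviation of the log intensity


def parse_population(text):
    """Parse 4-column CSV text into PopulationPeak records.

    Assumption: columns are mass, prevalence, mean log intensity, SD of
    log intensity, in that order. Rows that are not fully numeric
    (e.g. a header row) are skipped.
    """
    peaks = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4:
            continue
        try:
            peaks.append(PopulationPeak(*(float(v) for v in row[:4])))
        except ValueError:
            continue
    return peaks


def draw_peaks(peaks, rng):
    """Draw (mass, intensity) pairs for one virtual spectrum.

    Assumptions: each peak is present independently with probability
    `prevalence`, and its natural-log intensity is normally distributed.
    """
    drawn = []
    for p in peaks:
        if rng.random() < p.prevalence:
            log_intensity = rng.gauss(p.mean_log, p.sd_log)
            drawn.append((p.mass, math.exp(log_intensity)))
    return drawn
```

Passing a seeded generator such as random.Random(0) as rng makes the draws reproducible. If the log intensities in the file use a different base, math.exp would need to be replaced accordingly.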