# Simulated Proteomics Spectra

In our paper on using the mean spectrum for peak finding and
quantification (see the reference below), we simulated hundreds of
proteomics data sets and used them to compare the results of two
different processing algorithms. The data sets are available here so
that other people can compare their algorithms to ours on a standard
data set where the true peaks in each spectrum are known.

Each zip file contains 25 data sets. Each data set unzips into its
own directory, `Dataset_X`, where `X` is a number from 1 to 100.

Each data set directory contains two subdirectories: `RawSpectra`
and `truePeaks`.

The `RawSpectra` subdirectory contains 100 text files. Each text file
represents a single spectrum with two columns of data, one for the
mass and one for the intensity.
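As a rough illustration of reading these files, here is a minimal Python sketch. The column delimiter and the presence of a header line are not stated above, so the reader below assumes that fields are separated by whitespace or commas and that any non-numeric line can be skipped. The function names `read_two_column` and `read_dataset_spectra` are hypothetical and not part of the data distribution.

```python
import re
from pathlib import Path


def read_two_column(path):
    """Read a two-column numeric text file into a list of (x, y) float pairs.

    Assumption: fields are separated by whitespace and/or commas.
    Lines that do not start with two numbers (e.g. a header) are skipped.
    """
    rows = []
    for line in Path(path).read_text().splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            rows.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric line, assumed to be a header
    return rows


def read_dataset_spectra(dataset_dir):
    """Map each file name in <dataset_dir>/RawSpectra to its (mass, intensity) pairs."""
    raw_dir = Path(dataset_dir) / "RawSpectra"
    return {
        p.name: read_two_column(p)
        for p in sorted(raw_dir.iterdir())
        if p.is_file()
    }
```

Calling `read_dataset_spectra("Dataset_1")` on an unzipped data set should then return one entry per spectrum file, provided the assumptions about the file layout hold.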

The `truePeaks` subdirectory also contains 100 text files,
representing the list of true peaks in the data. The truth is also
given in two columns, with the first column containing the mass and
the second column containing the number of ions of that mass in the
simulation.
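Because the true peak masses are known, one simple way to score a peak finder is to count how many true peaks have a detected peak nearby. The sketch below is only an illustration: the matching rule (relative mass error) and the default tolerance of 0.3% are assumptions chosen for the example, not the criterion used in our paper, and `match_peaks` is a hypothetical helper.

```python
import bisect


def match_peaks(true_masses, found_masses, rel_tol=0.003):
    """Count true peaks that have at least one found peak within rel_tol.

    rel_tol is a relative mass tolerance (0.003 = 0.3%); this value is an
    illustrative assumption, not a recommended setting.
    """
    found = sorted(found_masses)
    matched = 0
    for mass in true_masses:
        low = mass * (1 - rel_tol)
        high = mass * (1 + rel_tol)
        # Index of the first found peak at or above the lower bound.
        i = bisect.bisect_left(found, low)
        if i < len(found) and found[i] <= high:
            matched += 1
    return matched
```

Dividing the returned count by the number of true peaks gives a per-spectrum sensitivity under this matching rule. A complete comparison would also need to count detected peaks that match no true peak.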

Finally, the `Dataset_X` directory contains a file called
`true_peaks.txt`. This is a comma-separated values (CSV) UNIX text
file with 4 columns, containing a description of the virtual
population from which the 100 virtual spectra were generated. There is
one row for each peak, which is described by its mass, its prevalence
(the probability that it appears in an individual spectrum), its mean
log intensity, and the standard deviation of the log intensity.
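A minimal Python sketch for loading the population description follows. It assumes the four columns appear in the order listed above and skips any row that is not fully numeric, since whether the file has a header row is not stated here. The names `PopulationPeak`, `read_population`, and `expected_peak_count` are hypothetical.

```python
import csv
from dataclasses import dataclass


@dataclass
class PopulationPeak:
    mass: float
    prevalence: float          # probability the peak appears in a spectrum
    mean_log_intensity: float
    sd_log_intensity: float


def read_population(path):
    """Read a 4-column CSV of population peaks, skipping non-numeric rows."""
    peaks = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 4:
                continue  # blank or short line
            try:
                values = [float(v) for v in row[:4]]
            except ValueError:
                continue  # assumed to be a header row
            peaks.append(PopulationPeak(*values))
    return peaks


def expected_peak_count(peaks):
    """Expected number of true peaks in one spectrum (sum of prevalences)."""
    return sum(p.prevalence for p in peaks)
```

Since each prevalence is the probability that a peak appears in an individual spectrum, the sum of the prevalences is the expected number of true peaks per spectrum. That can serve as a quick sanity check against the files in `truePeaks`.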

## Reference

Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R.
Feature extraction and quantification for mass spectrometry in
biomedical applications using the mean spectrum.
*Bioinformatics*. 2005; **21**:1764-75.