Department of Bioinformatics and Computational Biology

Public Datasets

From MD Anderson Bioinformatics

Publicly Available Data

This site has been built as a repository for selected datasets collected and analyzed by investigators at MD Anderson. We have tried to provide a reasonable amount of explanation. Certain tools used to analyze these data are also posted under Software and Services.

Please note that supplementary data sets to published papers are found in the Publications & Supplements section.

Standardized TCGA Data

We provide standardized, versioned snapshots of the TCGA Level 3 data found in the "Open Access HTTP Directories" at the TCGA data site. The TCGA data have been put through a "standardization" process and converted to a standard format: a matrix with samples as columns and "gene equivalents" (such as gene symbols, probe ids, and miRNA ids) as row labels.
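As an illustration of the matrix layout described above, the following minimal Python sketch parses a matrix with sample ids in the first row and row labels in the first column. The tab delimiter, the "NA" marker for missing values, the function name read_matrix, and the sample ids and values in the example are all assumptions made for illustration; they are not taken from the standardized files themselves.

```python
import csv
import io


def read_matrix(handle, delimiter="\t"):
    """Parse a matrix with sample ids in the header row and row labels in column one.

    The delimiter and the treatment of "NA" are assumptions, not a documented format.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # the first header cell labels the row-id column
    rows = {}
    for record in reader:
        if not record:
            continue  # ignore blank lines
        label, values = record[0], record[1:]
        # Empty cells and "NA" are stored as None (missing).
        rows[label] = [float(v) if v not in ("", "NA") else None for v in values]
    return samples, rows


# Tiny made-up example; the sample ids and values are illustrative only.
example = "gene\tSample_A\tSample_B\nTP53\t1.5\t2.0\nBRCA1\tNA\t0.25\n"
samples, rows = read_matrix(io.StringIO(example))
```

Keeping the row labels as dictionary keys makes it easy to look up one "gene equivalent" across all samples, at the cost of assuming the labels are unique.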

Proteomics

For analyzing proteomic data, we currently use


An Example Analysis Using Cromwell, Coombes et al 2005

To show how to use Cromwell, one of our current analysis packages, we've created an example using serum quality control (QC) data derived from the Pusztai et al 2004 dataset. The Cromwell package is described in Coombes et al, Proteomics 2005; to appear. An earlier version of this paper is available as a Technical Report (UTMDABTR-001-04).

This example is not completely self-contained: it requires MATLAB and the Rice Wavelet Toolbox.

At this point, the MATLAB scripts we have provided work on text files containing only the sample data from the XML files. We have provided a simple Perl script, xml2txt.pl, to strip the required data from the XML files. Both the Perl script and the processing scripts assume a fairly specific directory structure, so the hardcoded names must be changed to adapt them to a different layout.

Files available for download


Quality Control Study for Proteomics of Nipple Aspirate Fluid

The Cromwell package that we use to process mass spectrometry data is described in Coombes et al (Proteomics 2005; to appear; see also the preliminary Technical Report UTMDABTR-001-04). The development of the package, and the illustration of the method in that paper, used a set of 24 SELDI spectra collected from a pooled (quality control) sample of nipple aspirate fluid from breast cancer patients and healthy controls. These raw spectra and additional scripts used in the processing are available here:

More details on the breast cancer study from which these QC spectra were derived can be found in:

Pawlik TM, Fritsche H, Coombes KR, Xiao L, Krishnamurthy S, Hunt KK, Pusztai L, Chen JN, Clarke CH, Arun B, Hung MC, Kuerer HM.
Significant differences in nipple aspirate fluid protein expression between healthy women and those with breast cancer demonstrated by time-of-flight mass spectrometry.
Breast Cancer Res Treat. 2005 Jan;89(2):149-57.

Kuerer HM, Coombes KR, Chen JN, Xiao L, Clarke C, Fritsche H, Krishnamurthy S, Marcy S, Hung MC, Hunt KK.
Association between ductal fluid proteomic expression profiles and the presence of lymph node metastases in women with breast cancer.
Surgery. 2004 Nov;136(5):1061-9.

Coombes KR, Fritsche HA Jr, Clarke C, Chen JN, Baggerly KA, Morris JS, Xiao LC, Hung MC, Kuerer HM.
Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization.
Clin Chem. 2003 Oct;49(10):1615-23.


Simulated Proteomics Spectra for Method Development and Comparison, Morris et al.

In our paper on using the mean spectrum for peak finding and quantification, we simulated hundreds of proteomics data sets. We used the simulated data to compare the results of two different processing algorithms. The data sets are available here so other people can compare their algorithms to ours on a standard data set where the truth is known about what peaks are in each spectrum.
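To make the term "mean spectrum" concrete, here is a minimal Python sketch of the averaging step alone: intensities are averaged pointwise across spectra. It assumes that every spectrum is sampled on exactly the same mass axis, and the function name mean_spectrum and the toy values are hypothetical. It is not the processing code from the paper and covers none of the other steps of peak finding or quantification.

```python
def mean_spectrum(spectra):
    """Average intensities pointwise across spectra that share one mass axis.

    Each spectrum is a list of (mass, intensity) pairs.
    """
    if not spectra:
        raise ValueError("need at least one spectrum")
    masses = [mass for mass, _ in spectra[0]]
    for spectrum in spectra[1:]:
        # Assumption: all spectra are sampled at identical mass values.
        if [mass for mass, _ in spectrum] != masses:
            raise ValueError("spectra do not share a common mass axis")
    count = len(spectra)
    return [
        (mass, sum(spectrum[i][1] for spectrum in spectra) / count)
        for i, mass in enumerate(masses)
    ]


# Toy input with made-up values.
toy = [
    [(1000.0, 2.0), (1001.0, 4.0)],
    [(1000.0, 4.0), (1001.0, 8.0)],
]
avg = mean_spectrum(toy)
```

The explicit axis check is a deliberate choice: averaging spectra that are not aligned would silently blur peaks, so the sketch refuses to do it.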

Each zip file contains 25 data sets. Each data set unzips into its own directory, Dataset_X, where X is a number from 1 to 100.

Each data set directory contains two subdirectories: RawSpectra and truePeaks.

The RawSpectra subdirectory contains 100 text files. Each text file represents a single spectrum with two columns of data, one for the mass and one for the intensity.

The truePeaks subdirectory also contains 100 text files, representing the list of true peaks in the data. The truth is also given in two columns, with the first column containing the mass and the second column containing the number of ions of that mass in the simulation.

Finally, the Dataset_X directory contains a file called true_peaks.txt. This is a comma-separated-values UNIX text file with 4 columns, containing a description of the virtual population from which the 100 virtual spectra were generated. There is one row for each peak, which is described by its mass, its prevalence (the probability that it appears in an individual spectrum), its mean log intensity, and the standard deviation of the log intensity.
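The directory layout described above could be read with a short Python sketch like the one below. Several details are assumptions rather than documented facts: the two-column files are split on either commas or whitespace, any line that does not parse as numbers (for example a header line) is skipped, and the function names and dictionary keys are hypothetical.

```python
import csv
import re
from pathlib import Path


def read_two_columns(path):
    """Read a two-column numeric text file, splitting on commas or whitespace."""
    pairs = []
    with open(path) as handle:
        for line in handle:
            fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
            if len(fields) < 2:
                continue
            try:
                pairs.append((float(fields[0]), float(fields[1])))
            except ValueError:
                continue  # skip a header line, if one is present
    return pairs


def read_population(path):
    """Read true_peaks.txt: mass, prevalence, mean log intensity, SD of log intensity."""
    peaks = []
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            try:
                mass, prevalence, mean_log, sd_log = (float(v) for v in record[:4])
            except ValueError:
                continue  # skip a header, blank, or malformed row
            peaks.append({
                "mass": mass,
                "prevalence": prevalence,
                "mean_log_intensity": mean_log,
                "sd_log_intensity": sd_log,
            })
    return peaks


def read_directory(directory):
    """Map each file name in a directory to its list of two-column pairs."""
    return {
        p.name: read_two_columns(p)
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }


def load_dataset(root):
    """Load one Dataset_X directory into a dictionary."""
    root = Path(root)
    return {
        "spectra": read_directory(root / "RawSpectra"),      # (mass, intensity)
        "true_peaks": read_directory(root / "truePeaks"),    # (mass, ion count)
        "population": read_population(root / "true_peaks.txt"),
    }
```

A call such as load_dataset("Dataset_1") would then return the raw spectra, the per-spectrum true peaks, and the population description in one structure, which is convenient when scoring a peak-finding algorithm against the known truth.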

Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R.
Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum.
Bioinformatics. 2005; 21:1764-75.

Microarrays

Normalizer Array and Probe Sensitivity Index File

This zip file contains a digital standard Affymetrix U133A v1 array, a dChip Probe Sensitivity Index file, and instructions for using dChip as a common normalizing method for breast cancer samples.

The code for performing diagonal linear discriminant analysis on this data set is also available.


CEL files for 19 breast cancer cell lines

The goal of this study was to develop pharmacogenomic predictors of response to standard chemotherapy drugs in breast cancer cell lines and to test their predictive value in patients who received treatment with the same drugs. Nineteen human breast cancer cell lines were tested in vitro for sensitivity to paclitaxel (T), 5-fluorouracil (F), doxorubicin (A) and cyclophosphamide (C). Baseline gene expression data were obtained for each cell line with Affymetrix U133A gene chips, and multigene predictors of sensitivity were derived for each drug separately.

The zip file can be downloaded here.

Cornelia Liedtke, Jing Wang, Attila Tordai, William F. Symmans, Gabriel N. Hortobagyi, Ludwig Kiesel, Kenneth Hess, Keith A. Baggerly, Kevin R. Coombes and Lajos Pusztai.
Clinical evaluation of chemotherapy response predictors developed from breast cancer cell lines.
Breast Cancer Research and Treatment, Volume 121, Number 2, 301-309. (http://www.springerlink.com/content/g10397585l7051p6/)