Public Datasets

This site is a repository for selected datasets that have been collected and analyzed by investigators at MD Anderson. We have tried to provide a reasonable amount of explanation. Certain tools used to analyze these data are also posted under Software.

Please note that supplementary datasets for published papers are on the Supplements page.

Standardized TCGA Data

We provide standardized and versioned snapshots of the TCGA Level 3 data in the “Open Access HTTP Directories” found at the TCGA data site. The TCGA data has been put through a “standardization” process and converted to a standard format consisting of a matrix with samples as columns and “gene equivalents” (such as gene symbols, probe IDs, and miRNA IDs) as row labels.
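As a rough illustration of the matrix layout described above, the following Python sketch reads a small samples-as-columns table into memory. The tab delimiter, the "NA" missing-value marker, the sample names, and the `read_matrix` helper are all assumptions made for this example; they are not taken from the standardized files themselves, so check an actual file before relying on them.

```python
import csv
import io

# Hypothetical tab-delimited excerpt: the first column holds the
# gene-equivalent label and each remaining column is one sample.
# All names and values here are made up for illustration.
EXAMPLE = (
    "gene\tSAMPLE-01\tSAMPLE-02\tSAMPLE-03\n"
    "TP53\t5.1\t4.8\t6.0\n"
    "BRCA1\t2.3\tNA\t2.9\n"
)


def read_matrix(handle, missing="NA"):
    """Return (sample_ids, {row_label: [values]}) for a samples-as-columns matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # skip the label column
    rows = {}
    for record in reader:
        if not record:
            continue  # tolerate blank lines
        label, cells = record[0], record[1:]
        if len(cells) != len(samples):
            raise ValueError(
                f"row {label!r} has {len(cells)} values, expected {len(samples)}"
            )
        # Assumed convention: the missing marker becomes None.
        rows[label] = [None if c == missing else float(c) for c in cells]
    return samples, rows


samples, rows = read_matrix(io.StringIO(EXAMPLE))

# Pivot to a per-sample view: {sample_id: {row_label: value}}.
by_sample = {
    s: {label: values[i] for label, values in rows.items()}
    for i, s in enumerate(samples)
}
```

Keeping the row labels as dictionary keys makes it easy to look up one gene equivalent across all samples, while the `by_sample` pivot gives the per-sample profile.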

Older public datasets

The following data is obsolete. We provide it for historical reasons.

Data for the Cromwell proteomics package (from about 2005).
Data and code for analyzing breast cancer microarray data.
CEL files for 19 breast cancer cell lines.
Microarray Data.
Supplement to Wang J, Coombes KR, Highsmith WE, Keating MJ, Abruzzo LV. Differences in gene expression between B-cell chronic lymphocytic leukemia and normal B cells: a meta-analysis of three microarray studies. Bioinformatics. 2004; 20:3166-78.
MDA133: Clinical Data and dChip MBEI value Files
Supplement to: Hess et al., Pharmacogenomic Predictor of Sensitivity to Preoperative Chemotherapy With Paclitaxel and 5-Fluorouracil, Doxorubicin, and Cyclophosphamide in Breast Cancer, Journal of Clinical Oncology, 24 (26), 2006. The latest version of this file includes “molecular class” information on a subset of 82 cases.
MDA133: CEL files for Predictor Training and Validation Data Sets
Supplement to: Hess et al., Pharmacogenomic Predictor of Sensitivity to Preoperative Chemotherapy With Paclitaxel and 5-Fluorouracil, Doxorubicin, and Cyclophosphamide in Breast Cancer, Journal of Clinical Oncology, 24 (26), 2006.
39 samples with replicate gene expression data.zip Replicate RNA hybridizations
Supplementary data for: Anderson K, Hess KR, Kapoor M, Tirrell S, Courtemanche J, Wang B, Wu Y, Gong Y, Hortobagyi GN, Symmans WF, Pusztai L. Reproducibility of gene expression signature-based predictions in replicate experiments. Clin Cancer Res 2006;12:1721-7.
CEL files for MDACC-FNA-CBX-74
This zip file contains CEL files and sample matching information for: Bianchini, G., Qi, Y., Alvarez, R.H., Iwamoto, T., Coutant, C., Ibrahim, N.K., Valero, V., Cristofanilli, M., Green, M.C., Radvanyi, L., Hatzis, C., Hortobagyi, G.N., Andre, F., Gianni, L., Symmans, W.F. and Pusztai, L. Molecular Anatomy of Breast Cancer Stroma and Its Prognostic Value in Estrogen Receptor-Positive and -Negative Cancers, Journal of Clinical Oncology, Published online before print August 30, 2010.
Testing Response to Chemotherapy in Breast Cancer, Pusztai et al. 2004
This dataset consists of 620 sample and QC SELDI spectra used in Pusztai et al., “Pharmacoproteomic Analysis of Prechemotherapy and Postchemotherapy Plasma Samples from Patients Receiving Neoadjuvant or Adjuvant Chemotherapy for Breast Carcinoma”, Cancer 2004; 100:1814-1822. Summary of Study: Proteomic changes in NAF plasma were measured before and after paclitaxel or FAC (5-fluorouracil, doxorubicin, and cyclophosphamide) chemotherapy in patients with Stage I-III breast carcinoma to assess response to the chemotherapy. Samples from healthy women were also taken to help identify breast carcinoma-associated protein markers. Full Abstract
Simulated Proteomics Spectra for Method Development and Comparison, Morris et al.
In our paper on using the mean spectrum for peak finding and quantification, we simulated hundreds of proteomics datasets. We used the simulated data to compare the results of two different processing algorithms. The datasets are available here so that others can compare their algorithms with ours on a standard dataset where the true peaks in each spectrum are known.
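Because the true peak locations are known for simulated spectra, a peak finder's output can be scored directly against them. The sketch below shows one simple way to do that in Python: greedy matching of detected peaks to true peaks within a relative m/z tolerance, followed by sensitivity and false discovery rate. The `score_peaks` function, the 0.3% tolerance, and the example m/z values are assumptions chosen for illustration; they are not necessarily the scoring rules used in the paper.

```python
def score_peaks(true_mz, found_mz, tol=0.003):
    """Score detected peaks against known true peaks.

    A detected peak counts as a hit if it lies within a relative
    tolerance `tol` of a true peak that has not been matched yet.
    Matching is greedy, in order of increasing m/z (an assumed,
    deliberately simple rule).
    Returns (sensitivity, false_discovery_rate).
    """
    unmatched = sorted(true_mz)
    hits = 0
    for mz in sorted(found_mz):
        for t in unmatched:
            if abs(mz - t) <= tol * t:
                unmatched.remove(t)  # each true peak is matched at most once
                hits += 1
                break
    sensitivity = hits / len(true_mz) if true_mz else 0.0
    fdr = (len(found_mz) - hits) / len(found_mz) if found_mz else 0.0
    return sensitivity, fdr


# Made-up example: four true peaks, three detected peaks.
true_peaks = [1000.0, 2000.0, 3000.0, 4000.0]
detected = [1001.0, 2500.0, 3995.0]
sensitivity, fdr = score_peaks(true_peaks, detected)
```

In this example two of the four true peaks are recovered and one of the three detected peaks is spurious. Running the same scoring over many simulated spectra gives a common basis for comparing two processing algorithms.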

Department of Bioinformatics and Computational Biology
