|Description||The TCGA Batch Effects website analyzes TCGA data and provides quantitative and visual means for users to identify and quantify the amount of batch effect present in a given TCGA data set.|
|Language||D3 (JS), Dojo 1.7(JS), Java|
|News||Data for 21 different tumor types are now available|
|Help and Support|
A large, complex, multi-faceted, multi-institutional project such as The Cancer Genome Atlas (TCGA) collects tumor samples in different institutions and at different times. Because the samples are processed in batches rather than all at once, the data can be vulnerable to systematic noise such as batch effects (unwanted variation between batches) and trend effects (unwanted variation over time), which can lead to misleading analysis results.
This website is designed to help assess, diagnose and correct for any batch effects in TCGA data. It first allows the user to assess and quantify the presence of any batch effects via algorithms such as Hierarchical Clustering and Principal Component Analysis. The results from these algorithms are presented graphically as both simple and interactive diagrams. If significant batch effects is observed in the data, the user then has the option of downloading data that has been computationally corrected using methods such as Empirical Bayes (aka. ComBat), Median Polish and ANOVA.
The website is currently under development, so only a subset of TCGA level 3 data has been analyzed thus far. Our eventual goal is to completely and comprehensively annotate all TCGA data sets, and provide users with batch effects corrected data for all of them.
Before using TCGA data, please read TCGA guidelines for publication and moratoriums.
Below is a quick overview of the various steps involved in the TCGA Data Collection Process, each of which could potentially introduce systematic errors. Relevant terms are highlighted in bold and explanatory links have been provided wherever possible.
Dispersion Separability Criterion (DSC) is a new metric that has been designed to quantify the amount of batch effect in the data. It is based on a similar metric, the Scatter Separability Criterion. DSC is defined as:
Where Db is a measure of dispersion between batches (or other groupings of the data), and Dw is a measure of dispersion within batches. Therefore, DSC is a ratio of between batch dispersion vs. within batch dispersion. More precisely, Db is defined as:
and Dw is defined as:
Where Sb is the "between batch" scatter matrix, and Sw is the "within batch" scatter matrix defined in " (Dy et al, 2004). Dw can roughly be viewed as the average "distance" between samples within a batch and the batch mean, or centroid, whereas Db can roughly be viewed as the average distance between batch centroids and the global mean.
A higher DSC value means that there is a greater dispersion between batches than within batches, i.e. the samples within batches are more homogeneous to each other than the batches themselves are to each other. It is a continuous positive number and as such, there is no absolute threshold where one could binarize the results, so that values above the threshold indicate a batch effect and vice versa. However, as a rule of thumb, for DSC values significantly below 0.5, one can usually assume that batch effects aren't very strong. Values above 0.5 indicate that one needs to consider the possibility of batch effects existing in the data. Values above 1 usually indicate strong batch effects that may need to be corrected before the data can be used for analysis. However, one also needs to consider the DSC p-value before making such an assessment.
Batches can sometimes contain outliers that may skew the DSC value, since the value is based on taking averages. This may especially be a problem when the batch sizes are small. To counter that possibility, another metric, the DSC p-value, is provided to aid researchers in assessing the statistical significance of the DSC value. The p-value is derived empirically, using permutation tests. One thousand or more permutations are typically run on each data set, using different random permutations of the data each time. DSC values are computed for each permutation and at the end of all the runs, the proportion of values greater than the actual DSC value is computed to yield the p-value. The null hypothesis is that there are no batch effects and the data set is homogeneous in terms of batches. A p-value less than some significance threshold (usually 0.05) rejects the null hypothesis.
However, it should be noted that large enough sample sizes can yield significant p-values even though the overall difference between batches may be very small. This is typical of most p-value based hypothesis testing. Therefore, we recommend that both p-value as well as the DSC value itself be used to assess whether batch effects are are present in the data or not. Batch effects would be expected to be present if the p-value were significant (less than 0.05 for example) AND the DSC value was high (greater than 0.5 for instance). If either of those conditions was false, batch effects can be presumed to be less serious.
The website provides data that has been corrected for batch effects. Users will have the choice to assess the amount of batch effects in the original TCGA data, and if they feel that the batch effects are significant, they can download computationally corrected data. The corrected data is computed using the Empirical Bayes (aka.Combat), Median Polish and ANOVA. Batch effects assessment results after correction will be made available for each method to allow users to make an informed decision about which data sets they want to use for their analyses.
The Menu allows you to control viewing options and gain access to website documentation.
The Data Browser on the left provides various means to select data for viewing. The query form allows one to select data by standard TCGA data fields such as Disease Type, Center/Platform, Data Level and Data Set. The Algorithmic-specific scores allows one to zoom in on data sets that registered particularly high DSC scores. The Data Browser can be hidden to allow for more space to view the diagrams.
The Tabbed Viewing Area in the bottom right allows one to open multiple diagrams and tables at once.
The Console at the bottom currently does not do anything, but will show additional textual information in the future.
Parts (1), (2), (3), (4), (5) together forms the Query Form on the left. The Query Form consists of a number of select boxes that allows the user to navigate through the different diagrams by choosing properties related to the data set and assessment algorithm. The page refreshes when the user makes a change to any one of the select boxes, and the diagram image and the accordion menu on the right will be updated correspondingly.
You can identify data showing large batch effects quantitatively by using the Algorithm-Specific Scores pane in the Data Browser. Currently, only DSC scores are available for viewing.
To use this interface:
Picking the score type will launch the corresponding table on the right. You can sort the data by clicking on the header of the column you wish to sort by. Repeated clicks on the same column header will alternate the sort between ascending and descending order.
Hierarchical Clustering diagrams groups samples together based on similarity.
There are often multiple legends associated with each Hierarchical Clustering image, one associated with each row at the bottom of the diagram. You can select the legend you want using the combobox underneath the legend panel on the right. You can also hide the legend panel by clicking on the small circular arrow button within the Blue 'Legend' title.
You can view scatter plots of the first 4 components of PCA analysis on the interactive PCA diagrams.
There are two types of PCA diagrams you can pick using the query form:
All available batches are plotted against one another.
A single batch can be selected using the 'Select batch' combobox at the bottom, while all other batches are grouped together using a single color.
You can zoom in/out of the diagram by using the mouse scroll wheel. Double-clicking on the diagram will cause the diagram to zoom in as well.
You can pan the diagram by clicking and dragging across the canvas.
You can pick the PCA components you want to plot using the x-axis and y-axis comboboxes below.
To view details for each point, mouse over the points and wait for a second. The details of the points will appear in the Datapoint log in the lower right, and the group that the datapoint belows to will be highlighted in the Legend. Mousing over the centroid will highlight the entire group.
You can also optionally display a popup label for the point by using the 'Toggle Tooltips' button at the top.
You can open a diagram in a new window if you wish to compare 2 diagrams side by side by clicking on the 'View in New Window' button. This will launch the diagram in a new window:
Resizing the window will automatically change the diagram size to match.
You can bookmark a diagram to revisit later, or to share with other users, by clicking on the 'Bookmark' button. This will launch a dialog box with a link on it. To bookmark the diagram, right-click on the link and add it to your bookmarks (may vary across browsers). To share the diagram with other users, right-click on the link, copy the url and share it with others.
When you click on "Related Documents", you can see a list of files associated with the chosen diagram and data set:
To download the files, mark the checkboxes for the ones you are interested in, and click on the 'Download Selected Files' button. This will download a zip file to your computer containing the selected files. Alternatively, you can also view certain files on the website itself using the file's 'View' button. This will launch a separate tab displaying its contents.
A detailed description of each file type is given below:
The following files are available for download on all plot typen.
The following text files are available for viewing or download with all PCA plots.
We require recent browser technology, and we have tested on the following browsers:
There are known compatibility problems with Internet Explorer 8 and below, as well as Opera 11.62.
For Frequently Asked Questions, Bug Reports, and other concerns, please visit the forum at this link