
GeneClust R User’s Guide

Description

Gene Shaving is a method for identifying clusters of similarly behaving genes whose changes in expression are most tightly linked to observed biological changes. The basic method is similar to principal components analysis (singular value decomposition, maximum eigenvalue, etc.) with a sequential twist: a canonical “gene vector” is identified based on the eigenvectors, and the genes are ranked according to their agreement with this vector. The worst-fitting genes are then “shaved off”, and a new canonical vector is identified and fit to the genes that remain.
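The shaving loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the GeneClust backend: the function names, the `fraction` and `min_size` parameters, and the use of power iteration to approximate the leading principal component are all assumptions made for the example, and the step that chooses the best cluster size from the nested sequence is omitted.

```python
import math


def center_rows(matrix):
    """Subtract each gene's mean so that every row has mean zero."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_component(rows, iterations=100):
    """Approximate the first principal component (over samples) by power iteration."""
    # Start from the largest row; a constant vector would be annihilated
    # by row-centered data.
    vec = list(max(rows, key=lambda r: dot(r, r)))
    for _ in range(iterations):
        scores = [dot(row, vec) for row in rows]
        vec = [sum(s * row[j] for s, row in zip(scores, rows))
               for j in range(len(vec))]
        norm = math.sqrt(dot(vec, vec))
        if norm == 0.0:
            raise ValueError("expression matrix has no variation")
        vec = [v / norm for v in vec]
    return vec


def shave(matrix, names, fraction=0.1, min_size=1):
    """Return the nested sequence of gene sets produced by repeated shaving.

    `fraction` (share of genes dropped per pass) and `min_size` are
    hypothetical parameters chosen for this sketch.
    """
    rows = center_rows(matrix)
    current = list(range(len(rows)))
    sequence = [[names[i] for i in current]]
    while len(current) > min_size:
        # Recompute the canonical vector from the genes that remain.
        vec = leading_component([rows[i] for i in current])
        # Rank genes by absolute agreement with the canonical vector.
        ranked = sorted(current, key=lambda i: abs(dot(rows[i], vec)),
                        reverse=True)
        n_drop = max(1, int(len(current) * fraction))
        keep = max(min_size, len(current) - n_drop)
        current = sorted(ranked[:keep])
        sequence.append([names[i] for i in current])
    return sequence
```

Calling `shave(matrix, names)` with a genes-by-samples list of lists returns every intermediate gene set, from the full list down to `min_size` genes, so a caller can inspect how the cluster shrinks at each pass.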

The GeneClust distribution is a de novo implementation of the Gene Shaving method. GeneClust consists of three components:

A Java frontend that accepts and validates user input.
An R or S-plus backend for performing the statistical method and for generating graphical output.
A pseudo-terminal application qua agent that links the Java frontend to the backend, accepting R commands from the Java frontend and returning status output from the backend process.

User Interface

Before invoking the application, the user must create the data and output directories, and place the data file(s) to be analyzed in the data directory. (These directory names may be overridden by environment variables: see Environment Variables below.)

The data must be stored in a tab-separated (tsv) file. It may optionally contain an initial header line containing column names. The first column (of all rows) may optionally contain the row names.
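As an illustration of the file layout just described, the following Python sketch parses tab-separated text with an optional header line and optional row names. The function name and its two flags are hypothetical, and the handling of the header's corner cell (dropped when the header has one more entry than the data columns) is an assumption; GeneClust's own reader may behave differently.

```python
import csv
import io


def read_expression_tsv(text, has_header=True, has_row_names=True):
    """Parse tab-separated text into (column_names, row_names, values)."""
    records = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]

    column_names = None
    if has_header and records:
        column_names = records[0]
        records = records[1:]

    row_names = None
    if has_row_names:
        row_names = [r[0] for r in records]
        records = [r[1:] for r in records]

    # Every remaining cell is expected to be numeric.
    values = [[float(cell) for cell in r] for r in records]

    # Assumption: if the header also labels the row-name column,
    # drop that corner cell so names line up with the data columns.
    if column_names is not None and values:
        if len(column_names) == len(values[0]) + 1:
            column_names = column_names[1:]

    return column_names, row_names, values
```

For a file with neither a header nor row names, pass `has_header=False, has_row_names=False`; the first two elements of the result are then `None`.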

The application GUI consists of four panels:

the data input panel
the shaving parameters panel
the display selection panel, and
the command panel.

The information input by the user is checked for validity. Invalid input will cause the offending field to be highlighted (in black). Tool tips for numeric range information are provided.

Selecting the data to process

To process data stored in a tab-separated (tsv) file, select the File Data tab in the data input panel and complete the FileName field in that panel. Pressing the Select… button will pop up a file selection dialog to simplify choosing the correct filename.

Figure: GeneClust in File Data Mode

GeneClust can also generate random data for demonstration or testing purposes. To generate synthetic data, select the Demo tab in the data input panel and specify the model parameters.

Creating Gene Shaving Clusters

Specify the desired Gene Shaving parameters using the fields in the Shaving Parameters panel. If Percent Supervision is zero, unsupervised shaving will be performed. Otherwise, the Filename field in this panel must contain a valid classification file for the data being analyzed. If Percent Supervision is 100, complete supervision will be performed, otherwise partial supervision.
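The rule relating Percent Supervision to the kind of shaving performed can be written as a small decision function. This Python sketch is only illustrative: the function name, the returned mode labels, and the assumption that the value must lie between 0 and 100 are not taken from the GeneClust source.

```python
def supervision_mode(percent, classification_file=None):
    """Map a Percent Supervision value to the kind of shaving performed."""
    if not 0 <= percent <= 100:
        raise ValueError("percent supervision must be between 0 and 100")
    if percent == 0:
        # No classification file is needed for unsupervised shaving.
        return "unsupervised"
    if not classification_file:
        raise ValueError("supervised shaving requires a classification file")
    return "full" if percent == 100 else "partial"
```

For example, `supervision_mode(50, "samples.clf")` selects partial supervision, where `samples.clf` is a made-up filename.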

After the appropriate parameters have been set, press the Shave button to import (or create) the data to be analyzed and obtain the specified number of clusters.

The frontend will create a backend R process, send it commands to perform the requested shaving, and open a process monitor that displays informative messages about the progress of the analysis. When the process monitor displays S interpreter processing complete, press the End Simulation button to destroy the backend process. If the backend process does not complete successfully, the End Simulation button can be pressed at any time to interrupt the analysis and destroy the backend process.

Displaying and/or Printing Gene Shaving Clusters

After the backend process completes successfully, the Gene Shaving results can be displayed or printed. Optionally, the original data matrix and a hierarchical clustering of that matrix (by genes or samples) may also be displayed or printed.

Note: In this version, these optional displays cannot be generated until the backend process has completed successfully.

To display the desired graphs, check the Geneshaving Clusters checkbox and any additional checkboxes in the Display Selection panel, then press the Display button. The frontend will invoke a backend R process to generate the desired displays. After viewing the displays, press the End Simulation button to destroy the graphs and the backend process. The graphs may be displayed as many times as desired by repeatedly pressing the Display button.

To print the desired graphs, check the appropriate checkboxes and press the Print button. The graphs will be written as Encapsulated PostScript figures to files with .eps extensions in the output directory.

Menus

File Pulldown Menu

The File pulldown menu contains all the generic file handling options.

Open…: Reads settings from a configuration file.
Save…: Writes current settings to a configuration file.
Quit: Quits the application.

Help Pulldown Menu

The Help pulldown menu contains all the options providing basic assistance in using the application.

Overview: Provides a high level description of the application’s purpose.
User Guide: Displays this document via web browser.
About: Provides information about the application itself.

Input/Output

By default, GeneClust expects its input to reside in the data subdirectory and will store all results in the output subdirectory.

Environment variables:
- GCDATA: Alternative directory where data files exist. Defaults to ./data directory if unset.
- GCOUTPUT: Alternative directory where simulation output files should be stored. Defaults to ./output directory if unset.
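The fallback behavior of these two variables can be illustrated with a short Python sketch. The function name is hypothetical, an empty value is treated as unset (an assumption), and the .geneclustrc files are not consulted because their format is not described here.

```python
import os


def resolve_directories(environ=None):
    """Return (data_dir, output_dir), honoring GCDATA and GCOUTPUT when set."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value counts as unset and falls back to the default.
    data_dir = env.get("GCDATA") or "./data"
    output_dir = env.get("GCOUTPUT") or "./output"
    return data_dir, output_dir
```

Passing a plain dictionary instead of the real environment makes the lookup easy to try out, for example `resolve_directories({"GCDATA": "/tmp/in"})`.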
Files:
- $HOME/.geneclustrc: User configuration file used to override system environment variables.
- ./.geneclustrc: Configuration file used to override system and user environment variables.
- <settings>.cf: Configuration file used to store simulation settings.
- <settings>.clf: Classification file used to perform supervised shaving.
- <datafile>.tsv: ASCII file that contains tab-separated values.

Department of Bioinformatics and Computational Biology