Object Oriented Microarray Library: Designs

Version 4.0

The Object Oriented Microarray Library is a suite of object-oriented programming modules written in S-PLUS for the analysis of microarray experiments. The entire library can be compiled by running the script microarrays.ssc, which sources the remaining sections of code, including all the built-in designs. A separate script, load-all-designs.ssc, lists the design source files.

A design is a description of the genes that have been spotted on a microarray. We store each design as a data frame in which each row corresponds to a spot on the array. The number of columns varies, depending on how much information we have accumulated from the manufacturer about the material used to produce the array.

New Methods For Designs

Because the manufacturers provide descriptive information about the genes under many different names, most design objects are given another class designation in addition to that of a data frame. This additional class structure allows us to implement methods that will provide a common interface to the descriptive information we need. At present, there are two such methods:

get.accession(object): This method returns a text string containing the GenBank accession number of the gene.
get.name(object): This method returns a text string containing a descriptive name for the gene.
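The common interface described above can be sketched with ordinary S-style generic functions. This is a minimal illustration, not the library's own source: the method bodies are assumptions, the toy rows and accession strings are invented placeholders, and the column names Accession and Name are taken from the description of cg4.gene.list. The code is written to run in R; in S-PLUS the class assignment may need oldClass instead of class.

```r
# Generic functions: dispatch on the extra class attached to a design.
get.accession <- function(object) UseMethod("get.accession")
get.name <- function(object) UseMethod("get.name")

# Hypothetical methods for class "cg4list" (assumed column names).
get.accession.cg4list <- function(object) as.character(object$Accession)
get.name.cg4list <- function(object) as.character(object$Name)

# Toy design with invented rows; real designs are far larger.
toy <- data.frame(Location = c("A1a1", "A1a2"),
                  Accession = c("ACC0001", "ACC0002"),
                  Name = c("toy gene one", "toy gene two"))
class(toy) <- c("cg4list", "data.frame")

get.accession(toy)
get.name(toy)
```

A design from another manufacturer would supply its own methods under the same two generic names, so calling code never needs to know the manufacturer's column names.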

CG4 Named Gene Array

The "CG4 Named Gene" microarray produced by the M.D. Anderson Cancer Genomics Core Laboratory contains 4800 spots, consisting of 2304 distinct genes spotted in duplicate, 48 positive controls, 48 negative controls, and 96 blanks.

cg4.gene.list: This object is a data frame with 4800 rows and three columns describing the CG4 array. The columns provide the Location of the spot (in the form A1a1), the GenBank Accession number, and a character string giving a Name for the gene.
cg4.short.gene.list: This object is another data frame, with 2304 rows and 4 columns. Each row corresponds to one gene and its pair of duplicate spots on the array; the two locations are stored in separate columns, one of which is named Location2.

Both of these objects are of class cg4list.

CG8 Pathways Array

The "CG8 Pathways" microarray produced by the M.D. Anderson Cancer Genomics Core Laboratory contains 3702 spots, consisting of 1152 distinct genes spotted in duplicate, 96 positive controls, 96 negative controls, and 192 regularly spaced blanks. There are additional blank spots, usually near the ends of the subgrids.

cg8.gene.list: This object is a data frame with 3702 rows and four columns describing the CG8 array. The columns provide the Location of the spot (in the form A1a1), the GenBank Accession number, a character string containing the standard Symbol for the gene, and a character string giving a Name for the gene.
cg8.short.gene.list: This object is another data frame, with 1344 rows and 4 columns. Each row corresponds to a pair of duplicate spots on the array; the two locations are stored in separate columns, one of which is named Location2, and the gene symbol is inexplicably omitted. Note that only 1152 of the 1344 rows actually correspond to genes; the others can probably be detected by applying is.na to the Accession column.

Both of these objects are of class cg8list.
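The suggestion of applying is.na to the Accession column can be sketched as follows. This assumes that the non-gene rows really do carry a missing value in Accession, which the description above only calls probable. The toy data frame and its contents are invented stand-ins for cg8.short.gene.list.

```r
# Invented stand-in for cg8.short.gene.list (only two columns shown).
toy.short <- data.frame(Accession = c("ACC0001", NA, "ACC0002", NA),
                        Name = c("toy gene one", "blank",
                                 "toy gene two", "control"))

# TRUE for rows that carry an accession number, i.e. presumed genes.
is.gene <- !is.na(toy.short$Accession)

# Keep only the presumed gene rows.
genes.only <- toy.short[is.gene, ]
sum(is.gene)
```

On the real object one would expect sum(is.gene) to equal 1152 if the assumption holds, leaving 1344 - 1152 = 192 non-gene rows.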

Clontech ATLAS Human Cancer 1.2 Microarray

The Human Cancer 1.2 is a commercial nylon microarray produced by Clontech. It contains 1185 spots, nine of which are housekeeping genes spotted below the main rectangular grid of spots.

hcan: This object is a data frame with 1185 rows and 7 columns. The column names are the ones automatically generated by S-PLUS when you read in the manufacturer's file; they include the location, gene name and symbol, GenBank accession, and SwissProt accession numbers.

Research Genetics GeneFilter Microarrays

Research Genetics produces a series of nylon microarrays containing different sets of genes. The typical configuration consists of two fields of eight grids each, where each grid contains 12 columns and 30 rows of spots. Thus, there are typically 5760 spots on the array. Within each grid, there are 12 control spots of total genomic DNA and 12 blank spots. Research Genetics provides a great deal of information (about 20 columns, but much of it is out of date) about the genes on the arrays. In July 2001, we updated that information for the GF200-GF205 arrays, and so the annotations contained here are probably better than those obtained directly from the company.
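The spot counts for the typical configuration follow directly from the layout. The short calculation below uses only the numbers quoted above; the variable names are introduced here for illustration.

```r
fields <- 2            # fields per array
grids.per.field <- 8   # grids in each field
n.cols <- 12           # columns of spots per grid
n.rows <- 30           # rows of spots per grid

n.grids <- fields * grids.per.field       # 16 grids
spots.per.grid <- n.cols * n.rows         # 360 spots per grid
total.spots <- n.grids * spots.per.grid   # 5760 spots per array

tg.spots <- n.grids * 12      # total genomic DNA controls on the array
blank.spots <- n.grids * 12   # blank spots on the array
```

Under this layout an array carries 192 total genomic control spots and 192 blanks.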

gf200: This object is a data frame describing Release I (GF200) of the Human GeneFilters, containing 5760 spots as described above.
gf201: This object is a data frame describing Release II (GF201) of the Human GeneFilters, containing 5760 spots as described above.
gf202: This object is a data frame describing Release III (GF202) of the Human GeneFilters, containing 5760 spots as described above.
gf203: This object is a data frame describing Release IV (GF203) of the Human GeneFilters, containing 5760 spots as described above.
gf204: This object is a data frame describing Release V (GF204) of the Human GeneFilters, containing 5760 spots as described above.
gf205: This object is a data frame describing Release VI (GF205) of the Human GeneFilters, containing 5760 spots as described above.
gf211: This object is a data frame describing the "Named Gene" release of the Human GeneFilters. Although laid out in the same pattern as the other GeneFilters, one half of the grids are approximately half empty, by design.

These seven objects are all of class rglist.

The design information for the GeneFilters includes additional objects along with additional functions for processing them.

is.control: This logical vector of length 5760 describes the location of the control spots on the arrays.
is.hkg: This logical vector of length 5760 describes the location of the housekeeping genes on the arrays.
is.tg: This logical vector of length 5760 describes the location of the total genomic DNA control spots on the arrays.
is.blank: This logical vector of length 5760 describes the location of the blank spots on the arrays.
is.rg: This logical vector of length 5760 describes the location of additional control spots on the arrays. These are designated as "RG" spots; we are not sure why the company marks these spots in particular.
rg2arv: This vector maps the coordinate system preferred by Research Genetics to the coordinate system preferred by ArrayVision (unless it does the reverse). Use at your own risk.
f.grid(n, sz): The input n is an index representing a spot on the array (or a row in the data produced by ArrayVision); the output is a list of the indices describing all spots in the same grid. The sz argument is optional; it defaults to the vector c(8, 2, 12, 30, 0, 2) describing the grid geometry.
patch.bkgd.extractor(channel): This is an extractor method, which replaces the local background at each spot with the median background on its grid.
top.patch.bkgd.extractor(channel, extra): This is an extractor method. It returns the background-corrected volume measurements, using local background at each spot up to a cap specified as a percentile by the extra argument (which defaults to 0.5). This percentile is used to compute a maximum allowable background on each grid. (The term "grid" is preferred by the company; it is synonymous in our usage with "patch".)
patchwise.extractor(channel, extra): This is an extractor method. It returns the background-corrected volume measurements, using local background at each spot up to a cap specified as a percentile by the extra argument (which defaults to 0.5). Values on each grid are then rescaled to adjust the mean of the total genomic spots within the grid to equal 100.
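The grid-based ideas behind f.grid and the background extractors can be illustrated with a rough sketch. None of this is the library's code: the three function names are hypothetical, and the sketch assumes that spots are numbered consecutively one grid after another, which may not match the ordering used by Research Genetics or ArrayVision. The helpers work on a plain numeric vector of local backgrounds rather than on a channel object.

```r
# Hypothetical: indices of all spots in the grid containing spot n,
# ASSUMING consecutive numbering, grid by grid.
same.grid.sketch <- function(n, spots.per.grid = 12 * 30) {
  g <- (n - 1) %/% spots.per.grid   # zero-based grid number
  (g * spots.per.grid + 1):((g + 1) * spots.per.grid)
}

# Hypothetical: replace each local background by the median on its grid.
patch.median.sketch <- function(bkgd, spots.per.grid = 12 * 30) {
  grid.id <- (seq(along = bkgd) - 1) %/% spots.per.grid
  grid.median <- tapply(bkgd, grid.id, median)
  as.vector(grid.median[as.character(grid.id)])
}

# Hypothetical: cap each local background at a per-grid percentile.
capped.bkgd.sketch <- function(bkgd, extra = 0.5, spots.per.grid = 12 * 30) {
  grid.id <- (seq(along = bkgd) - 1) %/% spots.per.grid
  cap <- tapply(bkgd, grid.id, quantile, probs = extra)
  pmin(bkgd, as.vector(cap[as.character(grid.id)]))
}

# Invented backgrounds for two tiny grids of three spots each.
bkgd <- c(1, 2, 9, 10, 20, 60)
patch.median.sketch(bkgd, spots.per.grid = 3)
capped.bkgd.sketch(bkgd, extra = 0.5, spots.per.grid = 3)
```

With the default percentile of 0.5 the cap equals the grid median, so any spot whose local background exceeds the median of its grid is treated as having the median background.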

NCI60

The NCI60 array design refers to the microarrays used by Ross et al. to study the NCI60 cell lines; this data set is publicly available and is one of the data sets being used for the 2001 CAMDA competition. Most of the arrays have 4 grids of size 50 by 50, for a total of 10000 spots. A few of the early arrays contain four grids of size 49 by 51, for a total of 9996 spots. All of the actual genes are in the same order within a grid, when read left-to-right and top-to-bottom. The additional spots are blank.
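The two layouts can be compared with a short calculation that uses only the figures quoted above; the variable names are introduced here for illustration.

```r
n.grids <- 4
standard.spots <- n.grids * 50 * 50   # most arrays
early.spots <- n.grids * 49 * 51      # a few early arrays

# Difference in spot count between the two layouts.
extra.spots <- standard.spots - early.spots
```

The two layouts therefore differ by only four spots per array.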

nci60.design: This object is a data frame with 10000 rows and 40 columns. The large number of columns results from the fact that we have performed extensive annotation of this data set; 24 of the columns are vectors that describe functions that the genes are known to participate in.