geneSmash
Overview | |
Description | geneSmash is a mash-up of various sources of information about human genes including the NCBI Entrez gene FTP site, UCSC Genome Browser, miRBase and human gene expression array annotation extracted from manufacturers’ websites. |
Development Information | |
URL | https://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/index.html |
Language | JavaScript (ETL scripts in Perl) |
Current version | 1.1 |
Platforms | Platform independent |
License | Not required |
Status | Active |
Last updated | July 1, 2012 |
References | |
Citation | Manyam G, Payton M, Roth J, Abruzzo L, Coombes K. (2012). Relax with CouchDB - Into the non-relational DBMS era of bioinformatics. Genomics 100(1), 1-7. http://doi.org/10.1016/j.ygeno.2012.05.006 |
Help and Support | |
Contact | MDACC-Bioinfo-IT-Admin@mdanderson.org |
geneSmash is a mash-up of various sources of information about human genes. The primary sources at the time of this writing are the NCBI Entrez Gene FTP site, the UCSC Genome Browser, and miRBase.
Human gene expression array annotation is extracted from the manufacturers' (Affymetrix, Agilent, and Illumina) websites.
Currently, probe annotation for various human gene expression array platforms from these manufacturers is available in geneSmash.
Other sources may be incorporated in the future. These sources of information have been combined into a simple CouchDB database. As a consequence, we can build tools that make it possible to find the genomic location of a gene from its symbol, or to map easily between other classes of gene identifiers.
The geneSmash web site provides one set of search tools built upon this infrastructure. You can enter the official gene symbol to get back the genome location, along with links out to the source databases at Entrez Gene or the UCSC Genome Browser. Alternatively, you can search for genes by alias, by gene expression probe in a microarray, by cytoband location, or by giving a range of base positions in the human genome. You can also write your own programs (see below) or add your own web applications on top of a local copy of the database.
Mirrors and Replication
CouchDB provides native support for database replication. You can use those facilities to make (and maintain) a local copy of the entire geneSmash database. Because replication copies items at the granularity of an individual document (which in this instance means the collection of information about one gene), it is much gentler on network resources than copying entire source files from the NCBI or UCSC. This advantage becomes particularly pronounced during maintenance, since a second replication copies only the documents that have changed since the last time you replicated.
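
As a rough illustration, CouchDB's standard `_replicate` endpoint accepts a JSON body that names a source and a target database. The sketch below only builds that body in R. The local database name `genesmash` and the local server address `http://localhost:5984` are placeholders, and the sketch assumes the public server permits replication from your network, which is not confirmed here.

```r
# Build the JSON body for a CouchDB replication request.
# ASSUMPTIONS: the target name "genesmash" and the local server address
# mentioned below are placeholders; adjust both for your installation.
replicationBody <- function(source, target, create_target = TRUE) {
  sprintf('{"source":"%s","target":"%s","create_target":%s}',
          source, target, if (create_target) "true" else "false")
}

body <- replicationBody(
  source = "http://app1.bioinformatics.mdanderson.org/genesmash",
  target = "genesmash"
)
cat(body, "\n")

# The body would then be POSTed to the local server, for example:
#   curl -X POST http://localhost:5984/_replicate \
#        -H "Content-Type: application/json" -d '<body>'
```

Running the same request again later should transfer only the documents that changed, which is the incremental behavior described above. Recent CouchDB releases typically require administrator credentials for this call.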
If you replicate the database, we request that you maintain links to the geneSmash logo and to the University of Texas MD Anderson Cancer Center.
Because geneSmash is implemented using CouchDB, all of the data is available through a RESTful interface. Calls are made to the server using standard HTTP, and responses are sent in JSON format.
Database Overview
The database “design document” (which serves as the equivalent of a database schema from a relational database) is available (in JSON format) via the call https://app1.bioinformatics.mdanderson.org/genesmash/_design/basic
Information on individual genes
The primary key (in CouchDB terms, the _id) in geneSmash is the NCBI Entrez Gene database identifier. For example, suppose we are interested in the tumor suppressor gene p53, whose Entrez Gene id is 7157. To get all of the geneSmash information about p53, you would make an HTTP call to the URL: http://app1.bioinformatics.mdanderson.org/genesmash/7157
In order to get the data on a different gene whose Entrez Gene id is known, just replace 7157 in the URL by the id of the gene of interest.
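
The id-to-URL rule above can be sketched in R as follows. The helper names `geneDocUrl` and `fetchGene` are hypothetical, and the use of the `jsonlite` package for parsing is an assumption made for illustration only.

```r
# Base URL of the geneSmash database, as given in the text above.
baseUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash"

# Build the document URL for one Entrez Gene id.
# Numeric ids are formatted without scientific notation so that, for
# example, 100000 does not become "1e+05".
geneDocUrl <- function(entrezId) {
  if (is.numeric(entrezId)) {
    entrezId <- format(entrezId, scientific = FALSE, trim = TRUE)
  }
  paste(baseUrl, entrezId, sep = "/")
}

# Fetch and parse one gene document (requires network access).
# ASSUMPTION: jsonlite is used here for illustration only.
fetchGene <- function(entrezId) {
  jsonlite::fromJSON(geneDocUrl(entrezId), simplifyVector = FALSE)
}

print(geneDocUrl(7157))
```

Calling `fetchGene(7157)` should then return the p53 document as a nested R list, provided the server is reachable.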
Queries based on other identifiers
Of course, in many circumstances, you do not know the Entrez Gene id but have some other way to refer to the gene. One common example occurs when you know the official HUGO symbol for a gene. We have designed the geneSmash database to allow queries based on some of these other identifiers.
Queries in CouchDB are implemented by defining views in the design document. You can use the call above to get a copy of the design document and see the complete list of views that have been defined. The view by_symbol allows you to query based on the HUGO symbol. So, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol?key="EGFR"&include_docs=true will get all of the geneSmash information on the gene EGFR. Other views defined at present include by_alias, by_cytoband, by_ensembl, by_unigene, and by_probe2; the table in the Views section lists them with example calls.
Information on all genes
If you omit the key parameter when you invoke a CouchDB view, then the response contains information on all the documents that are relevant to the view. For example, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol returns geneSmash information on all genes, sorted by the HUGO symbol.
Now, you might be hesitant to follow the previous link, but I encourage you to go ahead. For all permanent views, CouchDB pre-computes the responses. So even querying for all genes is very fast, since most of the time the server already knows the answer and just has to transmit the bytes over the network.
Because the interface to geneSmash (like the default interface for any CouchDB application) only uses HTTP and JSON, it can be integrated directly into all modern programming languages without imposing the overhead of a new specialized programming library. For instance, the following code examples show how to use geneSmash in the R statistical programming environment.
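
The keyed and unkeyed calls described above differ only in their query string. A minimal sketch of a URL builder in R follows; the helper name `viewUrl` is hypothetical, `limit` is the standard CouchDB parameter for capping the number of rows, and the helper assumes the key is a simple string with no embedded double quotes.

```r
viewBase <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view"

# Build a view URL. 'key' is a plain R string; it is wrapped in double quotes
# (CouchDB keys are JSON values) and then percent-encoded.
viewUrl <- function(view, key = NULL, limit = NULL, include_docs = FALSE) {
  params <- character(0)
  if (!is.null(key)) {
    jsonKey <- paste0('"', key, '"')
    params <- c(params,
                paste0("key=", utils::URLencode(jsonKey, reserved = TRUE)))
  }
  if (!is.null(limit)) {
    params <- c(params, paste0("limit=", as.integer(limit)))
  }
  if (include_docs) {
    params <- c(params, "include_docs=true")
  }
  url <- paste(viewBase, view, sep = "/")
  if (length(params) > 0) {
    url <- paste0(url, "?", paste(params, collapse = "&"))
  }
  url
}

print(viewUrl("by_symbol", key = "EGFR", include_docs = TRUE))  # one gene
print(viewUrl("by_symbol", limit = 10))                         # first 10 rows
print(viewUrl("by_symbol"))                                     # every gene
```

Adding `limit` is a cautious way to inspect a view's output before requesting every row.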
To get the genomic coordinates of a gene based on its HUGO symbol:

    library(rjson.krc)  # provides fromJSON()

    # Return the genomic coordinates of a gene, given its HUGO symbol.
    getGeneLocations <- function(sym) {
      giUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol"
      # CouchDB keys are JSON values, so the symbol is wrapped in double quotes
      query <- paste(giUrl, "?key=\"", sym, "\"&include_docs=true", sep = '')
      response <- fromJSON(paste(readLines(query, warn = FALSE), collapse = ''))
      # The Maps field of the first matching document holds the location entries
      maps <- response[["rows"]][[1]]$doc$Maps
      data.frame(Build = unlist(lapply(maps, function(x) x$NCBI)),
                 Chromosome = unlist(lapply(maps, function(x) x$Chromosome)),
                 TranscriptionStart = unlist(lapply(maps, function(x) x$TranscriptionStart)),
                 TranscriptionEnd = unlist(lapply(maps, function(x) x$TranscriptionEnd)))
    }

    getGeneLocations("TP53")

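Some notes on the code: the view returns a JSON object whose rows element lists the matching documents. The function reads the Maps field of the first matching document and returns one row for each entry in it, reporting the genome build (NCBI), the chromosome, and the transcription start and end.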
To get the information of a gene based on the microarray probe identifier:

    library(rjson.krc)  # provides fromJSON()

    # Look up the gene associated with a microarray probe.
    # Returns a one-row data frame, or NA if the probe is not in geneSmash.
    getProbeInfo <- function(Manufacturer, ProbeID) {
      giUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_probe2"
      # The startkey/endkey range matches every key that begins with
      # [Manufacturer, ProbeID], whatever follows those two elements.
      link <- paste(giUrl, "?startkey=[\"", Manufacturer, "\",\"", ProbeID,
                    "\"]&endkey=[\"", Manufacturer, "\",\"", ProbeID,
                    "\",\"\\u9999\"]&include_docs=true", sep = '')
      gene_data <- fromJSON(paste(readLines(link, warn = FALSE), collapse = ''))
      # First element of an optional field, or NA when the field is absent
      firstOrNA <- function(x) if (is.null(unlist(x))) NA else unlist(x)[[1]]
      geneInfo <- NA
      # && (not &) so the length test only runs when gene_data is a list
      if (is.list(gene_data) && length(gene_data$rows) > 0) {
        doc <- gene_data[["rows"]][[1]]$doc
        geneInfo <- data.frame(row.names = ProbeID,
                               EntrezGeneID = unlist(doc['_id']),
                               GenbankID = firstOrNA(doc['GenBank']),
                               UnigeneID = firstOrNA(doc['UniGene']),
                               Symbol = unlist(doc['Symbol']),
                               Description = firstOrNA(doc['Description']),
                               Chromosome = firstOrNA(doc['Chromosome']))
      }
      geneInfo
    }

    getProbeInfo("Affymetrix", "205241_at")
    getProbeInfo("Agilent", "A_23_P142045")

Note: Only microarray probes associated with an NCBI Entrez Gene record are included in the geneSmash database.
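
The startkey/endkey pair used in the probe lookup relies on how CouchDB orders array keys: a shorter array sorts before any longer array that begins with the same elements, so a two-element prefix is a lower bound, and the same prefix followed by a high-sorting string is an upper bound. The sketch below isolates that step. The helper name `probeKeyRange` is hypothetical, and the sketch assumes by_probe2 keys have the form [manufacturer, probe, platform] with platform names that sort before the sentinel string.

```r
# Build the startkey/endkey pair that matches every by_probe2 key beginning
# with [manufacturer, probeId], whatever the third element is.
# 'high' is the JSON escape for a character assumed to sort after any
# platform name; it is kept as the literal six characters \u9999.
probeKeyRange <- function(manufacturer, probeId, high = "\\u9999") {
  prefix <- sprintf('["%s","%s"', manufacturer, probeId)
  list(
    startkey = paste0(prefix, "]"),
    endkey   = paste0(prefix, ',"', high, '"]')
  )
}

rng <- probeKeyRange("Affymetrix", "205241_at")
cat(rng$startkey, "\n")
cat(rng$endkey, "\n")
```

CouchDB documentation commonly uses an empty object `{}` as the upper sentinel instead, because objects sort after every string.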
Input and Output
All calls to read data or query results from the geneSmash web service are made using RESTful HTTP calls. All results are returned in JavaScript Object Notation (JSON).
geneSmash follows the usual CouchDB conventions. Each object has a unique identifier, which in this case is the Entrez Gene ID. For example, the Entrez Gene ID of the p53 gene is "7157". To get the JSON representation of the CouchDB document for the p53 gene, send an HTTP GET request to the following URI:
http://app1.bioinformatics.mdanderson.org/genesmash/7157
Views
CouchDB (and thus geneSmash) queries are also known as views. The available views define the main API. In the current version, all views are contained in the “basic” design document. You can get a copy of the design document by sending an HTTP GET request to the URI:
http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic
To use the API, each of the calls described below should be preceded by
http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/
The example calls below use only a few query parameters. These, and all other available parameters, are defined by the general CouchDB view interface.
HTTP Call | JSON Value | Result |
---|---|---|
all | {"total_rows": …, "offset": …, "rows": [{"id": "…", "key": "…", "value": { … }}]} | Fetch all documents from the database. |
all?limit=10 | Same as above | Fetch the first 10 documents from the database. |
by_symbol | Same as above | Fetch all genes sorted by HUGO symbol. |
by_symbol?key="TP53" | Same as above | Fetch the document for the gene TP53. |
by_alias | Same as above | For all known aliases or synonyms, fetch the corresponding genes. |
by_alias?key="AR" | Same as above | Fetch all genes with "AR" as a synonym. |
by_cytoband?key="17p13.1" | Same as above | Fetch all genes mapped to the given cytoband. |
by_ensembl?key="ENSG00000012048" | Same as above | Fetch the gene with the given Ensembl identifier. |
by_unigene?key="Hs.654481" | Same as above | Fetch the gene with the given UniGene cluster ID. |
by_probe2?key=["Affymetrix","205241_at","HG-U133A"] | Same as above | Fetch the gene with the given microarray probe identifier. |
by_mir | | |
by_location | | |
gene_location | | |
maxlength | | |
minlength | | |
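
Because the views listed with a JSON value share the same response envelope (total_rows, offset, rows), a single helper can summarize the result of any of them. The sketch below works on the nested list that a JSON parser returns. The helper name `viewRows` is hypothetical, and the sample object is an offline stand-in whose total_rows and offset values are placeholders rather than real counts from the server.

```r
# Collect the 'id' and 'key' of every row in a parsed view response.
# Array keys (as in by_probe2) are joined with commas into one string.
viewRows <- function(resp) {
  rows <- resp[["rows"]]
  if (length(rows) == 0) {
    return(data.frame(id = character(0), key = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(
    id  = vapply(rows, function(r) as.character(r[["id"]]), character(1)),
    key = vapply(rows, function(r) paste(unlist(r[["key"]]), collapse = ","),
                 character(1)),
    stringsAsFactors = FALSE
  )
}

# Offline stand-in for a parsed response to by_symbol?key="TP53".
# total_rows and offset are placeholder values.
sampleResp <- list(total_rows = 1, offset = 0,
                   rows = list(list(id = "7157", key = "TP53", value = list())))
print(viewRows(sampleResp))
```

The same helper applies to a live response once it has been parsed into a nested list without simplification.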
General documentation can be found at the geneSmash Documentation page.