Department of Bioinformatics and Computational Biology

geneSmash

Overview
Description: geneSmash is a mash-up of various sources of information about human genes, including the NCBI Entrez Gene FTP site, the UCSC Genome Browser, miRBase, and human gene expression array annotation extracted from manufacturers’ websites.
Development Information
URL: https://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/index.html
Language: JavaScript (ETL scripts in Perl)
Current version: 1.1
Platforms: Platform independent
License: Not required
Status: Active
Last updated: July 1, 2012
References
Citation: G, Payton M, Roth J, Abruzzo L, Coombes K. (2012). Relax with CouchDB - Into the non-relational DBMS era of bioinformatics. Genomics 100(1), 1-7. http://doi.org/10.1016/j.ygeno.2012.05.006
Help and Support
Contact: MDACC-Bioinfo-IT-Admin@mdanderson.org

geneSmash

geneSmash is a mash-up of various sources of information about human genes. The primary sources at the time of this writing are:

  1. The gene_info file from the NCBI Entrez gene FTP site.
  2. The gene2unigene file from the NCBI Entrez gene FTP site.
  3. The refFlat.txt file from the UCSC Genome Browser.
  4. The hsa.gff file from miRBase.
  5. Human gene expression array annotation extracted from the manufacturers’ (Affymetrix, Agilent and Illumina) websites.

    Probe annotation for various human gene expression array platforms from these manufacturers is currently available in geneSmash.

Other sources may be incorporated in the future. These sources of information have been combined into a simple CouchDB database. As a consequence, we can build tools that make it possible to find the genomic location of a gene from its symbol, or to map easily between other classes of gene identifiers.

Web Site

The geneSmash web site provides one set of search tools built upon this infrastructure. You can enter the official gene symbol to get back the genome location, along with links out to the source databases at Entrez Gene or at the UCSC Genome Browser. Alternatively, you can search for genes by alias, by gene expression probe in a microarray, by cytoband location, or by giving a range of base positions in the human genome. You can also write your own programs (see below) or add your own web applications on top of a local copy of the database.

Mirrors and Replication

CouchDB provides native support for database replication. You can use those facilities to make (and maintain) a local copy of the entire geneSmash database. Because replication copies items at the granularity of an individual document (which in this instance means the collection of information about one gene), it is much gentler on network resources than copying the entire source files from NCBI or UCSC. This advantage becomes particularly pronounced during maintenance, since a second replication will only copy the documents that have changed since the last time you replicated.
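
As a rough sketch of how such a pull replication might be requested from Python: CouchDB accepts a JSON body naming a source and a target at its `_replicate` endpoint. The local server address `http://localhost:5984`, the target database name, and the helper names below are assumptions made for illustration. A real server will usually also require administrator credentials, which are omitted here.

```python
import json
import urllib.request

# Public geneSmash database, as given in this document.
SOURCE = "http://app1.bioinformatics.mdanderson.org/genesmash"
# Hypothetical local CouchDB server and target database.
LOCAL_SERVER = "http://localhost:5984"
TARGET = LOCAL_SERVER + "/genesmash"

def replication_request(source, target, create_target=True):
    """Build the JSON body for a POST to CouchDB's /_replicate endpoint."""
    return {"source": source, "target": target, "create_target": create_target}

def replicate(local_server, body):
    """Send the request to a local CouchDB server (needs a running server)."""
    req = urllib.request.Request(
        local_server.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

body = replication_request(SOURCE, TARGET)
print(json.dumps(body, indent=2))
# replicate(LOCAL_SERVER, body)  # uncomment once a local server is available
```

Running the same request again later should transfer only the documents that changed in the meantime, which is the maintenance advantage described above.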

If you replicate the database, we request that you maintain links to the geneSmash logo and to the University of Texas MD Anderson Cancer Center.

Programming Interface

Because geneSmash is implemented using CouchDB, all of the data is available through a RESTful interface. Calls are made to the server using standard HTTP, and responses are sent in JSON format.

Database Overview

The database “design document” (which serves as the equivalent of a database schema from a relational database) is available (in JSON format) via the call https://app1.bioinformatics.mdanderson.org/genesmash/_design/basic

Information on individual genes

The primary key (in CouchDB terms, the _id) in geneSmash is the NCBI Entrez Gene database identifier. For example, suppose we are interested in the tumor suppressor gene p53, whose Entrez gene id is 7157. In order to get all of the geneSmash information about p53, you would make an HTTP call to the URL: http://app1.bioinformatics.mdanderson.org/genesmash/7157.

In order to get the data on a different gene whose Entrez Gene id is known, just replace 7157 in the URL by the id of the gene of interest.
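
A minimal Python sketch of this lookup, using only the standard library, might look as follows. The function names `gene_url` and `fetch_gene` are hypothetical, and `fetch_gene` needs network access to the geneSmash server.

```python
import json
import urllib.request

# Database URL as given in this document.
BASE = "http://app1.bioinformatics.mdanderson.org/genesmash"

def gene_url(entrez_id):
    """Return the document URL for an Entrez Gene id (the CouchDB _id)."""
    return "{}/{}".format(BASE, int(entrez_id))

def fetch_gene(entrez_id):
    """GET the document and decode its JSON body (requires network access)."""
    with urllib.request.urlopen(gene_url(entrez_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))

print(gene_url(7157))
# doc = fetch_gene(7157)  # uncomment to retrieve the p53 document
```

Converting the id with `int()` is a small safeguard so that only numeric identifiers end up in the URL.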

Queries based on other identifiers

Of course, in many circumstances, you do not know the Entrez gene id but have some other way to refer to the gene. One common example occurs when you know the official HUGO symbol for a gene. We have designed the geneSmash database to allow queries based on some of these other identifiers.

Queries in CouchDB are implemented by defining views in the design document. You can use the call above to get a copy of the design document and see the complete list of views that have been defined. The view by_symbol allows you to query based on the HUGO symbol. So, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol?key="EGFR"&include_docs=true will get all of the geneSmash information on the gene EGFR. Other views defined at present include by_alias, by_cytoband, by_ensembl, by_unigene, by_probe2, by_mir and by_location, which are listed in the API documentation.
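
When building such a request in a program, the key has to be sent as a JSON value, which is why the symbol appears in double quotes. The following Python sketch shows one way to assemble the URL with proper percent-encoding. The helper name `view_url` is an assumption, not part of geneSmash.

```python
import json
from urllib.parse import quote

# View prefix as given in this document.
VIEW_BASE = "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/"

def view_url(view, key=None, include_docs=False):
    """Build the URL for a view call, JSON-encoding the key if one is given."""
    params = []
    if key is not None:
        # json.dumps adds the double quotes around a string key;
        # quote() percent-encodes them so the URL is valid.
        params.append("key=" + quote(json.dumps(key), safe=""))
    if include_docs:
        params.append("include_docs=true")
    return VIEW_BASE + view + ("?" + "&".join(params) if params else "")

print(view_url("by_symbol", "EGFR", include_docs=True))
```

The encoded form `%22EGFR%22` is equivalent to the quoted `"EGFR"` shown in the request above.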

Information on all genes

If you omit the key parameter when you invoke a CouchDB view, then the response contains information on all the documents that are relevant to the view. For example, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol returns geneSmash information on all genes, sorted by the HUGO symbol.

Now, you might be hesitant to follow the previous link, but I encourage you to go ahead. For all permanent views, CouchDB precomputes the responses. So even querying for all genes is very fast, since most of the time the server already knows the answer and just has to transmit the bytes over the network.
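
Even though the full response is fast to produce, a client may prefer to read it in smaller pieces. One possible approach, sketched below in Python, uses the standard CouchDB view parameters `limit` and `skip`. The page size and the row count of 25 are made-up values for illustration; in practice the count would come from the `total_rows` field of a view response.

```python
from urllib.parse import urlencode

# by_symbol view URL as given in this document.
VIEW = "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol"

def page_urls(total_rows, page_size):
    """Yield one URL per page, using CouchDB's limit and skip parameters."""
    for skip in range(0, total_rows, page_size):
        yield VIEW + "?" + urlencode({"limit": page_size, "skip": skip})

# Illustrative numbers only: 25 rows read 10 at a time gives three pages.
for url in page_urls(total_rows=25, page_size=10):
    print(url)
```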

Using geneSmash in other programs

Because the interface to geneSmash (like the default interface for any CouchDB application) only uses HTTP and JSON, it can be integrated directly into all modern programming languages without imposing the overhead of a new specialized programming library. For instance, the following code examples show how to use geneSmash in the R statistical programming environment.

To get the genomic coordinates of a gene based on its HUGO symbol:

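library(rjson.krc)  # supplies fromJSON()
getGeneLocations <- function(sym) {
  viewUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol"
  # The key must be a JSON string, hence the escaped double quotes around the symbol.
  queryUrl <- paste(viewUrl, "?key=\"", sym, "\"&include_docs=true", sep='')
  # Read the whole response and parse it as JSON.
  response <- fromJSON(paste(readLines(queryUrl), collapse=''))
  # No matching gene: return NULL instead of failing on an empty row list.
  if (length(response[["rows"]]) == 0) return(NULL)
  # Use the Maps entries of the first matching document.
  maps <- response[["rows"]][[1]]$doc$Maps
  data.frame(Build=unlist(lapply(maps, function(x) x$NCBI)),
             Chromosome=unlist(lapply(maps, function(x) x$Chromosome)),
             TranscriptionStart=unlist(lapply(maps, function(x) x$TranscriptionStart)),
             TranscriptionEnd=unlist(lapply(maps, function(x) x$TranscriptionEnd)))
}
getGeneLocations("TP53")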

Some notes on the code:

To get information about a gene from a microarray probe identifier:

library(rjson.krc)  # supplies fromJSON()
getProbeInfo <- function(Manufacturer, ProbeID) {
  viewUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_probe2"
  # Range query: startkey/endkey match every key that begins with [Manufacturer, ProbeID].
  link <- paste(viewUrl, "?startkey=[\"", Manufacturer, "\",\"", ProbeID, "\"]&endkey=[\"", Manufacturer, "\",\"", ProbeID, "\",\"\\u9999\"]&include_docs=true", sep='')
  JSON_data <- paste(readLines(link), collapse='')
  gene_data <- fromJSON(JSON_data)
  geneInfo <- NA  # returned as-is when the probe is not found
  if (is.list(gene_data) && (length(gene_data$rows) > 0)) {
    doc <- gene_data[["rows"]][[1]]$doc  # document of the first matching row
    geneID <- unlist(doc['_id'])
    sym <- unlist(doc['Symbol'])
    # Optional fields become NA when they are absent from the document.
    genbankID <- ifelse(is.null(unlist(doc['GenBank'])), NA, unlist(doc['GenBank']))
    unigeneID <- ifelse(is.null(unlist(doc['UniGene'])), NA, unlist(doc['UniGene']))
    desc <- ifelse(is.null(unlist(doc['Description'])), NA, unlist(doc['Description']))
    chr <- ifelse(is.null(unlist(doc['Chromosome'])), NA, unlist(doc['Chromosome']))
    geneInfo <- data.frame(row.names = ProbeID, EntrezGeneID = geneID, GenbankID = genbankID, UnigeneID = unigeneID,
            Symbol = sym, Description = desc, Chromosome = chr)
  }
  geneInfo
}
getProbeInfo("Affymetrix", "205241_at")
getProbeInfo("Agilent", "A_23_P142045")

Note:

Only microarray probes associated with an NCBI Entrez Gene record are included in the geneSmash database.
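
The probe lookup above relies on a range query over array keys. Judging from the API table, keys in by_probe2 have the form [manufacturer, probe ID, platform], so a start key of [manufacturer, probe ID] and an end key that appends a high-sorting string should bracket every platform for that probe. The Python sketch below builds the same request. The helper name `probe_range_url` is hypothetical, and the use of "\u9999" as the upper sentinel simply follows the R example.

```python
import json
from urllib.parse import quote

# by_probe2 view URL as given in this document.
VIEW = "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_probe2"

def encode_key(key):
    """Serialize a key as compact JSON and percent-encode it for a URL."""
    return quote(json.dumps(key, separators=(",", ":")), safe="")

def probe_range_url(manufacturer, probe_id):
    """Build a startkey/endkey query matching all keys with this prefix."""
    start = [manufacturer, probe_id]
    # A longer array sorts after its prefix in CouchDB, and "\u9999" is
    # meant to sort after any ordinary platform name, closing the range.
    end = [manufacturer, probe_id, "\u9999"]
    return (VIEW + "?startkey=" + encode_key(start)
            + "&endkey=" + encode_key(end) + "&include_docs=true")

print(probe_range_url("Affymetrix", "205241_at"))
```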

geneSmash API Documentation

Input and Output

All calls to read data or query results from the geneSmash web service are made using RESTful HTTP calls. All results are returned in JavaScript Object Notation (JSON).

geneSmash follows the usual CouchDB conventions. Each object has a unique identifier, which in this case is given by the Entrez gene ID. For example, the Entrez gene ID of the p53 gene happens to be “7157”. To get the JSON representation of the CouchDB document for the p53 gene, send an HTTP GET request to the following URI:

http://app1.bioinformatics.mdanderson.org/genesmash/7157

Views

CouchDB (and thus geneSmash) queries are also known as views. The available views define the main API. In the current version, all views are contained in the “basic” design document. You can get a copy of the design document by sending an HTTP GET request to the URI:

http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic

To use the API, each of the calls described below should be preceded by

http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/

The calls below include examples with query parameters; these parameters are not specific to geneSmash but are defined by the general CouchDB interface.

HTTP Call | JSON Value | Result
all | {"total_rows": …, "offset": …, "rows": [{"id": "…", "key": "…", "value": { … }}]} | Fetch all documents from the database.
all?limit=10 | Same as above | Fetch the first 10 documents from the database.
by_symbol | | Fetch all genes sorted by HUGO symbol.
by_symbol?key="TP53" | | Fetch the document for the gene TP53.
by_alias | | For all known aliases or synonyms, fetch the corresponding genes.
by_alias?key="AR" | | Fetch all genes with "AR" as a synonym.
by_cytoband?key="17p13.1" | | Fetch all genes mapped to the given cytoband.
by_ensembl?key="ENSG00000012048" | | Fetch the gene with the given Ensembl identifier.
by_unigene?key="Hs.654481" | | Fetch the gene with the given UniGene cluster ID.
by_probe2?key=["Affymetrix","205241_at","HG-U133A"] | | Fetch the gene with the given microarray probe identifier.
by_mir | |
by_location | |
gene_location | |
maxlength | |
minlength | |
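
To show how the JSON value returned by these calls can be consumed, here is a small Python sketch that groups document ids by view key. The sample text only mimics the documented response shape ("total_rows", "offset", "rows" with "id", "key", "value"). Its "value" is a placeholder, since what each view actually emits is not described here, and the function name `ids_by_key` is hypothetical.

```python
import json

# Illustrative response in the documented shape; not actual server output.
sample = '{"total_rows": 1, "offset": 0, "rows": [{"id": "7157", "key": "TP53", "value": null}]}'

def ids_by_key(response_text):
    """Map each view key to the list of document ids returned under it."""
    result = {}
    for row in json.loads(response_text)["rows"]:
        key = row["key"]
        # Array keys (as in by_probe2) are lists, which cannot be dict keys.
        if isinstance(key, list):
            key = tuple(key)
        result.setdefault(key, []).append(row["id"])
    return result

print(ids_by_key(sample))
```

Because the id of each row is the Entrez gene ID, the result can be used directly to fetch the full gene documents.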

Support

General documentation can be found at the geneSmash Documentation page.