Department of Bioinformatics and Computational Biology

GeneSmash:Overview

From MD Anderson Bioinformatics

geneSmash

Overview
  Description: geneSmash is a mash-up of various sources of information about human genes, including the NCBI Entrez Gene FTP site, the UCSC Genome Browser, miRBase, and human gene expression array annotation extracted from manufacturers' websites.
  URL: http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/index.html
Development Information
  Language: JavaScript (ETL scripts in Perl)
  Current Version: 1.1
  License: Not required
  Status: Active
  Last Updated: 2012-07-01
References
  Citations: Manyam G, Payton M, Roth J, Abruzzo L, Coombes K. (2012). Relax with CouchDB - Into the non-relational DBMS era of bioinformatics. Genomics 100(1), 1-7. doi:10.1016/j.ygeno.2012.05.006
Help and Support
  Contact: Kevin R. Coombes


geneSmash is a mash-up of various sources of information about human genes. The primary sources at the time of this writing are:

  1. The gene_info file from the NCBI Entrez gene FTP site.
  2. The gene2unigene file from the NCBI Entrez gene FTP site.
  3. The refFlat.txt file from the UCSC Genome Browser.
  4. The hsa.gff file from miRBase.
  5. Human gene expression array annotation extracted from the manufacturers' (Affymetrix, Agilent, and Illumina) websites. Probe annotation for various human gene expression array platforms from these manufacturers is currently available in geneSmash.

Other sources may be incorporated in the future. These sources of information have been combined into a simple CouchDB database. As a consequence, we can build tools that make it possible to find the genomic location of a gene from its symbol, or to map easily between other classes of gene identifiers.



Web Site

The geneSmash web site provides one set of search tools built upon this infrastructure. You can enter the official gene symbol to get back the genome location, along with links out to the source databases at Entrez Gene or at the UCSC Genome Browser. Alternatively, you can search for genes by alias, by gene expression probe in a microarray, by cytoband location, or by giving a range of base positions in the human genome. You can also write your own programs (see below) or add your own web applications on top of a local copy of the database.

Mirrors and Replication

CouchDB provides native support for database replication. You can use those facilities to make (and maintain) a local copy of the entire geneSmash database. Because replication copies items at the granularity of an individual document (which in this instance means the collection of information about one gene), it is much gentler on network resources than copying the entire source files from the NCBI or UCSC. This advantage becomes particularly pronounced during maintenance, since a second replication will only copy the documents that have changed since the last time you replicated.

If you replicate the database, we request that you maintain links to the geneSmash logo and to the University of Texas MD Anderson Cancer Center.
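
As a rough illustration of the replication described above, the following Python sketch builds the HTTP request that asks a local CouchDB server to pull a copy of geneSmash. It uses CouchDB's standard _replicate endpoint. The local server address (http://localhost:5984) and the local database name (genesmash) are assumptions made for this example, and a recent CouchDB installation will normally also require administrator credentials, which are omitted here.

```python
import json
import urllib.request

# Source database URL, taken from the geneSmash URLs used on this page.
SOURCE_DB = "http://app1.bioinformatics.mdanderson.org/genesmash"


def build_replication_request(local_server="http://localhost:5984",
                              target_db="genesmash"):
    """Build (but do not send) a POST to the local server's _replicate endpoint.

    local_server and target_db are assumed values for this sketch.
    """
    body = {
        "source": SOURCE_DB,
        "target": local_server + "/" + target_db,
        "create_target": True,  # create the local database if it is missing
    }
    return urllib.request.Request(
        url=local_server + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_replication_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually start the replication: urllib.request.urlopen(request)
```

Sending the same request again later should transfer only the documents that changed since the previous run, which is the maintenance advantage mentioned above.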

Programming Interface

Because geneSmash is implemented using CouchDB, all of the data is available through a RESTful interface. Calls are made to the server using standard HTTP, and responses are sent in JSON format.

Database Overview

The database "design document" (which serves as the equivalent of a database schema from a relational database) is available (in JSON format) via the call http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic

Information on individual genes

The primary key (in CouchDB terms, the _id) in geneSmash is the NCBI Entrez Gene database identifier. For example, suppose we are interested in the tumor suppressor gene p53, whose Entrez gene id is 7157. In order to get all of the geneSmash information about p53, you would make an HTTP call to the URL: http://app1.bioinformatics.mdanderson.org/genesmash/7157.

In order to get the data on a different gene whose Entrez Gene id is known, just replace 7157 in the URL by the id of the gene of interest.
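
The lookup by Entrez Gene id can be sketched in a few lines of Python using only the standard library. The function names gene_url and fetch_gene are hypothetical helpers introduced for this example; only the base URL comes from this page.

```python
import json
import urllib.request

BASE_URL = "http://app1.bioinformatics.mdanderson.org/genesmash"


def gene_url(entrez_id):
    """Return the document URL for an Entrez Gene id (the CouchDB _id)."""
    # int() rejects anything that is not a plain integer id.
    return "{}/{}".format(BASE_URL, int(entrez_id))


def fetch_gene(entrez_id, timeout=30):
    """GET the gene document and decode the JSON body into a dict."""
    with urllib.request.urlopen(gene_url(entrez_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(gene_url(7157))
# fetch_gene(7157) would perform the actual request (needs network access).
```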

Queries based on other identifiers

Of course, in many circumstances, you do not know the Entrez Gene id but have some other way to refer to the gene. One common example occurs when you know the official HUGO symbol for a gene. We have designed the geneSmash database to allow queries based on some of these other identifiers.

Queries in CouchDB are implemented by defining views in the design document. You can use the call above to get a copy of the design document and see the complete list of views that have been defined. The view by_symbol allows you to query based on the HUGO symbol. So, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol?key="EGFR"&include_docs=true will get all of the geneSmash information on the gene EGFR. Note that the key is a JSON string, so the double quotes are part of the request. The views defined at present include:

  • by_alias
  • by_cytoband
  • by_ensembl
  • by_location
  • by_mir
  • by_symbol
  • by_unigene
  • by_probe
  • gene_location
  • all
  • maxlength
  • minlength
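
Because the key parameter must be valid JSON (a quoted string for the symbol views), it is easy to get the quoting wrong when assembling URLs by hand. The Python sketch below shows one way to build such a request URL; view_url is a hypothetical helper, and the view names are the ones listed above.

```python
import json
from urllib.parse import urlencode

VIEW_BASE = "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/"


def view_url(view, key=None, include_docs=False):
    """Build the URL for a view query, JSON-encoding and URL-escaping the key."""
    params = {}
    if key is not None:
        # json.dumps adds the double quotes that CouchDB expects around strings.
        params["key"] = json.dumps(key, separators=(",", ":"))
    if include_docs:
        params["include_docs"] = "true"
    query = urlencode(params)
    return VIEW_BASE + view + ("?" + query if query else "")


print(view_url("by_symbol", key="EGFR", include_docs=True))
print(view_url("by_cytoband", key="17p13.1"))
```

The double quotes appear as %22 in the resulting URL; this is the percent-encoded form of the same request shown above.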

Information on all genes

If you omit the key parameter when you invoke a CouchDB view, then the response contains information on all the documents that are relevant to the view. For example, the HTTP request http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol returns geneSmash information on all genes, sorted by the HUGO symbol.

Now, you might be hesitant to follow the previous link, but I encourage you to go ahead. For all permanent views, CouchDB precomputes the responses. So even querying for all genes is very fast, since most of the time the server already knows the answer and just has to transmit the bytes over the network.


Using geneSmash in other programs

Because the interface to geneSmash (like the default interface for any CouchDB application) only uses HTTP and JSON, it can be integrated directly into all modern programming languages without imposing the overhead of a new specialized programming library. For instance, the following code examples show how to use geneSmash in the R statistical programming environment.

To get the genomic coordinates of a gene based on its HUGO symbol:

    library(rjson.krc)  # patched rjson; see the notes below for alternatives
    getGeneLocations <- function(sym) {
      giUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_symbol"
      # The key must be a JSON string, hence the embedded double quotes
      queryUrl <- paste(giUrl, "?key=\"", sym, "\"&include_docs=true", sep='')
      # Make the HTTP request and collapse the response into one JSON string
      response <- paste(readLines(queryUrl), collapse='')
      result <- fromJSON(response)
      # Mapping records of the first matching gene document
      maps <- result[["rows"]][[1]]$doc$Maps
      data.frame(Build=unlist(lapply(maps, function(x) x$NCBI)),
                 Chromosome=unlist(lapply(maps, function(x) x$Chromosome)),
                 TranscriptionStart=unlist(lapply(maps, function(x) x$TranscriptionStart)),
                 TranscriptionEnd=unlist(lapply(maps, function(x) x$TranscriptionEnd)))
    }
    getGeneLocations("TP53")

Some notes on the code:

  • The first line loads an R package that converts between JSON objects and R objects. The version of the rjson package currently at CRAN has some limitations that make it work poorly in the current context. We have patched the package, and you can get a copy from our R repository at http://bioinformatics.mdanderson.org/OOMPA. Note that you will need to supply this repository name to the install.packages function. Update: beginning with version 0.7, the RJSONIO package that Duncan Temple Lang maintains at Omegahat can be used instead of rjson.krc. That implementation is recommended, especially since it is much faster.
  • The first two statements of the getGeneLocations function use the symbol argument to construct an appropriate URL. The call to paste(readLines(...)) actually makes the HTTP request to the geneSmash server. The next statement converts the JSON response into an R object, and the final statements extract the relevant part of the response.
  • If you actually run the code, you may at first be surprised to get back an entire data frame instead of a single response. However, there are two reasons why the answer is not unique. First, we have loaded the mapping data for several different builds of the genome into geneSmash, and you get answers for every build. Second, many genes have alternative splice forms; each one has a slightly different transcription start and end (even within a single build of the genome). If you explore the "Maps" element, you will discover that it contains the start and end positions of all of the exons for every known alternative splice form in multiple builds of the genome.
  • This version of the code does not perform error checking on the result, so it should not be used in production code as is. Failure can occur because the server is not available, because the symbol passed as an argument is not a valid HUGO symbol, or because no mapping location is known; all three conditions should be checked.
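
The three failure conditions named in the last note can be checked explicitly. The sketch below does so in Python rather than R, using only the standard library. It assumes the response layout implied by the R code above (rows, then doc, then a list of Maps entries with NCBI, Chromosome, TranscriptionStart and TranscriptionEnd fields); GeneLookupError, parse_locations and get_gene_locations are hypothetical names, and the sample response is invented placeholder data, not real geneSmash output.

```python
import json
import urllib.request
from urllib.parse import quote

VIEW_URL = ("http://app1.bioinformatics.mdanderson.org/genesmash"
            "/_design/basic/_view/by_symbol")


class GeneLookupError(Exception):
    """Raised when a by_symbol lookup cannot produce any locations."""


def parse_locations(response_text, symbol):
    """Turn a by_symbol response into (build, chromosome, start, end) tuples."""
    try:
        result = json.loads(response_text)
    except ValueError as err:
        raise GeneLookupError("response is not valid JSON") from err
    rows = result.get("rows", []) if isinstance(result, dict) else []
    if not rows:  # an unknown symbol yields an empty result set
        raise GeneLookupError("no gene found for symbol %r" % symbol)
    maps = (rows[0].get("doc") or {}).get("Maps") or []
    if not maps:  # gene exists, but no mapping location is recorded
        raise GeneLookupError("no mapping location known for %r" % symbol)
    return [(m.get("NCBI"), m.get("Chromosome"),
             m.get("TranscriptionStart"), m.get("TranscriptionEnd"))
            for m in maps]


def get_gene_locations(symbol, timeout=30):
    """Query the server and report an unreachable server as GeneLookupError."""
    url = VIEW_URL + "?key=" + quote(json.dumps(symbol)) + "&include_docs=true"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        raise GeneLookupError("server not available: %s" % err) from err
    return parse_locations(text, symbol)


# Invented placeholder response, used to exercise the parser without a network.
sample = json.dumps({"rows": [{"doc": {"Maps": [
    {"NCBI": "build-A", "Chromosome": "chrN",
     "TranscriptionStart": 100, "TranscriptionEnd": 200}]}}]})
print(parse_locations(sample, "EXAMPLE"))
```

Separating the parsing step from the network call keeps the checks for an invalid symbol and a missing location testable without access to the server.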


To get information about a gene based on a microarray probe identifier:

    library(rjson.krc)
    getProbeInfo <- function(Manufacturer, ProbeID) {
      giUrl <- "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/by_probe2"
      # Range query: match every view key that starts with [Manufacturer, ProbeID]
      link <- paste(giUrl, "?startkey=[\"", Manufacturer, "\",\"", ProbeID, "\"]&endkey=[\"", Manufacturer, "\",\"", ProbeID, "\",\"\\u9999\"]&include_docs=true", sep='')
      JSON_data <- paste(readLines(link), collapse='')
      gene_data <- fromJSON(JSON_data)
      geneInfo <- NA
      # Use the first matching row; geneInfo stays NA if the probe is not found
      if (is.list(gene_data) && (length(gene_data$rows) > 0)) {
        doc <- gene_data[["rows"]][[1]]$doc
        geneID <- unlist(doc['_id'])
        sym <- unlist(doc['Symbol'])
        # Optional fields default to NA when absent from the document
        genbankID <- ifelse(is.null(unlist(doc['GenBank'])), NA, unlist(doc['GenBank']))
        unigeneID <- ifelse(is.null(unlist(doc['UniGene'])), NA, unlist(doc['UniGene']))
        desc <- ifelse(is.null(unlist(doc['Description'])), NA, unlist(doc['Description']))
        chr <- ifelse(is.null(unlist(doc['Chromosome'])), NA, unlist(doc['Chromosome']))
        geneInfo <- data.frame(row.names = ProbeID, EntrezGeneID = geneID, GenbankID = genbankID,
                               UnigeneID = unigeneID, Symbol = sym, Description = desc, Chromosome = chr)
      }
      geneInfo
    }
    getProbeInfo("Affymetrix", "205241_at")
    getProbeInfo("Agilent", "A_23_P142045")

Note:

Only microarray probes associated with an NCBI Entrez Gene record are included in the geneSmash database.
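
The probe lookup above relies on a CouchDB range query over an array-valued key. CouchDB compares arrays element by element and sorts a shorter array before any longer array that extends it, so a startkey of [manufacturer, probe] and an endkey of [manufacturer, probe, "\u9999"] bracket every key that begins with that pair, as long as the third element sorts before the character U+9999. The Python sketch below builds the same query; probe_range_url is a hypothetical helper, and the reading of the third key element as the array platform is an assumption based on the by_probe2 example key used elsewhere on this page.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

VIEW_URL = ("http://app1.bioinformatics.mdanderson.org/genesmash"
            "/_design/basic/_view/by_probe2")


def probe_range_url(manufacturer, probe_id):
    """Build a startkey/endkey query matching [manufacturer, probe_id, *]."""
    compact = (",", ":")
    params = {
        # Lower bound: the two-element prefix sorts before any longer key.
        "startkey": json.dumps([manufacturer, probe_id], separators=compact),
        # Upper bound: a high sentinel string in the third position.
        "endkey": json.dumps([manufacturer, probe_id, "\u9999"],
                             separators=compact),
        "include_docs": "true",
    }
    return VIEW_URL + "?" + urlencode(params)


url = probe_range_url("Affymetrix", "205241_at")
print(url)

# Decode the query string again to confirm that both bounds are valid JSON.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["startkey"][0]))
```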


geneSmash API Documentation

Input and Output

All calls to read data or query results from the geneSmash web service are made using RESTful HTTP calls. All results are returned in JavaScript Object Notation (JSON).

geneSmash follows the usual CouchDB conventions. Each object has a unique identifier, which in this case is given by the Entrez Gene ID. For example, the Entrez Gene ID of the p53 gene happens to be "7157". To get the JSON representation of the CouchDB document for the p53 gene, send an HTTP GET request to the following URI:

http://app1.bioinformatics.mdanderson.org/genesmash/7157

Views

CouchDB (and thus geneSmash) queries are also known as views. The available views define the main API. In the current version, all views are contained in the "basic" design document. You can get a copy of the design document by sending an HTTP GET request to the URI:

http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic

To use the API, each of the calls described below should be preceded by

http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/_view/

The examples below use only a few query parameters; every parameter is defined by the general CouchDB view interface, so the other standard parameters can be used as well.


HTTP Call | JSON Value | Result
--- | --- | ---
all | {"total_rows": ..., "offset": ..., "rows" : [{"id": "...", "key": "...", "value": { ... }}]} | Fetch all documents from the database.
all?limit=10 | Same as above | Fetch the first 10 documents from the database.
by_symbol | | Fetch all genes sorted by HUGO symbol.
by_symbol?key="TP53" | | Fetch the document for the gene TP53.
by_alias | | For all known aliases or synonyms, fetch the corresponding genes.
by_alias?key="AR" | | Fetch all genes with "AR" as a synonym.
by_cytoband?key="17p13.1" | | Fetch all genes mapped to the given cytoband.
by_ensembl?key="ENSG00000012048" | | Fetch the gene with the given Ensembl identifier.
by_unigene?key="Hs.654481" | | Fetch the gene with the given UniGene cluster ID.
by_probe2?key=["Affymetrix","205241_at","HG-U133A"] | | Fetch the gene with the given microarray probe identifier.
by_mir | |
by_location | |
gene_location | |
maxlength | |
minlength | |
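
Every view call returns the same JSON envelope (total_rows, offset, rows), so one small routine can unpack any of them. The Python sketch below is illustrative only: summarize_view is a hypothetical helper, and the sample response is hand-written placeholder data in the documented shape, imitating a call made with limit=2.

```python
import json

# Hand-written placeholder response in the documented shape; not real data.
sample = """{"total_rows": 3, "offset": 0, "rows": [
  {"id": "1", "key": "AAA", "value": {}},
  {"id": "2", "key": "BBB", "value": {}}
]}"""


def summarize_view(response_text):
    """Return (total_rows, offset, {key: [document ids]}) for a view response."""
    result = json.loads(response_text)
    ids_by_key = {}
    for row in result["rows"]:
        # "id" is the document _id, which in geneSmash is the Entrez Gene ID.
        ids_by_key.setdefault(row["key"], []).append(row["id"])
    return result["total_rows"], result["offset"], ids_by_key


total, offset, ids_by_key = summarize_view(sample)
print(total, offset, ids_by_key)
```

Note that total_rows counts all rows in the view, not only those returned, so with a limit parameter the rows list can be shorter than total_rows.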

Support

For Frequently Asked Questions, Bug Reports, and other concerns, please visit the forum at this link