Department of Bioinformatics and Computational Biology

Rocket:Overview

From MD Anderson Bioinformatics

Rocket
Overview
Description: Rocket is a set of perl scripts and modules used to confirm that the clones sequenced in our Core Laboratory match the annotations provided by the supplier.
Development Information
Language: Perl
Platforms: Linux
Status: Inactive
Last Updated: 2001-09-18



Rocket is a set of perl scripts and modules used to confirm that the clones sequenced in our Core Laboratory match the annotations provided by the supplier. (More precisely, Rocket provides a graphical interface using perl/Tk to a command line perl script.) Rocket will require some customization before you can use it. It assumes the existence of a local database containing the supplier's clone annotations; you'll have to create such a database separately and make sure the names of the fields in the database match those used by Rocket.



Packages

Rocket itself is a perl/Tk wrapper around Blaster, a perl script called blaster.pl that actually implements the algorithm described here as a command line program. Blaster, in turn, relies on a set of functions packaged in the perl module NCBI.pm.

Download

Rocket is available for download here.

System Requirements

Rocket assumes the existence of a local database containing the supplier's clone annotations; you'll have to create such a database separately and make sure the names of the fields in the database match those used by Rocket.


Installation

Rocket will require some customization before you can use it.

Customization

The primary need for customization comes from an assumption built into the NCBI.pm perl module. This module assumes that it has an ODBC connection to a database called clones containing a table called clone_data with a moderately large number of required fields. This table should hold the clone annotations provided by the supplier. You can find the list of fields by searching NCBI.pm for the text string "DBI:ODBC"; edit the code there to match whatever database you produce locally.
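One way to make this customization easier is to keep the database names in a single place. The following is only a sketch of that idea, not the code in NCBI.pm: the column names (plate, plate_row, plate_col, accession) and the helper names are hypothetical, and the lookup function assumes that the DBI and DBD::ODBC modules are installed and that an ODBC data source named clones exists.

```perl
use strict;
use warnings;

# Local settings. The database and table names come from the text above;
# every column name here is hypothetical and must match your own table.
my %CLONE_DB = (
    dsn    => 'DBI:ODBC:clones',
    table  => 'clone_data',
    column => {
        plate     => 'plate',
        row       => 'plate_row',
        col       => 'plate_col',
        accession => 'accession',
    },
);

# Build the lookup statement from the settings, using placeholders
# for the plate, row, and column values.
sub accession_query {
    my $c = $CLONE_DB{column};
    return "SELECT $c->{accession} FROM $CLONE_DB{table} "
         . "WHERE $c->{plate} = ? AND $c->{row} = ? AND $c->{col} = ?";
}

# Run the lookup. DBI is loaded only when this function is called,
# so the query builder above can be checked without a database.
sub lookup_accession {
    my ($plate, $row, $col) = @_;
    require DBI;
    my $dbh = DBI->connect($CLONE_DB{dsn}, '', '', { RaiseError => 1 });
    my ($accession) = $dbh->selectrow_array(accession_query(), undef,
                                            $plate, $row, $col);
    $dbh->disconnect;
    return $accession;
}

print accession_query(), "\n";
```

With this layout, adapting to a different local database means editing only the %CLONE_DB hash.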

Testing the Installation

A sample input file has been provided (called, of all things, "sampleInput.txt") along with the sample output (you can probably guess the name) produced using our database, which was produced from the clone information provided by Research Genetics.


Documentation

Program Logic

The input to Rocket is a file containing nucleotide sequences in FASTA format, where each description line must identify the location by including values for plate, row, column. We use the location to look up the putative GenBank accession number from information supplied by the vendor.

The default assumption built into the program is that the information supplied by the vendor is likely to be correct. Thus, the algorithm is structured to try to verify correctness quickly, reverting to more complicated searches only if the initial tests fail.

The first test is to perform a BLAST2 search of the sequence determined in our laboratory against the putative GenBank accession number, using BLOSUM62 and the default parameter settings at the NCBI web site (http://www.ncbi.nlm.nih.gov:80/blast/bl2seq/wblast2.cgi). This choice of parameters does not require exact matches; thus, it will overlook minor errors in sequencing. If the BLAST2 search matches successfully, the result is written to a log file and the program moves on to consider another clone.

If the BLAST2 search fails to find a match, then Rocket uses the sequence determined in our laboratory to perform a full BLAST search against the non-redundant human database to find the ten best matches. The potential difficulty with this approach is that the accession number supplied by Research Genetics may not appear among the top ten matches. So, we look up the current UniGene cluster numbers for both the putative accession number and for each of the top ten matches to the sequence. If any of the UniGene clusters agree, then the result is logged as a match between the sequence and the putative accession number.

If the general BLAST/UniGene procedure fails to match the putative accession number, then we log an error and try to determine the correct values. The second assumption in the program is that the most likely source of errors is contamination from other clones supplied by the vendor. Thus, we try to match the UniGene clusters of the best matches to our sequence against other UniGene clusters in the information supplied by the vendor. If we find a match in this way, then we attempt to confirm the result using another BLAST2 search of our sequence against the matching accession number. If this succeeds, then it is logged as a match to a contaminating clone.

Finally, if all attempts at identifying the sequence inside the vendor's data fail, then we log an error and use the accession number of the best match from the full BLAST search to annotate the clone.
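The four stages described above can be read as a cascade that stops at the first success. The sketch below is an illustration of that control flow only, not the code in blaster.pl: the function classify_clone, the status strings, and the step names are all hypothetical, and each step is passed in as a code reference so that the example runs without contacting NCBI. The accession numbers and cluster IDs are made up.

```perl
use strict;
use warnings;

# Hypothetical sketch of the decision cascade.
# %step supplies code references for the external searches.
sub classify_clone {
    my ($seq, $putative, %step) = @_;

    # Stage 1: quick pairwise check against the supplier's accession number.
    return +{ status => 'match', accession => $putative }
        if $step{blast2}->($putative, $seq);

    # Stage 2: full search, then compare UniGene clusters.
    my @hits   = $step{blast}->($seq);                 # best match first
    my %wanted = map { $_ => 1 } $step{unigene}->($putative);
    for my $hit (@hits) {
        return +{ status => 'unigene match', accession => $putative }
            if grep { $wanted{$_} } $step{unigene}->($hit);
    }

    # Stage 3: look for a contaminating clone elsewhere in the vendor data,
    # and confirm any candidate with another pairwise check.
    for my $hit (@hits) {
        for my $cluster ($step{unigene}->($hit)) {
            my $other = $step{vendor_accession}->($cluster);
            next unless defined $other;
            return +{ status => 'contaminant', accession => $other }
                if $step{blast2}->($other, $seq);
        }
    }

    # Stage 4: fall back to the best hit from the full search.
    return +{ status => 'unverified', accession => $hits[0] };
}

# Stub searches with made-up identifiers.
my %cluster_of = (AA000001 => ['Hs.1'], XX000001 => ['Hs.2'], XX000002 => ['Hs.3']);
my %stub = (
    blast2           => sub { my ($acc) = @_; return $acc eq 'AA000002' },
    blast            => sub { return ('XX000001', 'XX000002') },
    unigene          => sub { my ($acc) = @_; return @{ $cluster_of{$acc} || [] } },
    vendor_accession => sub { my ($c) = @_; return $c eq 'Hs.3' ? 'AA000002' : undef },
);

my $result = classify_clone('ACTG', 'AA000001', %stub);
print "$result->{status} $result->{accession}\n";   # contaminant AA000002
```

Ordering the cheap pairwise check first reflects the default assumption that the vendor's annotation is usually correct.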

BLASTER Perl Script

Usage: perl blaster.pl FASTA-file REPORT-file

The input is a file containing nucleotide sequences in FASTA format, where each description line identifies the location by including values for plate, row, and column. Valid FASTA input for this program looks like

 >plate 2 C 10
 ACTGTTGCTAGT....

To start the algorithm, we look up the location to get a putative accession number from an ODBC database prepared from a file supplied by Research Genetics.

Next, we perform a BLAST2 search of the sequence against the accession number. If it matches, we log the result and finish with that sequence. If it fails to match, then we:

 (1) Do a basic BLAST search with the sequence.
 (2) Get the report ID from the response file sent by NCBI.
 (3) Loop, with a delay, resending the report ID until the report arrives.
 (4) Log the best matches.
 (5) For each match, in decreasing score order:
     (a) Do a UniGene search on the matching accession number.
     (b) Look up the UniGene number in the Rg database. If it matches, log and finish with that sequence.

The REPORT-file contains a report for each FASTA sequence contained in the input file. Reports are separated by a row of dashes.

The FASTA sequence (including the location) is copied to the report file, along with the GenBank accession number that has been read from the local database. If the BLAST2 search matches, the report says so. If not, it tells you that there is a mismatch. The program then performs a general BLAST search on the sequence and includes the top ten matching accession numbers in the report. It looks up the UniGene cluster for each accession number and reports it. Next, it looks up the UniGene number in the local database, and either reports a match (and halts) or a mismatch.

The main source of failure in this method is likely to be outdated UniGene numbers in the local database. These can be updated using the companion program called, surprisingly, updateUnigene.pl.


NCBI Perl Module

The NCBI package is a set of tools to automate the process of connecting to various databases at the National Center for Biotechnology Information and extracting relevant results from the HTML files they send in response.

This package also assumes that there is a local ODBC database called clones, which contains a table called clone_data, which was constructed from a standard file supplied by Research Genetics. Some of the constants associated with this local database may need to be customized to match the actual installation.

Object Methods

    $object = new NCBI($logfileName);
    $object = new NCBI($logfileName, $agent);

The constructor of an NCBI object requires a file name in the standard perl format. The constructor uses this name to open a log file where it can report its results. The buffering on this file is set to flush data as quickly as possible. The file will also be closed automatically when the NCBI object goes out of scope.

The constructor also takes an optional argument, containing an object of class LWP::UserAgent. If this argument is omitted, then the constructor builds a UserAgent using the default constructor.

    $agent = $ncbi->agent();
    $oldAgent = $ncbi->agent($newAgent);

The basic data element inside an NCBI object is a UserAgent. You can read the current value using the agent method. By supplying an argument to this method, you can replace the current UserAgent with a new one; in that case, the method returns the old UserAgent.

    $debug = $ncbi->debugLevel();
    $oldDebug = $ncbi->debugLevel($newDebug);

Each NCBI object also contains a data element to indicate whether it should write debugging information to the log file. At construction time, the debug level is set to zero, which indicates no debugging. You can read the current value using the method debugLevel. By supplying an argument to this method, you can set a new debug level; in that case, the old level is returned.

The debug level is a bitmask, with the bits having the following meaning:

    1 = debug UniGene searches
    2 = debug BLAST 2 searches
    4 = debug general BLAST searches
    8 = debug second step of BLAST searches
   16 = debug responses to second step of BLAST searches
   32 = debug all HTTP requests
  128 = debug searches for retired UniGene cluster numbers
  256 = debug routines that parse retired UniGene searches
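Because the debug level is a bitmask, several kinds of debugging can be switched on at once by combining values with bitwise OR, and a single kind can be tested with bitwise AND. The sketch below shows the arithmetic only. The bit values are copied from the list above, but the constant names are hypothetical and do not appear in NCBI.pm.

```perl
use strict;
use warnings;

# Bit values from the list above; the names are made up for this example.
use constant {
    DEBUG_UNIGENE => 1,
    DEBUG_BLAST2  => 2,
    DEBUG_BLAST   => 4,
    DEBUG_HTTP    => 32,
};

# Turn on UniGene and HTTP debugging together.
my $level = DEBUG_UNIGENE | DEBUG_HTTP;
printf "level = %d\n", $level;                      # level = 33

# Test individual bits.
print "HTTP debugging is on\n"    if     $level & DEBUG_HTTP;
print "BLAST2 debugging is off\n" unless $level & DEBUG_BLAST2;

# With an NCBI object in hand, this value would be passed as
# $ncbi->debugLevel($level), following the signature shown earlier.
```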

    $ncbi->writeLog($message);

NCBI objects use the writeLog method to record all their results and any optional debugging information. Users of the package can add their own information to this file using the same method.

    @location = $ncbi->parseDefinitionLine($FASTA)

The argument is a nucleotide sequence in FASTA format, where the definition line includes (1) the explicit word "plate" and (2) a plate number, row, and column, separated by spaces. The plate and column are integers; the row is lowercase alphabetic. The definition line is parsed, and the three components are returned as elements of an array. We return undef if the FASTA format is violated, and an empty list if we get too few items.
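To make the parsing rules concrete, here is a minimal stand-in written for this page. It is not the code in NCBI.pm, and the name parse_definition_line is hypothetical. Two choices are assumptions rather than documented behavior: the row letter is accepted in either case and returned in lowercase, and a line with enough items but malformed values also yields an empty list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parseDefinitionLine.
sub parse_definition_line {
    my ($fasta) = @_;

    # A FASTA record must begin with '>' followed by the definition line.
    return undef unless defined $fasta;
    my ($header) = $fasta =~ /\A>([^\n]*)/;
    return undef unless defined $header;

    # Expect: the word "plate", then plate number, row, and column.
    my ($word, $plate, $row, $column) = split ' ', $header;
    return () unless defined $column;                  # too few items

    # Assumption: malformed values are treated like too few items.
    return () unless lc($word) eq 'plate'
                  && $plate  =~ /\A\d+\z/
                  && $row    =~ /\A[A-Za-z]+\z/
                  && $column =~ /\A\d+\z/;

    return ($plate, lc($row), $column);
}

my @location = parse_definition_line(">plate 2 C 10\nACTGTTGCTAGT\n");
print join(' ', @location), "\n";                      # 2 c 10
```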

Routines for local database access

      $accession = $ncbi->getAccession($plate, $row, $column)

The arguments are elements designating (plate, row, column). These are used to look up the GenBank accession number in a database built from a file provided by Research Genetics. The returned value is zero if any part of the database lookup fails.

      $hit = $ncbi->compareUnigene($unigene, $plate, $row, $column)

The first argument is a UniGene cluster ID. The other arguments are elements designating (plate, row, column). The return value is a string, which is empty if that UniGene cluster does not live on that plate, and otherwise contains the location and accession number of a match.

    $hashref = $ncbi->localUnigene();

The localUnigene method reads the local database and builds a hash whose keys are the (meaningless) primary keys in the local database, and whose values are triples [primaryKey, unigeneCluster, genBank]. The return value is a reference to this hash.
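Since the hash is keyed by an arbitrary primary key, a caller who wants to search by UniGene cluster will usually re-index it first. The sketch below shows one way to do that, using made-up data in the shape described above; the identifiers are not real, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Made-up data shaped like the return value of localUnigene:
# primaryKey => [primaryKey, unigeneCluster, genBank]
my $local = {
    17 => [17, 'Hs.1001', 'AA000001'],
    18 => [18, 'Hs.1002', 'AA000002'],
    19 => [19, 'Hs.1001', 'AA000003'],
};

# Re-index the accession numbers by UniGene cluster.
my %by_cluster;
for my $key (sort { $a <=> $b } keys %$local) {
    my (undef, $cluster, $genbank) = @{ $local->{$key} };
    push @{ $by_cluster{$cluster} }, $genbank;
}

print join(', ', @{ $by_cluster{'Hs.1001'} }), "\n";   # AA000001, AA000003
```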

Routines to query NCBI web site

       ($howmany, @matches) = $ncbi->doUniGene($accession)

The argument is a GenBank accession number. The return value is a number and a list of at most ten matches to human entries in UniGene.

       ($active, $response) = $ncbi->checkUnigene($unigene)

The argument is a UniGene cluster number. The first return value is true when this is a valid, active UniGene ID, and is false otherwise. The second return value is the content of the HTML response file from the NIH, which can be used to find the new cluster number if you already have a GenBank accession number. You can distinguish a failure to get a response from NCBI (with an undef $response) from a retired cluster.

    $match = $ncbi->parseCluster($genbank, $content);

The first argument is a GenBank accession number. The second argument is the content of an HTML response received from a call to the checkUnigene method. The parseCluster method scrabbles around inside an HTML table to locate the actual results of the database search. It compares the results with the accession number, and returns a boolean value. If the UniGene number used in the initial search matches the supplied accession number, then the return value is true. If the accession number no longer matches, then the return value is false. If the content argument is improperly structured, then the return value is undefined.

       ($resultCode, @response) = $ncbi->doBLAST2($genBank, $FASTA)

The arguments are (1) a GenBank accession number, represented as a string, and (2) a nucleotide sequence in FASTA format, represented as a string.

This method submits an HTTP request to NCBI to conduct a BLAST2 search comparing the sequence to the accession number. The parameters used in the request are the defaults, as found by reading the source of the web page $NCBI/gorf/bl2.html.

The return value is (1) an integer result code that is 0 in the event of an error, positive in the event of a match, and negative in the event of a mismatch, and (2) an array of lines that, in the latter two cases, is the response file.

       ($result, @details) = $ncbi->doBLAST($FASTA)

The argument is a string representing a nucleotide sequence in FASTA format. The return value consists of (1) a result code that is nonzero only in the event of success and (2) an array of details: for failure, @details contains an explanation; for success, @details is an array of matching accession numbers.

This subroutine does a BLAST search looking for the best 10 matches to the sequence. This is complicated since the initial request is queued at NCBI. You must parse the response to find the ID needed to get the result, and loop (with a delay) sending calls to retrieve it when it finally becomes available.
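The submit-then-poll pattern described here can be sketched independently of NCBI. The example below is an illustration under assumptions, not the code in NCBI.pm: the function poll_for_result, its delay and attempts options, and the default values are hypothetical, and the fetch step is a stub that becomes ready on the third call.

```perl
use strict;
use warnings;

# Hypothetical helper. $fetch is a code reference that returns the
# result once it is ready and undef while the request is still queued.
sub poll_for_result {
    my ($fetch, $rid, %opt) = @_;
    my $delay    = defined $opt{delay}    ? $opt{delay}    : 10;  # seconds
    my $attempts = defined $opt{attempts} ? $opt{attempts} : 30;

    for my $try (1 .. $attempts) {
        my $result = $fetch->($rid);
        return $result if defined $result;
        # Wait before the next attempt, but not after the last one.
        sleep $delay if $delay > 0 && $try < $attempts;
    }
    return;    # gave up: the result never became available
}

# Stub that is "queued" for two calls and ready on the third.
my $calls = 0;
my $fake_fetch = sub {
    my ($rid) = @_;
    return ++$calls < 3 ? undef : "report for $rid";
};

my $report = poll_for_result($fake_fetch, 'RID123', delay => 0);
print "$report after $calls calls\n";   # report for RID123 after 3 calls
```

Capping the number of attempts keeps a lost request from stalling the whole run.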

       $RID = $self->_initBlast($ua, $FASTA)

The arguments are passed along from doBLAST. This subroutine submits the initial request to NCBI and parses the response to get the ID needed to recover the result. The return value is the ID, or 0 if some error occurs. The parameters used in the request have names that were determined by reading the source of the web page $NCBI/blast/blast.cgi?Jform=0

       ($resultCode, @response) = $self->_getBlastResult($ua, $RID)

The arguments are (1) an object of class LWP::UserAgent and (2) a result ID from an NCBI BLAST search. The return values are (1) a result code that is nonzero only in the event of a successfully identified response and (2) an array containing the lines of the HTML response.

       @matches = $self->_parseBlastResult(\@content)

The argument is a reference to an array, each of whose entries is one line of an HTML file returned from a BLAST search. The return value is an array of matching accession numbers, possibly empty.

Auxiliary routines to prepare requests

       $request = $self->_prepareRequest($userAgent, $URL, $parameters)

The arguments are (1) an object of type LWP::UserAgent, (2) a valid URL, and (3) a reference to a hash of parameter values. The return value is an object of type HTTP::Request, which represents a POST request with the parameters properly URL-encoded.

       $coded = _URLencode($stuff)

The argument is a string of ASCII characters. The return value is a single line representing the same characters, properly URL-encoded.

       $encoded = _charcode(n)

The argument is an integer which is the ASCII code of a character. The output is a string representing the URL-encoded form of that character. We actually encode a few more characters than absolutely necessary, but who really cares?
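For readers who want to see what the encoding amounts to, here is a minimal sketch of the two helpers plus the assembly of a form body. These are stand-ins written for this page, not the code in NCBI.pm: the function names, the exact set of characters left unencoded, the use of + for spaces, and the parameter names in the example are all assumptions.

```perl
use strict;
use warnings;

# Percent-encode one character, given its ASCII code.
sub char_code {
    my ($n) = @_;
    return sprintf '%%%02X', $n;
}

# Encode a string: keep letters, digits, and a few safe punctuation
# marks, turn spaces into '+', and percent-encode everything else.
# Newlines are encoded too, so the result is always a single line.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~ -])/char_code(ord $1)/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Join encoded name=value pairs into the body of a POST request.
sub form_body {
    my ($params) = @_;
    my @pairs;
    for my $name (sort keys %$params) {
        push @pairs, url_encode($name) . '=' . url_encode($params->{$name});
    }
    return join '&', @pairs;
}

print char_code(62), "\n";                            # %3E
print url_encode(">plate 2 C 10\nACTG"), "\n";        # %3Eplate+2+C+10%0AACTG
print form_body({ sequence => 'AC GT', note => '50%' }), "\n";
# note=50%25&sequence=AC+GT
```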

Support

For Frequently Asked Questions, Bug Reports, and other concerns, please visit the forum at this link