Sequence Quality Check Web Site

Given a sequence or a list of sequences in FASTA format, this website predicts whether each sequence represents a real gene with a functional product or is more likely to originate from contamination or transcriptional noise.

Procedure behind the scenes:

  1. Dinucleotide frequencies (DFs) are calculated for each sequence.
  2. The DFs are used as predictors in a Quadratic Discriminant Analysis (QDA) classifier built in MATLAB. The QDA distinguishes two predefined classes: real genes, which we call PSL (PET-sequence-like, where PET stands for prevalently expressed transcript), and intergenic regions, which we call GSL (genomic-sequence-like).
  3. The prediction is repeated five times with mutually exclusive training sets. Each training set consists of one GSL class, containing 2400 DF vectors of genomic sequences ranging from 250 to 4000 bp, and one PSL class, containing 2400 DF vectors of prevalently expressed transcripts ranging from 250 to 4000 bp.
  4. The posterior probability that the sequence belongs to PSL, averaged over the five predictions, is reported. One minus this number is the probability that the sequence originates from an intergenic region, suggesting contamination or transcriptional noise.
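
Step 1 can be illustrated with a minimal Python sketch. It is not the code running on this site (the classifier was built in MATLAB). The function names are hypothetical, and the choices to count overlapping windows, ignore case, and skip windows containing characters other than A, C, G or T are assumptions not stated in the procedure.

```python
from itertools import product

BASES = "ACGT"
# The 16 dinucleotides in a fixed order: AA, AC, AG, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dinucleotide_frequencies(sequence):
    """Return the 16 dinucleotide frequencies of a DNA sequence as a list.

    Assumptions: overlapping windows are counted, case is ignored, and any
    window containing a character other than A, C, G or T is skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        # No valid dinucleotide (empty or fully ambiguous sequence).
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]
```

Each FASTA record then yields one 16-element vector, for example via `[dinucleotide_frequencies(s) for _, s in read_fasta(text)]`.
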
For any questions about the QDA procedure, please contact Jiexin Zhang (jiexinzhang@mdanderson.org).

Corresponding publication:

    Jiexin Zhang, Li Zhang, Kevin Coombes. Signatures of Gene Sequences Revealed by Mining the UniGene Affiliation Network. Bioinformatics 2006; 22: 385-391.
