### Department of Bioinformatics and Computational Biology


This project is archived and no longer maintained.

Sequence Quality Check

Overview

• Description: This service provides a tool to predict whether sequences represent real genes with functional products or possibly contamination/transcriptional noise given either a sequence, or a list of sequences in FASTA format.

Development Information

• URL: http://bioinformatics.mdanderson.org/SequenceQualityCheck/
• Current version: 1.0

Help and Support

• Contact: MDACC-Bioinfo-IT-Admin@mdanderson.org

## Sequence Quality Check

Given a sequence or a list of sequences in FASTA format, this website provides a tool to predict whether they represent real genes with functional products or instead originate from contamination or transcriptional noise.

Procedure behind the scenes:

• Dinucleotide frequencies (DFs) were calculated for each sequence.
• The DFs were used as predictors in a Quadratic Discriminant Analysis (QDA) classifier built in MATLAB. QDA was used to distinguish two predefined classes: real genes, which we call PSL (PET-sequence-like, where PET stands for prevalently expressed transcript), and intergenic regions, which we call GSL (genomic-sequence-like).
• The prediction was repeated five times with mutually exclusive training sets. Each training set consisted of one GSL class, comprising 2400 DF vectors from genomic sequences 250 to 4000 bp long, and one PSL class, comprising 2400 DF vectors from prevalently expressed transcripts 250 to 4000 bp long.
• The posterior probability that the sequence belongs to PSL, averaged over the five predictions, is reported. One minus this number is the probability that the sequence originates from an intergenic region, suggesting contamination or transcriptional noise.
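The first and last steps above can be sketched in Python. This is a minimal illustration, not the original MATLAB code: the function names `dinucleotide_frequencies` and `summarize_posteriors` are hypothetical, and counting overlapping dinucleotides while skipping pairs that contain ambiguity codes is an assumption about details the page does not state. The QDA classifier itself is not reproduced here, so the five posterior values in the usage note are made-up inputs.

```python
from itertools import product

# The 16 possible dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the 16 dinucleotide frequencies (DFs) of a DNA sequence.

    Assumption: dinucleotides are counted in overlapping windows, and any
    pair containing a non-ACGT letter (for example N) is skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")
    return [counts[d] / total for d in DINUCLEOTIDES]


def summarize_posteriors(psl_posteriors):
    """Average the PSL posteriors from the repeated predictions.

    Returns (P(PSL), P(GSL)), where P(GSL) = 1 - P(PSL).
    """
    p_psl = sum(psl_posteriors) / len(psl_posteriors)
    return p_psl, 1.0 - p_psl
```

For example, `dinucleotide_frequencies("ACGT")` gives a 16-element vector in which AC, CG, and GT each have frequency 1/3, and `summarize_posteriors([0.9, 0.8, 1.0, 0.7, 0.6])` averages five hypothetical classifier outputs into a PSL probability of about 0.8 and a GSL probability of about 0.2.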

### Corresponding publication

Jiexin Zhang, Li Zhang, Kevin Coombes.
Signatures of Gene Sequences Revealed by Mining the UniGene Affiliation Network. Bioinformatics 2006; 22: 385-391.

### Sample Input Data

The input file should contain one or more DNA sequences in FASTA format: each sequence begins with a single-line description, followed by lines of sequence data. Each sequence must contain a minimum number of letters; otherwise, no prediction is made because the sequence is too short.

• The description line starts with a greater-than symbol (“>”).
• The word immediately following the greater-than symbol (“>”) is the “ID” (name) of the sequence; the rest of the line is the description.
• The “ID” and the description are optional.
• All lines of text should be shorter than 80 characters.
• A sequence ends when another greater-than symbol (“>”) appears at the beginning of a line, which starts the next sequence.
• Sequences are expected to be represented in the standard IUB/IUPAC nucleic acid codes.
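The format rules above can be illustrated with a small Python parser. This is a sketch under stated assumptions rather than the parser the service used: `parse_fasta` and `_make_record` are hypothetical names, and upper-casing the sequence and accepting sequence lines that appear before any description line are choices made for this example only.

```python
def _make_record(header, chunks):
    """Split a header into (id, description) and join the sequence lines."""
    if header and not header[0].isspace():
        # The word immediately after ">" is the ID; the rest is the description.
        parts = header.split(None, 1)
        seq_id = parts[0]
        description = parts[1] if len(parts) > 1 else ""
    else:
        # No word directly after ">": the ID is empty (it is optional).
        seq_id, description = "", header.strip()
    return seq_id, description, "".join(chunks).upper()


def parse_fasta(text):
    """Parse FASTA text into a list of (id, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts the next.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        else:
            if header is None:
                header = ""  # assumption: tolerate a missing description line
            chunks.append(line.strip())
    if header is not None:
        records.append(_make_record(header, chunks))
    return records
```

Applied to input such as the sample data on this page, `parse_fasta` would return one tuple per sequence, with the ID `gi|58380931|ref|XM_310883.2|` and the description `Anopheles gambiae str` for the first record.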

Sample data:

>gi|58380931|ref|XM_310883.2| Anopheles gambiae str
CAGGGTTCGGACCCAAGGGTCGCCACCTGTCGCGGCAAGCTGCAGAGCAAGCGGTGCAAGCTGAACCAGG
AAATCAACAAGGAGCTCCGGTTGCGGGCCGGTGCCGAAAACCTTTACAAGGCCACCACGAACAAGAAGCT
CAAGGACACGGTCGCACTCGAGCTGAGCTTCGTCAACTCGAACCTGCAGCTGCTGAAGGAGCAGCTGTCC
GAGCTGAACTCCTCCGTCGAGATCTACCAAAGTGAAGGCCTCGACTACGTTATACCGATGATACCGCTCG
GGCTGAAGGAAACGAAGGAGGTCAACTTTATGGAACCGTTCTCGGACTTTATTCTGGAGCACTACAGCGA
GCCGTCGCACATCTACGAGGACGCGATCGCCGACATTACCGACACGAGACAGGCCGCCAAAACGCCGACC
CGCGATGCGCAGGGCGTTTCGCTGCTGTTCCGCTACTACAACCTGCTGTACTACGTCGAGCGGCGCTTCT
TCCCGCCCGATCGCAGCCTGGGCGTGTACTTCGAATGGT
>gi|58378162|ref|XM_308283.2| Anopheles gambiae str
TTCACCGCAAACCTGCAGGGCGATTACATCAAGCATCCCGTGCTGTACGAGCTGAGCCACAAGTACGGCC
TGCCGGACAATGTGTCCGAGCAGCTGCTGCCGGACCGGCTGGAGGAGATCAAGGAGGCGATCCGGCGCGA