Department of Bioinformatics and Computational Biology

PRADA:Overview

From MD Anderson Bioinformatics

PRADA
Overview
Description: PRADA is a pipeline that analyzes paired-end RNA-Seq data to generate gene expression values (RPKM) and gene-fusion candidates.

Development Information
Language: Python
Current Version: 1.1
Platforms: Un*x (OpenPBS)
License: MIT
Status: Active
Last Updated: April 2013

References
Citations: No formal publications

Help and Support
Contact: Roel Verhaak


Massively parallel sequencing of cDNA reverse-transcribed from RNA (RNA-seq) provides an accurate estimate of the quantity and composition of mRNAs. We developed PRADA to characterize the transcriptome through the analysis of RNA-seq data. PRADA focuses on the processing and analysis of gene expression estimates, supervised and unsupervised gene fusion identification, and supervised intragenic deletion identification. The BAM files generated by the pipeline are compatible with a range of mutation-calling tools and can be used to obtain read counts for downstream analysis.
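
The description above gives RPKM (reads per kilobase of transcript per million mapped reads) as the unit for expression values. As a minimal sketch of the standard RPKM definition, and not necessarily the exact computation PRADA performs internally, the function below normalizes a read count by gene length and sequencing depth. The function and parameter names are illustrative assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of gene length per million mapped reads.

    Illustrative helper only; the names are hypothetical and this is not
    taken from the PRADA source code.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Example: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```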


Modules

PRADA currently supports six modules to process RNA-seq data and identify abnormalities:

preprocess : Generates aligned and recalibrated BAM files.
fusion : Identifies candidate gene fusions.
guess-ft : Supervised search for fusion transcripts.
guess-if : Supervised search for intragenic rearrangements.
homology : Calculates homology between two given genes.
frame : Predicts the functional consequence of a fusion transcript.

Documentation

A detailed description of the installation steps and the usage of each module, with examples, is available in the documentation.

Installation

PRADA is written in the Python programming language and is intended to run in a command-line environment on UNIX or Linux operating systems. To run pyPRADA, download the pre-compiled package and unzip it to your preferred installation location.
Combined genome and transcriptome reference files are available for download:

HG19

A sample FASTQ file and the resulting BAM file are also available: Sample files

Once the reference files are downloaded and extracted, generate index files for all the FASTA files in the reference folder:

[pyPRADA_DIR]/tools/bwa-0.5.7-mh/bwa index -a bwtsw [HG19]/Ensembl64.transcriptome.fasta
[pyPRADA_DIR]/tools/bwa-0.5.7-mh/bwa index -a bwtsw [HG19]/Ensembl64.transcriptome.formatted.fasta
[pyPRADA_DIR]/tools/bwa-0.5.7-mh/bwa index -a bwtsw [HG19]/Ensembl64.transcriptome.plus.genome.fasta
[pyPRADA_DIR]/tools/bwa-0.5.7-mh/bwa index -a bwtsw [HG19]/Homo_sapiens_assembly19.fasta
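
The four indexing commands above follow one pattern, so they can be generated rather than typed by hand. The sketch below is one possible way to do this. It assumes that every file to index ends in `.fasta` and sits directly in the reference folder. The function names and the `bwa_rel` parameter are hypothetical helpers, not part of PRADA.

```python
import subprocess
from pathlib import Path


def build_index_commands(prada_dir, ref_dir, bwa_rel="tools/bwa-0.5.7-mh/bwa"):
    """Return one `bwa index -a bwtsw <file>` argument list per *.fasta file.

    prada_dir : the pyPRADA installation directory ([pyPRADA_DIR] above)
    ref_dir   : the extracted reference folder ([HG19] above)
    bwa_rel   : path of the bundled bwa binary relative to prada_dir
    """
    bwa = str(Path(prada_dir) / bwa_rel)
    fasta_files = sorted(Path(ref_dir).glob("*.fasta"))
    return [[bwa, "index", "-a", "bwtsw", str(f)] for f in fasta_files]


def run_index_commands(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `build_index_commands` only builds the commands, so the list can be printed and checked before `run_index_commands` starts the indexing.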

Set the configuration file (ref.txt):

#reference files
compdb_fasta	[HG19_REF]/Ensembl64.transcriptome.plus.genome.fasta
compdb_fai	[HG19_REF]/Ensembl64.transcriptome.plus.genome.fasta.fai
compdb_map	[HG19_REF]/Ensembl64.transcriptome.plus.genome.map
genome_fasta	[HG19_REF]/Homo_sapiens_assembly19.fasta
genome_gtf	[HG19_REF]/Homo_sapiens.GRCh37.64.gtf
dbsnp_vcf	[HG19_REF]/dbsnp_135.b37.vcf
select_tx	[HG19_REF]/Ensembl64.selected.transcripts
feature_file	[HG19_REF]/Ensembl64.canonical.gene.exons.tab.txt
tx_seq_file	[HG19_REF]/Ensembl64.transcriptome.fasta
ref_anno	[HG19_REF]/Ensembl64.transcriptome.annotations
ref_map	[HG19_REF]/Ensembl64.transcriptome.formatted.map
ref_fasta	[HG19_REF]/Ensembl64.transcriptome.formatted.fasta
cds_file	[HG19_REF]/ensembl.hg19.cds.txt
txcat_file	[HG19_REF]/Ensembl64_primary_transcript.txt

#Preprocess step parameters
pbs_queue	long						#queue name, for preprocessing module
pbs_email	userid@mdanderson.org   	#email used in PBS for notification
parallel_n_threads	24					#number of cores used in alignment and recalibration
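
A mistyped path in the configuration file typically only surfaces once a job is already running, so it can help to check the file beforehand. The sketch below is an illustrative checker, not PRADA's own configuration loader. It assumes the layout shown above: one `key value` pair per line separated by whitespace, with `#` starting a comment (so a path containing `#` would be truncated). All function names are hypothetical, and the set of non-path keys covers only the three preprocess parameters listed above.

```python
import os


def parse_prada_config(text):
    """Parse 'key<whitespace>value' lines into a dict, ignoring '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of tabs/spaces
        if len(parts) != 2:
            raise ValueError("expected 'key value', got: %r" % raw)
        key, value = parts
        settings[key] = value.strip()
    return settings


def missing_reference_files(settings, exists=os.path.exists):
    """Return the keys whose value is a path that does not exist."""
    # Keys shown above that hold settings rather than file paths.
    non_path_keys = {"pbs_queue", "pbs_email", "parallel_n_threads"}
    return sorted(
        key for key, value in settings.items()
        if key not in non_path_keys and not exists(value)
    )
```

With the file contents read into `text`, `missing_reference_files(parse_prada_config(text))` lists every reference entry whose file cannot be found.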