Department of Bioinformatics and Computational Biology

SpliceSeq:Methods

From MD Anderson Bioinformatics
Jump to: navigation, search

Contents

Methods

SpliceSeq is a visual tool that supports exploration of a single sample's transcriptome from RNA-Seq data or comparative analysis of transcriptomes from pairs of samples. SpliceSeq aligns sample reads to a database of known splicing patterns represented as gene transcript splice graphs. The splice graphs are constructed by our SpliceToolDBBuild process and stored in a relational MySQL database that is distributed with the tool. A custom Bowtie database generated from the splice graphs is optimized for alignment of a particular sample. Our SpliceToolAnalyzer program is then used to process the sample’s RNA-Seq data. SpliceToolAnalyzer uses Bowtie to align reads to the splice graph sequences and stores summary statistics for the sample in the MySQL database. The summary statistics include normalized count values for genes, exons, splices, attributes, etc. The SpliceToolAnalyzer detects and classifies alternative splicing and evaluates the impact of splicing changes on protein products. SpliceToolAnalyzer is also called by SpliceSeq to compare samples. The comparison process identifies changes in splicing patterns between samples, classifies them, and evaluates the impact of those changes on known protein features.

The following sections describe each of these analytic steps in greater detail.

Splice Graph Build

We construct a reference set of splice graphs and distribute them with the SpliceSeq package. The Splice Graph Build process constructs splice graphs with exons as nodes and splices as edges for each gene. Splice graphs are constructed from public databases of gene models that represent known transcript variation. Existing gene models are used to take advantage of annotations that assist with interpretation of alternative splicing impacts. A key feature of our splice graphs is that they are constructed such that each piece of transcript sequence is represented in a single location in the graph. If an exon has both long and short forms, it is split into two sub-exons, and the common portion is represented only once in the graph. That feature allows each read to be aligned unambiguously to a single location in the splice graph.

SpliceGraphAssembly.PNG

The example above demonstrates how four transcript isoforms of a gene are collapsed into a single non-redundant splice graph. The thin exon sections are untranslated regions and the thick exon sections are coding regions. In our graphics, exons are drawn to scale and connecting lines / arcs represent splice paths. In this example, the gene has alternate promoters, exons 1 and 2 that both splice to exon 3. Exon 4 is an alternative exon that is sometimes included and sometimes spliced out. Exon 5 has an alternate 5' donor site so it is split into sub-exons 5.1 and 5.2. The splice graph provides a succinct representation of all transcript splice paths represented in the source models and allows for additional permutations of observed patterns.

Gene models are downloaded from the UCSC Genome Browser database. Several different types of gene models are available from UCSC including RefSeq, UCSC Gene, Sanger Vega, Ensembl, and NCBI AceView. Any of these may be used by the Splice Graph Build process to construct splice graphs. Currently SpliceSeq is distributed with graphs that are constructed from Ensembl models. Ensembl models produce interpretable graphs with strong annotation and a nice balance between inclusion of alternative splices and complexity. We also provide a build with AceView transcripts which has the widest diversity of exons and splices but is a bit more complex and sometimes consolidates overlapping genes into a single gene complex.


The steps performed by the Splice Graph Build are:

  1. Use FTP to download gene model data from UCSC Genome Browser Tables. Currently using Ensembl (ensGene, ensPep, and ensemblToGeneName)
  2. Download chromosome sequence from NCBI
  3. Construct the SpliceSeq data schema
  4. For each gene, merge gene model transcripts into a single exon landscape map and splice list using genomic coordinates of each model exon.
  5. Generate a consolidated splice graph for each gene. Split exons into sub-exons for contiguous regions with multiple points of splice in/out paths.
  6. Annotate models with attributes for UTR, Coding, and transcript start/stop locations.
  7. Retrieve and store nucleic sequence for each exon.
  8. Download and store UniProt protein sequence and position specific feature information for each splice graph associate via gene symbol.

Alignment Sequence Database

Bowtie is used to align mRNA-Seq reads to our database of splice graphs. The Bowtie alignment database is constructed from our splice graphs using a tiled window approach to create sequence records from each possible splice path within the window. Windows are used to control the combinatorial explosion that can occur when traversing the permutations of large genes with many alternative splice paths. The window size used to generate the Bowtie database is optimized based on sample read length (generally windows are 2x read length - 1 or 2x insert length – 1 for paired-end reads). Tiles are overlapped using a step size of the read length to ensure that a read falling at any position on the graph will be able to align.

A recursive graph traversal is performed to ensure that all paths are represented in the sequence database. Where possible, redundant sequence records are eliminated. The header of each sequence record is encoded with positional information about the exons and splice junctions contained in the sequence record. The header information allows quick identification of exons and junctions aligned to a given read in the Bowtie results. Even reads that cross multiple exons and splice junctions will align, allowing accurate measurement of exon and junction expression levels in regions with small exons and complex splicing patterns.


SmallEx.PNG

In the above example, mapping reads across exon 3 splice junctions is key to understanding the gene's splice pattern, but exon 3 is too short to align to reads directly. A simple exon-exon splice junction database will also not align well because the exon 3 end of the junction will be too short to align fully with most reads that cross it. The approach we take in constructing the database includes all permutations of sequence that cross exon 3, optimized to match the sample read length. For this sample gene, database sequence would be generated for exon 1 to 3 to 4, exon 1 to 3 to 5, exon 2 to 3 to 4, and exon 2 to 3 to 5. Hence, a read of any location across exon 3 will align and be mapped to the appropriate junctions.

Single Sample Analysis

The first step in SpliceSeq analysis is to load one or more mRNA-Seq samples. Sample reads are aligned to the splice graphs using Bowtie as described above. Totals are collected for the number of times an exon or splice is observed. Exon/Splice totals are normalized by region length in 1000 bases (K) and by the total number of reads in the sample in millions (M). Note that a single read may traverse multiple exons and splices allowing a read to contribute to the observation totals of multiple graph elements. Normalized exon/splice totals are called "Observations Per Kilobases of exon/splice per Million aligned reads (OPKM)". A read must include at least 4 bases of an exon to count the exon and related splice as "observed". Raw and normalized totals for exons, splices, and certain attributes are saved to the sample tables in the database so that the load process needs to be performed only once for each sample. The exon and splice totals stored in the tables are used to present SpliceSeq sample view visualizations of genes. Next, gene level analysis is done. Each gene’s expression level is computed in reads per kilobase of transcript length per million aligned reads (RPKM). Gene RPKM values count each read that uniquely maps to the gene once. The transcript length used for the RPKM calculation includes only exons that are expressed above a background level of 1% of average exon expression. Gene expression levels are important for comparison processing when analyzing splicing difference between two samples. After gene level RPKM values are calculated, a second pass is performed to allocate non-unique read alignments. Some reads can’t be uniquely aligned to a single gene. This happens most often in families of closely related genes because they contain regions of strong sequence similarity. If all non-unique reads are discarded, the splice graphs of genes that share sequence similarity will have “holes” in the graph where exons or splices are not able to get reads. This makes subsequent graph traversal for protein sequence analysis impossible. To remedy this situation, we proportionally allocate degenerate reads based on the relative RPKM values of the genes to which the read aligns. For example, if there are 100 reads that align equally to exon 1 of gene A and exon 3 of gene B and gene A has an RPKM of 10 and gene B has an RPKM of 30, then 10/40 or ¼ of the 100 reads would be assigned as gene A exon 1 observations and 30/40 or ¾ of the 100 reads would be assigned as gene B exon 3 observations. Next, a weighted recursive traversal of the graph is done to predict transcript and protein sequences that are likely to result from the observed exons and splices. If multiple transcript start exons have reads, they are weighted based on the relative OPKM values. For example if one transcript start exon has an OPKM of 5, and an alternate start exon has an OPKM of 15, then the first gets a weight of 5/(5+15) = 25%. If alternate splice paths have reads, they are weighted in a similar fashion. By multiplying the weights of the start exon and alternate splices during traversal of the graph, we estimate a transcript level expression percentage. Transcript sequences are merged if they are identical, and they are discarded if the transcript expression level is less than 0.5%.

PathWeight.PNG

In the example above, there are two start exons and an alternative exon. The percentages shown in the graph represent the weighting of start exons and alternative splices based on observations from the reads. The most heavily weighted path starts with exon 2 and excludes the alternative exon 4. By multiplying the weights along that path, we find a predicted isoform prevalence of 68% for the sample. Other isoforms are weighted as indicated. It is possible that cellular mechanisms co-regulate splicing events that we treated by this method as independent. For example, maybe transcripts starting with exon 1 always include the alternate exon 4. In many cases, multiple splicing events are separated by more than the read length or insert length, making it impossible to identify co-regulation with current sequencing platforms. The SpliceSeq algorithm provides a good approximation of probable transcript composition and supports identification of significant shifts in splicing patterns that affect functional elements of the gene.

The transcript sequences produced by the weighted graph traversal are also translated into predicted protein sequence. The translation is done by identifying the start codon and translating the spliced sequence until a stop codon is encountered. Start codon identification is difficult because the first ATG is usually, but not always, the start codon. In constructing the splice graphs, we save the annotated start codon positions to take advantage of existing knowledge, rather than relying solely on computational methods. When conflicting start codons occur in the source gene models, the start codon with higher prevalence in the source gene models is used for translation. When stop codons occur 50+ base pairs upstream from a splice junction, the sequence is marked as a likely target for nonsense mediated decay. Redundant protein sequences are combined, and protein variant concentration estimates are summed from the original transcript concentration percentages.

The protein sequences produced in the preceding step are then aligned to the gene's UniProt sequence. The alignment allows us to identify the portions of the UniProt sequence that are / are not included in a given protein variant and, by extension, which protein features are present / removed. UniProt annotations identify domains, motifs, structural elements, and binding regions that can help identify functional impacts of alternative splicing. By aligning the protein sequences modified by alternative splicing to the UniProt sequence, we can identify functional elements affected by splicing and can estimate the magnitude of the change from the variant’s concentration estimate. For example, in the case diagrammed below, a splice form estimated at 56% of the gene expression is lacking its localization signal peptide. UniProt features are characterized by type to allow filtering for splice events that affect certain types of protein features.


ProtAnnot.PNG

In this example, two proteins are produced by HLA-DRB5 in a heart tissue sample. An alternate promoter removes some of the initial protein sequence in one variant. By aligning to UniProt features, we see that the alternate promoter results in loss of the protein's Beta-1 region, a signal peptide, a disulfide bond, and an extracellular topological domain.

Next, the gene splice graph statistics are evaluated by a splice event detection module, which identifies and characterizes splice events. It looks for evidence of alternate promoters, alternate terminators, exon skipping, mutually exclusive exons, and premature stops. For single samples, the module is fairly simple. If multiple exons that were marked in the original models as transcript start exons have 2 or more reads, then alternate promoters are used by the gene. If an exon has reads on multiple outgoing splices, then alternate splicing is occurring, and further path-following logic identifies the type of alternative splicing. Splicing events are identified and typed to assist with interactive investigation of the sample.


SpliceDetect.PNG

This figure shows an example of an exon skip event. The first exon has two splice-out paths with reads. The longer path travels through an exon and meets up with the target of the longer splice, so the detector calls the event an exon skip. In the second example, two exons identified as transcript start exons both have reads and converge at a common location downstream in the graph, so the detector identifies that there are alternate promoters. Different detector algorithms identify each type of alternative splice event using the gene's splice graph.

Finally, calculations are performed to estimate the magnitude of alternative splicing. It is useful when sorting genes to identify splice changes that affect large numbers of transcripts. The Alt Splice RPKM value is an estimate of the quantity of transcripts that include an alternative splicing path. The Alt Splice RPKM may be compared to the gene level RPKM to evaluate the proportion of transcripts that include the alternative splice pattern. The alternative splice RPKM is conservatively estimated using the largest individual minor path OPKM total. For example, if an exon inclusion path has an average OPKM of 150 and the exon exclusion path has an average OPKM of 50, the alternative splice RPKM is reported as 50. The same calculations are done for splice changes that affect protein product to produce an estimate of the magnitude of splicing changes that affect protein product (coding region changes rather than UTR changes). This figure is called the Alt Protein RPKM value.

Compare Samples

A primary motivation for the construction of SpliceSeq was to assist in identification of alternative mRNA splicing that has biological relevance. A key goal, then, is the ability to compare samples and identify changes in splicing patterns that are likely to lead to protein alterations that affect function. The difficulty in comparing sequence read counts between two samples is that they often differ greatly in expression level of a gene. That difference changes the observed read frequencies of exons and splices between samples but is not an indication of alternative splicing, just differential expression. To identify splicing changes between samples while controlling for expression differences, we borrow a technique from exon microarray analysis, the splice index. The splice index is an expression level-adjusted ratio of OPKM values for an exon or splice.

SI = log2 ((sample 1 exon OPKM / sample 1 gene RPKM) / (sample 2 exon OPKM / sample 2 gene RPKM))


In addition to the splice index, a p-value is calculated for each exon and splice junction using a Fisher's exact or 2x2 Chi Square test of OPKM values and gene expression values. The red / green coloring in the SpliceSeq comparison view is done based on the p-value and SI value. Statistically significant differences between the samples that exceed a splice index threshold are considered as candidates for alternative expression.

SIExonSkip.PNG

As with single sample analysis, a splice event detection module identifies and classifies splice events, but it is searching for splice changes between the samples. The detector looks for alternate promoters, alternate terminators, exon skips, mutually exclusive exons, and premature stop events. A statistically significant difference in a splice junction that is above the SI threshold is used as a starting point for event detection. The detector then looks for corroborating evidence. For example, the alternate paths on an exon skip event should be coherent and change in opposite directions. In the example pictured above, exon 3 is included more often in heart tissue and is skipped more often in liver tissue. The skip path has a statistically significant, negative SI, and the splices on the exon inclusion path have statistically significant, positive SI values. The coherent picture from the four measurements indicates an exon skip event in liver tissue. Events that meet the detector’s criteria are classified and stored in comparison tables.

The next step in the comparison analysis is to look for biological impacts of alternative splicing. Each gene is evaluated by comparing protein features of the two samples. When samples are imported into SpliceSeq, a prediction of protein sequence(s) is made based on the reads aligned to the splice graph. Each protein sequence is then aligned to UniProt to determine the protein features present. That comparison identifies changes in the % of protein features. For example, if a DNA binding site were present in 100% of the protein sequences produced by a gene in sample 1 but only 50% of protein sequences produced by the same gene in sample 2, it would be identified as a feature that changed because of splicing differences. Any feature with a difference in inclusion percentage above threshold is stored in the comparison database and presented in SpliceSeq as a 'Uniprot Event'.


CompFeatures.PNG

In the above example, heart tissue has only one protein sequence for the SUMO1 gene, but liver tissue has two variants of the protein that are present in approximately equal proportions. The inclusion of the Ubiquitin-like domain and cross-link feature is 100% for heart tissue but only 50% for liver tissue. That distinction would cause the two protein features to be flagged as differentially expressed because of splicing changes between liver and heart tissue. Finally, an estimate of the magnitude of differential splicing is calculated and stored as the Alt Splicing RPKM. This value represents the transcript expression level of transcripts that have differences in splicing patterns between the two samples. This value can be compared to the overall gene level RPKM values to find genes that have a large proportion of transcripts with differential splicing patterns. The Alt Splicing RPKM is estimated using the largest individual splice event difference in observed versus expected OPKM values as follows:

AS RPKM.PNG

P1 is the average OPKM for a splice path in sample 1, P2 is the average OPKM value for the same path in sample 2. R1 and R2 are the gene level RPKM values from sample 1 and 2 respectively. An Alt Protein RPKM is calculated in the same way for the subset of splice events that alter gene protein product (in coding regions not untranslated regions). The Alt Protein RPKM can be used to find splice difference between the two samples that may have a large impact on the resultant protein sequence.

Groups

The ability to detect robust differences in splicing patters increases when groups of samples are compared. SpliceSeq supports this type of analysis by allowing users to construct groups of samples. This is done on the Group Samples panel by using sample annotations to create groups of similar samples (e.g. patients with phase I tumors). Prior to comparing groups, summary statistics for the group must be generated.

SpliceSeq group statistic generation involves traversing all the samples to establish summary statistics for every exon and splice for the group. These statistics are fairly simple. For each exon, and splice the average number of samples with reads, average number of reads, and average OPMK is calculated. If there are samples without reads for a give exon or splice, zero values for those samples are included in these averages so as not to skew averages of exons or splices with sparse coverage. Another important value that is calculated is the ratio of OPKM to gene expression value (RPKM). Gene expression levels can vary across the samples so the best statistic to use in looking for splice variation is an expression adjusted value or the ratio of OPKM to gene level RPKM. This ratio is calculated for all exons and splices and the standard deviation of this value is also stored to assist with future statistical tests.

Gene level RPKM values are calculated for the group by averaging the RPKM values of each sample. The predicted percentage of each protein isoform is summarized by averaging concentrations from the samples in the group.

Group level statistics for each gene are stored in the database. The group summary statistics can be examined using the Group View. Finally, the processes for identifying alternative splice events and protein impact described in the Single-Sample Analysis section above is performed for the group summary graphs.

Comparing Groups

After groups have been, they may be compared to identify statistically significant difference between the groups. The Compare Groups processing follows analytical steps identical to those described in the Compare Samples described above. In group comparison, the calculation of splice index for each exon and splice is simplified because the ratio OPKM to gene expression was already calculated as a group statistic so the group comparison SI is just the log2 value of the group 1 ratio divided by the group 2 ratio.

SpliceSeq Group SI.PNG

The group comparison also takes advantage of the increased sample size when performing statistical tests for differential expression of splices and exons. The p-value generated for the comparison of exons and splices is a t-test that uses the average ratio of OPKM to gene RPKM, the variance of the ratio values, and the number of samples from both groups.

The threshold used to identify splice events in group comparisons are lower than those used in single sample comparison because the increased numbers of samples support a more sensitive threshold. The SI threshold for differential splicing in group comparison is configurable but by default is set to 0.5 which is approximately equivalent to a 50% ratio difference.