DEDB :: Drosophila melanogaster Exon Database

Methodology

Overview

A flowchart of the processes used to create DEDB is found here. The following text describes the process in detail.

Why splicing graphs?

Splicing graphs provide an elegant solution to the problem of deducing the types and effects of alternative splicing across a large number of splicing variants (see splicing graph id 924). By condensing all splicing variants into a single graph, users can quickly determine the types and effects of alternative splicing. Each splicing variant is simply a path through the graph, and alternative splicing events are easily noticed as bifurcations in the graph. Biologists are most interested in the effects of alternative splicing, and the mapping of Pfam domains onto the splicing variants lets them assess these conveniently. For example, exon skipping that leads to the loss of a domain can be easily deduced from the splicing graph. Furthermore, the splicing graph representation provides a convenient way to create rules for classifying alternative splicing. These rules can detect multiple forms of alternative splicing within the same graph. As the biological bases of different alternative splicing events are distinct, the ability to create data sets based on a single type of alternative splicing event is invaluable.
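The idea that each variant is a path and each alternative splicing event is a bifurcation can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration, not DEDB's own data model: nodes are (start, end) tuples on the forward strand, and connections are (upstream node, downstream node) pairs. Note that enumerating paths may yield exon combinations that were never observed as annotated transcripts.

```python
def enumerate_paths(nodes, connections):
    """Return every path from an initiation node to a termination node."""
    downstream = {}
    has_upstream = set()
    for up, down in connections:
        downstream.setdefault(up, []).append(down)
        has_upstream.add(down)

    paths = []

    def walk(node, path):
        next_nodes = sorted(downstream.get(node, []))
        if not next_nodes:          # no next connection: the path is complete
            paths.append(path)
            return
        for nxt in next_nodes:
            walk(nxt, path + [nxt])

    # Initiation nodes are those with no previous connection.
    for start in sorted(n for n in nodes if n not in has_upstream):
        walk(start, [start])
    return paths


def bifurcations(connections):
    """Return nodes that have more than one outgoing connection."""
    outgoing = {}
    for up, down in connections:
        outgoing.setdefault(up, set()).add(down)
    return sorted(n for n, downs in outgoing.items() if len(downs) > 1)


# Hypothetical three-exon graph in which the middle exon can be skipped.
a, b, c = (1, 100), (201, 300), (401, 500)
nodes = {a, b, c}
connections = {(a, b), (b, c), (a, c)}

paths = enumerate_paths(nodes, connections)   # [[a, b, c], [a, c]]
forks = bifurcations(connections)             # [a]
```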

Source Data

The primary source of data is the Drosophila melanogaster genome annotations. Domain data are obtained from Pfam in the form of profile HMMs.

Splicing graph construction

The transcripts found in the genome annotations are clustered by their genomic position. Each cluster is a set of transcripts with overlapping genomic positions on the same genomic strand. Exons having the same start and end genomic positions are then merged into a single node. Likewise, introns having the same start and end genomic positions are merged into a single connection. Nodes are then linked by connections to form the final splicing graph. The constructed splicing graphs are stored in MySQL (a relational database).

The figure below shows the relationship between multiple sequence alignment representation, schematic representation and splicing graph representation:

Note that multiple exons having identical start and end positions are merged into nodes. This is the same for the introns, which are merged into connections. The splicing graph representation shows the alternative splicing in an intuitive and informative manner.
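The construction steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DEDB implementation: the transcript dictionaries and their keys ("chrom", "strand", "exons") are hypothetical, exons are sorted inclusive (start, end) tuples, and a connection is stored as an (upstream node, downstream node) pair, so the intron it represents runs from the upstream end + 1 to the downstream start - 1.

```python
from itertools import groupby


def transcript_span(transcript):
    """Genomic span of a transcript: start of first exon, end of last exon."""
    exons = transcript["exons"]
    return exons[0][0], exons[-1][1]


def cluster_transcripts(transcripts):
    """Group transcripts whose spans overlap on the same chromosome and strand."""
    def locus(t):
        return (t["chrom"], t["strand"])

    ordered = sorted(transcripts, key=lambda t: (locus(t), transcript_span(t)))
    clusters = []
    for _, group in groupby(ordered, key=locus):
        current, current_end = [], None
        for t in group:
            start, end = transcript_span(t)
            if current and start > current_end:   # gap: close the cluster
                clusters.append(current)
                current, current_end = [], None
            current.append(t)
            current_end = end if current_end is None else max(current_end, end)
        clusters.append(current)
    return clusters


def build_splicing_graph(cluster):
    """Merge identical exons into nodes and identical introns into connections."""
    nodes, connections = set(), set()
    for t in cluster:
        exons = t["exons"]
        nodes.update(exons)                        # identical exons collapse
        for up, down in zip(exons, exons[1:]):
            connections.add((up, down))            # identical links collapse
    return nodes, connections


# Hypothetical annotations: two overlapping transcripts and one distant one.
transcripts = [
    {"chrom": "2L", "strand": "+", "exons": [(1, 100), (201, 300), (401, 500)]},
    {"chrom": "2L", "strand": "+", "exons": [(1, 100), (401, 500)]},
    {"chrom": "2L", "strand": "+", "exons": [(1000, 1100)]},
]
clusters = cluster_transcripts(transcripts)
nodes, connections = build_splicing_graph(clusters[0])
```

Using sets for nodes and connections makes the merging step implicit: an exon or link seen in several transcripts is stored only once.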

Alternative splicing event classification

Rules are then applied to the splicing graphs to detect specific alternative splicing events. For example, alternative acceptor sites are detected by identifying a set of overlapping nodes that have different acceptor site positions and are connected to a common upstream node. This approach allows multiple splicing events to be detected in the same splicing graph. The detected alternative splicing events are then stored in MySQL.

The figure below shows the various types of alternative splicing events classified as well as the rules used to detect them (click on the figure for a larger image):

Alternative transcriptional start sites (TSS) exist when more than one node has no previous connection (indicating that it is an initiation node) and each such node has a unique start position. The start position of an initiation node is the transcriptional start site, so multiple TSS imply alternative transcriptional start sites. Click here to see an example. Nodes 3, 5, 6, 7 and 9 contain alternative transcriptional start sites.

Alternative transcriptional termination sites (TTS) occur when more than one node has no next connection (indicating that it is a termination node) and each such node has a unique end position. The end position of a termination node is the transcriptional termination site, so multiple TTS indicate alternative transcriptional termination sites. Click here to see an example. Nodes 7, 10 and 13 have alternative transcriptional termination sites.
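The TSS and TTS rules can be sketched in a few lines of Python. The graph representation is an assumption for illustration (nodes as (start, end) tuples on the forward strand, connections as (upstream, downstream) node pairs), and the function names are hypothetical.

```python
def alternative_tss(nodes, connections):
    """Return distinct start positions of initiation nodes if more than one exists."""
    has_previous = {down for _, down in connections}
    starts = sorted({start for (start, end) in nodes
                     if (start, end) not in has_previous})
    return starts if len(starts) > 1 else []


def alternative_tts(nodes, connections):
    """Return distinct end positions of termination nodes if more than one exists."""
    has_next = {up for up, _ in connections}
    ends = sorted({end for (start, end) in nodes
                   if (start, end) not in has_next})
    return ends if len(ends) > 1 else []


# Hypothetical graph with two initiation nodes and two termination nodes.
nodes = {(1, 100), (31, 100), (201, 300), (201, 350)}
connections = {
    ((1, 100), (201, 300)),
    ((31, 100), (201, 300)),
    ((1, 100), (201, 350)),
}
tss = alternative_tss(nodes, connections)   # [1, 31]
tts = alternative_tts(nodes, connections)   # [300, 350]
```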

Alternative initiation exons occur when multiple initiation nodes (having no previous connections) are found that have unique end positions. The rationale is that the start positions of initiation nodes are frequently incorrect because the 5' UTR (untranslated region) is rarely completely sequenced. Initiation nodes that differ only in their start position therefore cannot be reliably judged to be different, so we have also used the end position as a criterion. Furthermore, the 5' end of the initiation node is not recognized by the splicing machinery; only the 3' end (donor site) is recognized. Click here to see an example. Nodes 1 and 2 are alternative initiation exons.

The same reasoning applies to alternative termination exons, except that the positions are reversed: termination nodes are compared by their start positions. Click here for an example. Nodes 7, 10 and 15 are alternative termination exons.
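A minimal Python sketch of the initiation and termination exon rules follows. It groups initiation nodes by end position and termination nodes by start position, so nodes that differ only at the unreliable outer boundary fall into the same group. The representation (tuple nodes on the forward strand, node-pair connections) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def group_terminal_exons(nodes, connections, kind):
    """Group initiation nodes by end, or termination nodes by start.

    Returns an empty dict unless at least two distinct groups exist,
    i.e. unless alternative initiation/termination exons are present.
    """
    has_previous = {down for _, down in connections}
    has_next = {up for up, _ in connections}
    groups = defaultdict(list)
    for node in sorted(nodes):
        start, end = node
        if kind == "initiation" and node not in has_previous:
            groups[end].append(node)      # splice-recognized 3' end (donor site)
        elif kind == "termination" and node not in has_next:
            groups[start].append(node)    # splice-recognized 5' end (acceptor site)
    return dict(groups) if len(groups) > 1 else {}


# Hypothetical graph: three initiation nodes, two of which share a donor site.
last = (201, 300)
nodes = {(1, 100), (31, 100), (150, 180), last}
connections = {((1, 100), last), ((31, 100), last), ((150, 180), last)}

initiation = group_terminal_exons(nodes, connections, "initiation")
termination = group_terminal_exons(nodes, connections, "termination")
```

Here (1, 100) and (31, 100) count as one initiation exon because they share the end position 100, while (150, 180) is an alternative to them.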

Alternative acceptor sites are found in a set of overlapping nodes (>1 node) that have differing start positions and are linked to a common node. The nodes in the set must overlap; otherwise they would be classified as cassette exons. Click here for an example. Nodes 1 and 3 contain alternative acceptor sites. They are both linked to node 2.

The same goes for alternative donor sites, which are found in overlapping nodes that have differing end positions and are linked to a common node. Click here to see an example. Nodes 5 and 10 contain alternative donor sites. They are linked to node 4. Note that node 11 illustrates the requirement that the set of nodes must overlap. From the graph, it is quite clear that node 11 is a cassette exon.
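The acceptor and donor rules can be sketched as a pairwise check in Python. This is a simplification under stated assumptions: nodes are inclusive (start, end) tuples on the forward strand, connections are (upstream, downstream) node pairs, and events are reported as pairs of nodes rather than as complete overlapping sets.

```python
from collections import defaultdict
from itertools import combinations


def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def alternative_acceptors(connections):
    """Overlapping downstream nodes with different starts, sharing an upstream node."""
    by_upstream = defaultdict(set)
    for up, down in connections:
        by_upstream[up].add(down)
    events = []
    for up, downs in by_upstream.items():
        for a, b in combinations(sorted(downs), 2):
            if overlaps(a, b) and a[0] != b[0]:
                events.append((up, a, b))
    return sorted(events)


def alternative_donors(connections):
    """Overlapping upstream nodes with different ends, sharing a downstream node."""
    by_downstream = defaultdict(set)
    for up, down in connections:
        by_downstream[down].add(up)
    events = []
    for down, ups in by_downstream.items():
        for a, b in combinations(sorted(ups), 2):
            if overlaps(a, b) and a[1] != b[1]:
                events.append((down, a, b))
    return sorted(events)


# Hypothetical graph: (201, 300) and (221, 300) differ only in acceptor site;
# (401, 500) is linked to the same upstream node but does not overlap them.
first = (1, 100)
connections = {
    (first, (201, 300)),
    (first, (221, 300)),
    (first, (401, 500)),
    ((201, 300), (401, 500)),
    ((221, 300), (401, 500)),
}
acceptors = alternative_acceptors(connections)
donors = alternative_donors(connections)
```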

Cassette exons are, by definition, internal exons that are differentially included in the various splicing isoforms of a gene. In terms of splicing graphs, the rule requires a cassette exon to be an internal node whose start and end positions fall within a connection (an intron). The fact that the node lies within a connection in some other splicing isoform implies that it is skipped in that isoform, which fulfils the definition. Click here for an example. Node 1 is a classic example of a cassette exon.
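The cassette exon rule translates directly into a containment test. The following Python sketch assumes nodes are inclusive (start, end) tuples on the forward strand and connections are (upstream, downstream) node pairs, so the intron of a connection lies strictly between the upstream end and the downstream start. The function name is hypothetical.

```python
def cassette_exons(nodes, connections):
    """Return internal nodes that lie entirely within some connection's intron."""
    has_previous = {down for _, down in connections}
    has_next = {up for up, _ in connections}
    found = []
    for node in sorted(nodes):
        if node not in has_previous or node not in has_next:
            continue                      # not an internal node
        start, end = node
        if any(up[1] < start and end < down[0] for up, down in connections):
            found.append(node)            # skipped by at least one connection
    return found


# Hypothetical graph: the middle exon is included in one isoform
# and skipped by the direct connection between the outer exons.
a, b, c = (1, 100), (201, 300), (401, 500)
skipped = cassette_exons({a, b, c}, {(a, b), (b, c), (a, c)})   # [(201, 300)]
```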

Intron retention, on the other hand, occurs when an intron is not spliced out and is instead retained as part of an exon. The rule based on splicing graphs requires a connection whose start and end positions fall within a node. The definition is fulfilled because the connection (intron) is found as part of a node (exon). Click here for an example. The connection between nodes 4 and 5 is clearly retained in node 8.
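Intron retention is the mirror image of the cassette exon test: here the connection must fall within a node. A minimal Python sketch, assuming inclusive (start, end) tuple nodes on the forward strand and (upstream, downstream) node-pair connections, with the intron taken to span upstream end + 1 to downstream start - 1:

```python
def retained_introns(nodes, connections):
    """Return (connection, node) pairs where the intron lies within the node."""
    events = []
    for up, down in sorted(connections):
        intron_start, intron_end = up[1] + 1, down[0] - 1
        for node in sorted(nodes):
            if node[0] <= intron_start and intron_end <= node[1]:
                events.append(((up, down), node))
    return events


# Hypothetical graph: the intron between (1, 100) and (201, 300)
# is retained in the single long exon (1, 300).
nodes = {(1, 100), (201, 300), (1, 300)}
connections = {((1, 100), (201, 300))}
retained = retained_introns(nodes, connections)
```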

Domain identification

Protein sequences are searched for Pfam domains using the hmmpfam program from the HMMER suite of programs. Detected domains are then stored into MySQL for use in the Splicing Graph Viewer.
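As an illustration of the search step, the sketch below invokes hmmpfam from Python. In HMMER 2, hmmpfam takes an HMM database followed by a sequence file. The file names are hypothetical, no additional options are assumed, and parsing of the output and loading into MySQL are not shown.

```python
import shutil
import subprocess


def hmmpfam_command(hmm_database, protein_fasta):
    """Build the argument list: hmmpfam <HMM database> <sequence file>."""
    return ["hmmpfam", hmm_database, protein_fasta]


def run_hmmpfam(hmm_database, protein_fasta):
    """Run hmmpfam and return its raw text report."""
    if shutil.which("hmmpfam") is None:
        raise FileNotFoundError("hmmpfam (HMMER 2) was not found on PATH")
    result = subprocess.run(
        hmmpfam_command(hmm_database, protein_fasta),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Hypothetical file names, for illustration only.
command = hmmpfam_command("Pfam_ls", "proteins.fasta")
```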

Splicing Graph Viewer

The Splicing Graph Viewer takes data from MySQL and renders it as an HTML page suitable for viewing in any modern web browser. The Apache web server is used to serve the HTML pages to web browser clients.

XML downloads

Data contained in the database are converted into XML files and provided to users for download. An XML schema describing the XML files that contain the splicing graph data is also available. The schema allows the XML files to be parsed and validated.