
Methodology


Description of method

A detailed description of the methodology is available as Methodolgy.pdf. Flowcharts and pseudocode are available as flowchart.gif and Pseudocodes.pdf. The selection of the wordsize is described below, together with results on how MGAlign's performance correlates with genomic sequence length.


Selection of wordsize

MGAlign’s performance was first tested on a random subset of the data using a range of wordsizes (Z), as shown in the table below. Wordsizes from 10 to 30 were used to assess the relationship between the wordsize and the sensitivity and specificity of the first search step. The genome sequence used was 45 Mbp long and the number of mRNA sequences was 50 (a randomly selected subset of the data used for the comparison). The total number of correctly identified hits should therefore be 100, two for each mRNA sequence. The results agree with expectation: as Z increases, specificity increases. Sensitivity is 100% in all cases, so the default Z value was chosen to give the lowest computational time at high specificity. A very large wordsize, however, is likely to leave the algorithm unable to locate matches correctly when the sequences contain errors. In view of these considerations, Z = 20 was selected as the default to balance computational time against specificity. Experienced users can modify the default wordsize if necessary.
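The sensitivity of exact word matching to sequence errors can be illustrated with a small sketch. This is not MGAlign's implementation: the function `word_hits`, the synthetic random genome and the 40-base query are hypothetical and serve only to show why a very large wordsize can miss a true match once the query contains an error.

```python
import random

def word_hits(genome: str, query: str, z: int) -> list[tuple[int, int]]:
    """Return (query_pos, genome_pos) pairs where a length-z word matches exactly."""
    # Index every length-z word of the genome by its start position.
    index: dict[str, list[int]] = {}
    for g in range(len(genome) - z + 1):
        index.setdefault(genome[g:g + z], []).append(g)
    # Look up every length-z word of the query in that index.
    hits = []
    for q in range(len(query) - z + 1):
        for g in index.get(query[q:q + z], []):
            hits.append((q, g))
    return hits

# Hypothetical data: a 5000-base random "genome" (fixed seed for repeatability).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))

# A 40-base query copied from position 1000, plus a copy carrying one
# substitution at its midpoint to mimic a sequencing error.
clean = genome[1000:1040]
wrong_base = "A" if clean[20] != "A" else "C"
noisy = clean[:20] + wrong_base + clean[21:]

for z in (10, 20, 30):
    n_clean = len(word_hits(genome, clean, z))
    n_noisy = len(word_hits(genome, noisy, z))
    print(f"Z={z:2d}  hits for clean query: {n_clean:3d}  hits for noisy query: {n_noisy:3d}")
```

In this toy setting every 30-base word of the noisy query spans the substituted base, so none of them can match at the true locus, whereas shorter words on either side of the error still can.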

Genome Sequence (Mbp)   Wordsize, Z (bp)   Time Taken (s)   Identified Hits   Correct Hits   Sensitivity   Specificity
45                      10                 6798.62          2037              150            100.0%        7.36%
45                      12                 761.69           857               107            100.0%        12.49%
45                      14                 648.09           302               101            100.0%        33.44%
45                      16                 616.58           179               101            100.0%        56.42%
45                      18                 631.79           156               100            100.0%        64.10%
45                      20                 598.22           126               100            100.0%        79.37%
45                      22                 595.83           121               100            100.0%        82.64%
45                      24                 595.06           121               100            100.0%        82.64%
45                      26                 595.26           119               100            100.0%        84.03%
45                      28                 595.00           117               100            100.0%        85.47%
45                      30                 595.64           117               100            100.0%        85.47%

A random set of 50 mRNA sequences was aligned against a 45 Mbp fragment of human chromosome 22 genomic sequence. For each Z value (column 2), the table reports the computational time in seconds (column 3), the number of successfully extended hits identified, which define the alignment windows (column 4), the number of correct hits based on annotation information (column 5), the sensitivity (column 6; defined as the ratio of the number of correctly identified hits to the number of hits provided by the annotations) and the specificity (column 7; the ratio of the number of correctly identified hits to the number of identified hits).
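The two ratios defined above can be recomputed from the tabulated counts with a short sketch. The hit counts are copied from the table; the function names are illustrative. One assumption is made: for Z of 16 or less the table lists more than 100 correct hits yet reports 100.0% sensitivity, so the sketch caps sensitivity at 100%. Note also that specificity as defined here (correct hits over identified hits) is the quantity often called precision elsewhere.

```python
# 50 mRNA sequences x 2 annotated hits each, as stated in the text.
EXPECTED_HITS = 100

# (wordsize Z, identified hits, correct hits), copied from the table.
ROWS = [
    (10, 2037, 150), (12, 857, 107), (14, 302, 101), (16, 179, 101),
    (18, 156, 100), (20, 126, 100), (22, 121, 100), (24, 121, 100),
    (26, 119, 100), (28, 117, 100), (30, 117, 100),
]

def specificity(identified: int, correct: int) -> float:
    """Correct hits as a percentage of all identified hits."""
    return 100.0 * correct / identified

def sensitivity(correct: int, expected: int = EXPECTED_HITS) -> float:
    """Correct hits as a percentage of annotated hits.

    Assumption: capped at 100%, because the table reports 100.0% even
    where the correct-hit count exceeds the number of annotated hits.
    """
    return 100.0 * min(correct, expected) / expected

for z, identified, correct in ROWS:
    print(f"Z={z:2d}  sensitivity={sensitivity(correct):5.1f}%  "
          f"specificity={specificity(identified, correct):5.2f}%")
```

Rounded to two decimal places, the computed specificity values agree with column 7 of the table.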


MGAlign performance with genomic length correlation

We ran a separate test with the same 50 randomly selected mRNA sequences (used above in the selection of wordsize), using genomic sequences of different lengths. For this limited set, the computational savings are directly proportional to the length of the genomic sequence, as shown in the following figure. All three programs show a linear relationship between the time required and the genomic sequence length. The saving in computational time achieved by MGAlign increases substantially with the length of the genomic sequence in relation to sim4, with modest gains in comparison to Spidey. The speed enhancement gained by MGAlign may seem small (2.3-2.4 times faster than sim4 or Spidey); however, when large numbers of these alignments are performed, even a small performance gain is amplified.

Dependence of computational time required by MGAlign, sim4 and Spidey on the length of genomic sequence. The y-axis of the plot shows the average time required (in sec) by the programs while the x-axis shows the length of the genomic sequence used, with the data tabulated below. A total of 50 randomly selected mRNA sequences from the dataset used in the comparison were used for this plot.

 

National University of Singapore
 Department of Biochemistry