Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P
again to switch presenter notes off
Press C
to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.
Useful when presenting.
Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P
again to switch presenter notes off
Press C
to create a new window where the same presentation will be displayed.
This window is linked to the main window. Changing slides on one will cause the
slide to change on the other.
Useful when presenting.
Before diving into this slide deck, we recommend you to have a look at:
Comparing biological sequences can be seen as an instance of common string matching problem but with some particularities
Sequence comparison is a core problem of bioinformatics that allows us to study:
Widely used in several fields!
So what is the problem?
Sequence type | Approx. length |
---|---|
Gene | ~1,000 to 10,000 bp |
Bacteria genome | ~106 bp |
Mammalian chr. | ~108 bp |
Mammalian genome | ~2 * 109 bp |
Plant genome | ~3 * 109 bp |
Large plant genomes | ~2 * 1010 bp |
Amoeba genome | ~6 * 1011 bp |
Dot-matrix comparison (no alignments)
Alignment-free methods
Aligning methods
Alignments have to take into account their surroundings!
Use dynamic programming to compute an accumulated score
The score of each cell comes from either the diagonal (no penalty but match/mismatch score) or from a row or column, which is a gap in either the query or reference sequence ("jumping" gap penalty is included then)
Match = 4
But what if the two sequences are not toy examples?
Worst case scenario this algorithm takes quadratic time and space
These are called High-scoring Segment Pairs and do not typically include gaps
This approach also enables to compare distant sequences with many evolutionary rearrangments since fragments do not require to be close to each other
Sequence comparison is inherently hard because it requires quadratic time and space to produce optimal alignments
Especially when there are more (and longer) genome sequences available everyday (Ensembl blog)
Heuristic approaches based on seeds are required to cope with the complexity of dynamic programming methods
Tools are needed to quickly compare sequences while providing insightful information
Now that we have a general background on sequence alignment, lets jump to the software that we will use in this tutorial
Pairwise genome comparison software which reduces execution time in pairwise and multiple genome comparison studies
It uses an 'out of core' strategy, i.e. it runs on secondary memory (disk) as opposed to RAM
Faster than state-of-the-art software especially for longer sequences such as chromosomes
A dictionary is computed for each sequence containing the positional information of each word (possible seed)
Once each dictionary is sorted, perfect matches between words produce a set of seeds (alignment candidates)
Seeds are then sorted (by their diagonal position xstart - ystart) and filtered
Finally, the seeds are extended (up and downstream) to generate a set of High-scoring Segment Pairs (HSPs)
In a static distribution cores can become idle if they finish before other cores → resources are wasted
GECKO uses MPI to distribute the workload dynamically based on the length
When a core (worker) has finished executing a task (comparison) the master node broadcasts more work
This reduces overall makespan time because of workload balance
GECKO also employs a second level of parallelism besides using MPI to distribute data dynamically
It consists on sorting the initial seeds found (which is the most time-consuming step) using shared memory threads (pthreads and openMP)
Each core uses up to t
threads to sort the dictionaries containing the seeds
GeckoMGV is a Web-based application aimed to visualize results from multiple genome comparison allowing deep data analysis. It features:
User friendly interface
Multilayer for displaying and overlaying comparisons
Interactive zooming and filtering
External and proprietary post-processing services
Adaptable from equivalent software (e.g. MAUVE)
Dendrograms and Multiple Sequence Alignment visualization
A typical exercise would include running a sequence comparison with GECKO
Visualizing the alignments with GECKO-MGV
Interactively selecting regions of interest and performing post-processing
GECKO is suitable for chromosome comparisons
But all vs all chromosomes between any two species generate usually over 400 comparisons
This results in extremely large computation times
Comparisons tend to be full of repetitions in large mammalians, let alone plants!
State-of-the-art methods get stuck handling large number of repetitions or resulting dotplots are too noisy
Can we extract any information from raw seeds (hits) without having to generate fragments (slower)?
CHROMEISTER (CHROmosome MEISTER) is an ultra fast heuristic approach to computing extremely large pairwise genome comparison in desktop PCs
Hybrid indexing approach
Probabilistic filtering
Apply a kernel to improve visualization
Heuristically select unique and inexact seeds:
Are these two words equal? Under what considerations?
Probabilistic filtering
Kernel and score calculation
draw=l−1∑itaxicab(max(Hm(i),Hm(i+1)))
d=drawl2
Also used for analysis of contig/scaffold reordering
(Right) Pairwise genome comparison between Aegilops tauschii (3.2 Gbps) and Triticum aestivum (4.2 Gbps) in under 16 minutes using 1 core and 1 GB of RAM
Sequence comparison is difficult due to its computational complexity and the growth of databases
There are several approaches to sequencen comparison, and we need to know which one to use (DP-based, alignment-free, seeds, etc.)
We have learnt the internals regarding GECKO, GECKO-MGV and CHROMEISTER to work with single chromosome comparisons and/or exhaustive searchs, the post-processing and for larger/noisy experiments, respectively
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Author(s) |
|
Reviewers |
|
Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.
Before diving into this slide deck, we recommend you to have a look at:
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |