Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P again to switch presenter notes off.
Press C to create a new window where the same presentation will be displayed. This window is linked to the main window: changing slides in one will cause the slide to change in the other. Useful when presenting.
Before diving into this slide deck, we recommend you to have a look at:
But if we introduce gaps and allow for some mismatches in bases, this matches up pretty well.
Reference: . . . A A - C G C C T T . . .
Alignment:       | : - : | | | | |
Read:            A G G G G C C T T
Legend: | = match, : = mismatch, - = gap
Some reads may map to multiple locations
We want a way to determine the best alignment if none are perfect matches.
Example (with affine gap penalty)
Final score for entire alignment in this example is 19
These reward and penalty values are just examples and will vary
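As a rough illustration of how such a score can be computed, here is a minimal Python sketch. The function name `score_alignment` and all reward and penalty values are assumptions chosen for this example; they are not the values used on the slide and do not reproduce its score of 19. The sketch also assumes one common affine convention: the first column of a gap costs `gap_open`, and every further column of the same gap costs `gap_extend`.

```python
def score_alignment(ref_row, read_row, match=2, mismatch=-1,
                    gap_open=-3, gap_extend=-1):
    """Score a gapped pairwise alignment with an affine gap penalty.

    ref_row and read_row are equal-length strings in which '-' marks a gap.
    All reward and penalty values are illustrative defaults.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap = None  # which row held the gap in the previous column, if any
    for r, q in zip(ref_row, read_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot have gaps in both rows")
        if r == "-" or q == "-":
            gap_row = "ref" if r == "-" else "read"
            # Opening a gap is expensive, extending the same gap is cheaper.
            score += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            score += match if r == q else mismatch
            prev_gap = None
    return score


# The small reference/read alignment shown earlier in this deck:
print(score_alignment("AA-CGCCTT", "AGGGGCCTT"))  # 7
```

With these assumed values, the seven matches contribute 14, the two mismatches -2 and the single gap -3, giving 7. Changing the parameters changes which alignment scores best, which is why different tools can prefer different alignments.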
More information about mapping algorithms: 10.1089/cmb.2012.0022
Many more complexities may be considered, and different tools make different choices.
Transitions are more likely than transversions to occur in real sequences, so a scoring scheme may give them a lower penalty.
Transitions are interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): they therefore involve bases of similar shape.
Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.
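The distinction can be sketched in a few lines of Python. The names `substitution_type` and `substitution_penalty` are hypothetical, and the penalty values are assumptions for illustration only; real mappers choose their own values.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases


def substitution_type(ref_base, alt_base):
    """Classify a base change as 'match', 'transition' or 'transversion'."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    for base in (ref_base, alt_base):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unsupported base: {base!r}")
    if ref_base == alt_base:
        return "match"
    # Same ring class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"


def substitution_penalty(ref_base, alt_base, transition=-1, transversion=-2):
    """Return an illustrative mismatch penalty (0 for a match)."""
    kind = substitution_type(ref_base, alt_base)
    return {"match": 0, "transition": transition, "transversion": transversion}[kind]


print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "C"))  # transversion
```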
Suppose we want to map this read (bottom) to this reference sequence (top)
This is one possibility, is it the only one?
This is also a possible alignment. Not easy to say which is better.
And a third option
Reference: AAA CAGTGA GAA
Observed: AAA TCTCT GAA

Maybe like this?
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA

Or this?
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA

Or..?
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

What about this?
AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
There is no one right way to do alignment
Mapping is a non-trivial problem!
Reference: AAA CAGTGA GAA
Observed: AAA TCTCT GAA

Tool: Novoalign
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA

Tool: Ssaha2
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA

Tool: BWA
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

Tool: Complete Genomics
AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
We didn't just make these up: real aligners gave these different results.
Reference: AAA CAGTGA GAA
Observed: AAA TCTCT GAA

AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Variant calls: ins T, del AG, sub GA -> CT

AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
Variant calls: del C, sub AG -> TC, sub GA -> CT

AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA
Variant calls: snp C -> T, del A, snp G -> C, sub GA -> CT

AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
Variant calls: del CAGTGA, ins TCTCT
Important: Mapping can affect downstream analysis!
These different mappings lead to different variant calls, and it is hard to tell that they are equivalent.
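One simple way to convince yourself that the four alignments describe the same pair of sequences is to strip the gap characters and compare what is left. The sketch below does this in Python; the helper names `ungap` and `same_sequences` are hypothetical, and the alignment rows are copied from the slides above.

```python
def ungap(row):
    """Remove gap characters from one row of an alignment."""
    return row.replace("-", "")


# (reference row, observed row) for each aligner, as shown on the slides
alignments = {
    "Novoalign": ("AAA-CAGTGAGAA", "AAATC--TCTGAA"),
    "Ssaha2": ("AAACAGTGAGAA", "AAA-TCTCTGAA"),
    "BWA": ("AAACAGTGAGAA", "AAAT-CTCTGAA"),
    "Complete Genomics": ("AAACAGTGA-----GAA", "AAA------TCTCTGAA"),
}


def same_sequences(alignments):
    """True if every alignment spells the same reference and observed sequence."""
    ungapped = {(ungap(ref), ungap(obs)) for ref, obs in alignments.values()}
    return len(ungapped) == 1


print(same_sequences(alignments))           # True
print(ungap(alignments["Novoalign"][0]))    # AAACAGTGAGAA
print(ungap(alignments["Novoalign"][1]))    # AAATCTCTGAA
```

The check only shows that the underlying sequences are identical. Deciding whether two sets of variant calls are equivalent is harder in general and is usually left to dedicated normalisation tools.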
Lego time! Who wants to volunteer?
Or try this online sequence alignment game:
https://tinyurl.com/sequence-alignment
Can have learners play around with this alignment game now
Or use Lego bricks, each nucleotide a different colour
This improves our mapping
For example for multi-mapped reads, or repeats (next slide)
In the case of repeats, a single-end read alone would not have been enough for unique mapping.
But with the additional information provided by the paired-end protocol (distance to mate), this can now be resolved.
Unexpected mapping distance between two reads in a pair may indicate a variant.
The exact location of the variant is unknown unless more reads cover the area.
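A minimal sketch of how such pairs could be flagged is shown below. The function name `flag_unexpected_pairs`, the expected distance of 300 bases, the tolerance of 100 bases and the example coordinates are all assumptions for illustration; in practice the expected distance depends on the library preparation.

```python
def flag_unexpected_pairs(pairs, expected_distance=300, tolerance=100):
    """Return pairs whose mapped distance falls outside the expected range.

    pairs: iterable of (pair_name, forward_start, reverse_end) using 1-based
    reference coordinates, with both reads mapped to the same sequence.
    """
    flagged = []
    for name, forward_start, reverse_end in pairs:
        distance = reverse_end - forward_start + 1
        if distance > expected_distance + tolerance:
            flagged.append((name, distance, "too_far"))
        elif distance < expected_distance - tolerance:
            flagged.append((name, distance, "too_close"))
    return flagged


example = [
    ("pair_a", 1000, 1299),  # 300 bases apart: as expected
    ("pair_b", 5000, 5899),  # 900 bases apart: further than expected
    ("pair_c", 9000, 9099),  # 100 bases apart: closer than expected
]
print(flag_unexpected_pairs(example))
# [('pair_b', 900, 'too_far'), ('pair_c', 100, 'too_close')]
```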
FAQ: "What about mate-pair sequencing?"
When you have paired-end data, you will usually get 2 files, named with _1/_2 or _R1/_R2.
Pairing is also visible in the read names: /1 and /2 at the end, or 1: and 2: in the read ID.
Sometimes data can be in a single interleaved file (aka interlaced).
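The two read-name conventions can be recognised with a small amount of pattern matching. The Python sketch below is only an illustration: `split_read_id` is a hypothetical helper, the example read names are made up, and real data may use other conventions.

```python
import re

# "name/1" or "name/2" at the very end of the header
_SUFFIX_STYLE = re.compile(r"^(?P<name>\S+)/(?P<mate>[12])$")
# "name 1:..." or "name 2:..." (mate number starts the second field)
_FIELD_STYLE = re.compile(r"^(?P<name>\S+)\s+(?P<mate>[12]):")


def split_read_id(header):
    """Return (read_name, mate_number) from a FASTQ header line.

    mate_number is 1 or 2, or None if neither convention is recognised.
    """
    header = header.lstrip("@").strip()
    for pattern in (_SUFFIX_STYLE, _FIELD_STYLE):
        found = pattern.match(header)
        if found:
            return found.group("name"), int(found.group("mate"))
    return header, None


print(split_read_id("@read42/1"))                # ('read42', 1)
print(split_read_id("@read42 2:N:0:ATCACG"))     # ('read42', 2)
print(split_read_id("@read42"))                  # ('read42', None)
```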
Most tools blindly assume that the first read in the forward file is paired with the first read in the reverse file, and so on. Otherwise it would be too slow.
When trimming and filtering, if a read is removed from one file, its mate must be removed from the other one too!
Always trim together in paired-end mode!
The Nth read in the forward file is paired with the Nth read in the reverse file.

mysample_R1.fastq
@PAIR-1 forward
GGGTGATGGCCGCTGCCGATGGCGTCAAAT
+
))%255CCF>>>>>>CCCCCCC65`IIII%
@PAIR-2 forward
GATTTGGGGTTCAAAGCAGTATCGATCAA
+
!''3((((^^d+))%%%++)(%%%%).1)
@PAIR-3 forward
TCGCACTCAACGCCCTGCATATGACAAGAC
+
A64;##=#B9=AAAAAAAAAA9#:AB95%^

mysample_R2.fastq
@PAIR-1 reverse
AAGTTACCCTTAACAACTTAAGGGTTTTCA
+
fffddffeedBIABa)^%YBBBRTT\^d
@PAIR-2 reverse
AGCAGAAGTCGATGATAATACGCGTCGTTT
+
IIIIIII^^IIId`?III%IIIGII>IIII
@PAIR-3 reverse
AATCCATTTGTTCAACTCACAGTTTACCGT
+
9C;=;=<9@4868>9:67AA<9>65<=>59
If a read in one file gets removed (e.g. because it is below the quality threshold), but its mate is not, the pairing between the two files is no longer correct.
If one half of pair is trimmed, the other
FAQ: "Why not look at read names to determine pairing?"
The Nth read in the forward file is assumed to pair with the Nth read in the reverse file.

mysample_R1.fastq
@PAIR-1 forward
GGGTGATGGCCGCTGCCGATGGCGTCAAAT
+
))%255CCF>>>>>>CCCCCCC65`IIII%
@PAIR-3 forward
TCGCACTCAACGCCCTGCATATGACAAGAC
+
A64;##=#B9=AAAAAAAAAA9#:AB95%^
@PAIR-4 forward
AAACTTCGTAGGTCCATTTGACAGCGTGCA
+
6664%!!III^(=%3333^^d^d:#32333

mysample_R2.fastq
@PAIR-1 reverse
AAGTTACCCTTAACAACTTAAGGGTTTTCA
+
fffddffeedBIABa)^%YBBBRTT\^d
@PAIR-2 reverse
AGCAGAAGTCGATGATAATACGCGTCGTTT
+
IIIIIII^^IIId`?III%IIIGII>IIII
@PAIR-3 reverse
AATCCATTTGTTCAACTCACAGTTTACCGT
+
9C;=;=<9@4868>9:67AA<9>65<=>59
By removing one read (PAIR-2 forward, shown in yellow on the slide) only from the forward reads file, but leaving its mate in the other file, downstream tools now assume an incorrect pairing.
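The broken pairing in this example can be detected, and repaired, by comparing read names. The Python sketch below works on lists of read names rather than on real files; `first_mismatch` and `keep_complete_pairs` are hypothetical helpers, not functions of any particular trimming tool, and holding all names in memory like this is an assumption that only suits small examples.

```python
def first_mismatch(forward_names, reverse_names):
    """Return (position, forward_name, reverse_name) of the first broken pair.

    Positions are 1-based. Returns None if both lists pair up correctly.
    """
    for position, (fwd, rev) in enumerate(zip(forward_names, reverse_names), start=1):
        if fwd != rev:
            return position, fwd, rev
    if len(forward_names) != len(reverse_names):
        # One file simply has extra reads at the end.
        position = min(len(forward_names), len(reverse_names)) + 1
        fwd = forward_names[position - 1] if position <= len(forward_names) else None
        rev = reverse_names[position - 1] if position <= len(reverse_names) else None
        return position, fwd, rev
    return None


def keep_complete_pairs(forward_names, reverse_names):
    """Drop reads whose mate is missing, keeping the original order."""
    shared = set(forward_names) & set(reverse_names)
    return ([name for name in forward_names if name in shared],
            [name for name in reverse_names if name in shared])


forward = ["PAIR-1", "PAIR-3", "PAIR-4"]  # PAIR-2 was removed from this file only
reverse = ["PAIR-1", "PAIR-2", "PAIR-3"]

print(first_mismatch(forward, reverse))       # (2, 'PAIR-3', 'PAIR-2')
print(keep_complete_pairs(forward, reverse))  # (['PAIR-1', 'PAIR-3'], ['PAIR-1', 'PAIR-3'])
```

Running trimming and filtering tools in paired-end mode avoids the problem in the first place, because both mates are then handled together.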
Choice of mapper depends on your experiment
Or other factors
FAQ: "Why not map RNA reads to the transcriptome?"
FAQ: "Why not BLAST or BLAT?"
“... there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify [their] needs in order to choose the tool that provides the best results.” - Hatem et al BMC Bioinformatics 2013, 14:184
Know the data you are working with and pick the right mapper and parameters for the job!
Not an easy task.
60+ different mappers, many comparison papers. Figure from 10.1093/bioinformatics/bts605
Many different tools available
Different strengths and weaknesses, comparison table in link
Mapping tool | Uses | Characteristics |
---|---|---|
HISAT2 | DNA/RNA | Short reads. Based on GCSA. Reference. |
RNASTAR | RNA | Short reads. Extremely fast. High sensitivity and accuracy. Based on Maximal Mappable Prefixes (MMPs). Reference. |
BWA-MEM2 | DNA | Short reads. Twice as fast as BWA-MEM. Memory efficient. Based on Burrows-Wheeler. Reference. |
Minimap2 | DNA/RNA | Long reads (PacBio and ONT). Extremely fast. Based on DALIGN and MHAP. Reference. |
Bismark | DNA/RNA | Short reads. Bisulfite treated sequencing. Based on GCSA. Reference. |
BBMap | DNA/RNA | Short and long reads (PacBio and ONT). Memory demanding. Reference. |
Whisper 2 | DNA | Short reads. Indel sensitive. Variant-calling oriented. Reference. |
S-conLSH | DNA | Long reads (ONT). High sensitivity and accuracy. Reference. |
The alignment is given as a CIGAR string.
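A CIGAR string is a run-length encoding of the alignment: each number is followed by an operation letter such as M (aligned column, match or mismatch), I (insertion in the read) or D (deletion from the read). The Python sketch below parses one; the helper names are hypothetical, and the example string 3M1I1M2D6M is our own encoding of the Novoalign alignment from the earlier slide, not output from a tool.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    operations = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped over.
    if "".join(f"{length}{op}" for length, op in operations) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return operations


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


cigar = "3M1I1M2D6M"  # AAA-CAGTGAGAA aligned to AAATC--TCTGAA
print(parse_cigar(cigar))     # [(3, 'M'), (1, 'I'), (1, 'M'), (2, 'D'), (6, 'M')]
print(reference_span(cigar))  # 12
print(read_length(cigar))     # 11
```

The two totals match the lengths of the reference (AAACAGTGAGAA, 12 bases) and the observed read (AAATCTCTGAA, 11 bases).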
This is IGV (Integrative Genomics Viewer), DOI: 10.1038/nbt.1754
JBrowse.org DOI: 10.1186/s13059-016-0924-1
The JBrowse tool builds up a small website for you and pre-processes the reference genome into a more efficient format. If you wanted to share this with your colleagues, you could download this dataset and place it directly on your webserver.
In the mapping hands-on tutorial you will use JBrowse and IGV
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.