Introduction to metatranscriptomics

Contributors

Subina Mehta

Pratik Jagtap

Saskia Hiltemann

Questions

How to analyze metatranscriptomics data?
What information can be extracted from metatranscriptomics data?
How to assign taxa and function to the identified sequences?

Published: Feb 26, 2020

Last Updated: Jan 15, 2024

Why study the microbiome?

.pull-left[

Health care research
Humans are full of microorganisms
Skin, gut, oral cavity, nasal cavity, eyes, ..
Affects health, drug efficacy, etc

] .pull-right[

.image-100[ Image of a human with various pie charts pointing to various regions of the body where microbe populations live ]

]

Sometimes referred to as your second genome
~10 times more cells than you
~100 times more genes than you
~1000s of different species

Why study the microbiome?

Environmental studies
- Microbes in the soil affect plants and animals
- Improve agriculture

.image-75[ Rhizodeposition: image of a tree converting sunlight and CO2 into fixed carbon used as food for soil microbes. ]

Meta- Omics

meta-omics diagram

This Tutorial: ASaiM pipeline

.pull-left[

Quality Control
- Assess Quality
- Trim and Filter raw reads
- Filter ribosomal RNA (rRNA)
Community profiling (Who?)
- Determine composition of sample
- Visualisation
Functional Analysis (What?)

]

.pull-right[ .image-90[ ASaiM diagram ] ]

.footnote[Batut et al Gigascience. 2018 7(6) doi: 10.1093/gigascience/giy057]

Speaker Notes

For this short tutorial, these slides can be used while the workflow is running to explain the tools being run in each section. After the tools have been explained, the workflows should be far enough along to start showing the results.

Input: Cellulose Degradation in a Biogas Reactor

Workflow graph showing biogas reactor extract being transferred to cellulose and incubated. Time series samples are taken and run through a mass spectrometer and genomic sequencer.

Speaker Notes

A 100 µl aliquot of an enriched community from a biogas reactor was transferred to 27 anaerobic bottles containing a rich medium and 10 g/L of cellulose as the sole carbon source, and incubated at 65 °C.

Three bottles were collected at each of 9 time points (0, 8, 13, 18, 23, 28, 33, 38 and 43 h) and processed in triplicate. Metatranscriptomic analysis was performed on all time points. Metaproteomic analysis was performed on 4 data points.

Input Format: FastQ Files

Four lines per read

Image of a fastq file with label on the first line, sequence on the second, + on the third, and quality scores on the fourth as ascii chars. A callout shows that Base=T, quality=colon, and that means a score of 25.

Speaker Notes

Four lines per read
- @ + identifier on first line, just like fasta
- sequence
- +
- quality score characters
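The four-line layout listed above can be read with a few lines of Python. This is a minimal sketch, assuming well-formed, uncompressed input with exactly four lines per read; the names `FastqRecord` and `parse_fastq` and the two example reads are hypothetical and not part of the tutorial.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # the base calls
    quality: str     # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FastQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example (not taken from the tutorial data)
example = [
    "@read1", "GATTACA", "+", "IIIIII:",
    "@read2", "ACGT", "+", "::::",
]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].sequence)
```

In practice an established parser is usually the safer choice; the sketch only makes the record structure explicit.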

Segue: so what do the quality chars mean?

FastQ: Quality score

Each character denotes a different Phred score
Phred scores are logarithmic

.small[
Phred Quality Score | Probability of incorrect base call | Base call accuracy
--- | --- | ---
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
… | … | …
]
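The relationship in the table can be written as P = 10^(-Q/10). The sketch below decodes a quality character and converts the score to an error probability. It assumes the Phred+33 encoding (ASCII code minus 33), which matches the earlier callout where a colon means a score of 25; other encodings use a different offset. The function names are illustrative only.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one quality character; offset=33 assumes Phred+33 encoding."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


score = phred_from_char(":")          # ord(':') is 58, so 58 - 33 = 25
print(score)
print(error_probability(30))          # about 1 in 1000
print(1 - error_probability(20))      # about 99% accuracy
```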

Speaker Notes

Logarithmic scale
Different flavours of encoding exist

Preprocessing

In this tutorial we start with some preprocessing steps

preprocessing workflow

Preprocessing: Tools

In this tutorial we start with some preprocessing steps

preprocessing workflow

Quality Reports: FastQC

Generate a web report with quality metrics of your FastQ file

Screenshot of FastQC report, showing the table of contents with green checks on nearly every result, and the base statistics and per-base sequence quality graphs shown.

.footnote[see also our dedicated QC tutorial ]

Quality Reports: FastQC

Many different QC plots
Example: Per-base sequence Quality plot

FastQC quality score plot, most results are in the green region but the box portion of the box and whisker plot starts to dip into the yellow, medium quality (less than 30) region near 34+ base position in read. The whiskers begin extending to the red region (less than 20) by base 31 and get progressively worse.

.footnote[explanation of different plots: dedicated QC tutorial ]

Speaker Notes

in the per-base sequence quality plot, a boxplot of the base quality (y-axis) per position in the read (x-axis) is drawn
often you might observe a drop in quality towards the end of the reads, and may consider trimming ends
this example is very good

Quality Reports: FastQC

Many more plots
See QC tutorial for more information

Montage of several different FastQC reports showing sequence quality graphs, and a number of other line graphs.

Quality Reports: MultiQC

.pull-left[

Combine multiple FastQC reports into one report
Also for outputs of other tools
Great when sequencing large numbers of samples ]

.pull-right[ MultiQC's report showing an aggregation of multiple samples. An overview at the top provides context for the 4 samples, and a sequence quality histogram shows 4 samples with similarly behaving quality scores ]

Read Trimming and Filtering: Cutadapt

Trim low-quality bases from reads
Filter reads based on length, mean quality score, ..
Remove adapters/primers

.image-75[]
Many tools: CutAdapt tool, TrimGalore tool, Trimmomatic tool ..

Speaker Notes These are some examples of ways to trim and filter data. Many more are possible, and which ones are necessary depends on your experiment.
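To make the trimming and filtering steps concrete, here is a deliberately simplified sketch. It is not the algorithm Cutadapt implements: it only removes bases from the 3' end while their score is below a threshold, then applies a minimum-length filter. The Phred+33 offset, the thresholds, and the function names are assumptions for illustration.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_phred: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence: str, min_length: int = 5) -> bool:
    """Length filter applied after trimming."""
    return len(sequence) >= min_length


# Hypothetical read: 'I' decodes to 40 and '#' to 2 under Phred+33
seq, qual = trim_3prime("ACGTACGT", "IIIII###")
print(seq, qual, keep_read(seq))
```

Real trimming tools use more robust criteria (for example running sums or sliding windows), so a single good base at the very end does not protect a poor-quality tail.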

SortMeRNA

.pull-left[

Most RNA sequences will be ribosomal RNA (rRNA)
Great for taxonomical assignment (who is there?)
Not informative for functional analysis (What are they doing?)
Filter out rRNA before doing functional analysis ]

.pull-right[ .image-90[ SortMeRNA ] ]

FastQ interlacer

Paired-end data often comes in two separate FastQ files
One file with forward reads, one with reverse reads

paired end deinterlaced file

FastQ interlacer

Some tools require a single interlaced FastQ file
Galaxy has tools for interlacing and deinterlacing FastQ files

paired end interlaced file

Speaker Notes forward and reverse files are ‘zipped’ together into a single file
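The "zipping" can be sketched as follows. This is a minimal illustration, assuming both files list the mates in the same order and that each record is already a tuple of its four lines; it is not the implementation of the Galaxy tools, and the read names are hypothetical.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, '+', quality


def interlace(forward: Sequence[Record], reverse: Sequence[Record]) -> List[Record]:
    """Alternate forward and reverse mates: f1, r1, f2, r2, ..."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse files must hold the same number of reads")
    interlaced: List[Record] = []
    for fwd, rev in zip(forward, reverse):
        interlaced.append(fwd)
        interlaced.append(rev)
    return interlaced


def deinterlace(records: Sequence[Record]) -> Tuple[List[Record], List[Record]]:
    """Inverse operation: even positions are forward, odd positions are reverse."""
    if len(records) % 2 != 0:
        raise ValueError("an interlaced file must hold an even number of reads")
    return list(records[0::2]), list(records[1::2])


# Hypothetical mates for two fragments
forward = [("@r1/1", "ACGT", "+", "IIII"), ("@r2/1", "GGCC", "+", "IIII")]
reverse = [("@r1/2", "TTAA", "+", "IIII"), ("@r2/2", "CCGG", "+", "IIII")]
merged = interlace(forward, reverse)
print([record[0] for record in merged])
```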

Community Profile

We want to identify which organisms are present in our sample, and their relative abundances

Cartoon of several differently coloured and shaped microbes in a circle.

MetaPhlAn2 tool for identification
Krona tool and GraPhlAn tool for visualisation

MetaPhlAn2 tool

Estimates the presence and relative abundance of microbial cells
Maps reads against a set of marker sequences
Caveat: this tool is designed for DNA-seq
- Be careful interpreting abundances when using this tool with transcriptomics data

.footnote[Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.]

Speaker Notes

About the caveat: The theoretical problem is that we quantify species abundance by averaging the coverage of marker genes. Marker genes are supposed to be at the same coverage as they are single copy genes from the same genome, but this is not true for their transcripts. So MetaPhlAn2 on metatranscriptomics gives an idea about the average transcriptional rate of a given species. So it can be used with caution…
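The caveat can be illustrated with a toy calculation. The sketch below averages per-marker coverage for each species and normalises the averages to relative abundances. It is a simplification and not MetaPhlAn2's actual estimator; the species names and coverage values are invented.

```python
from statistics import mean
from typing import Dict, List


def relative_abundance(marker_coverage: Dict[str, List[float]]) -> Dict[str, float]:
    """Average marker coverage per species, then normalise so values sum to 1."""
    per_species = {species: mean(cov) for species, cov in marker_coverage.items()}
    total = sum(per_species.values())
    if total == 0:
        raise ValueError("no marker coverage observed")
    return {species: value / total for species, value in per_species.items()}


# DNA-like case: single-copy markers from one genome share the same coverage
dna_like = {"species_A": [10.0, 10.0, 10.0], "species_B": [10.0, 10.0, 10.0]}

# RNA-like case: same cell counts, but species_A transcribes one marker heavily
rna_like = {"species_A": [30.0, 2.0, 1.0], "species_B": [10.0, 10.0, 10.0]}

print(relative_abundance(dna_like))
print(relative_abundance(rna_like))
```

With equal cell counts the DNA-like input gives equal abundances, while the RNA-like input shifts the estimate towards the species with the highly expressed marker, which is why the result reflects transcriptional activity rather than cell numbers.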

Krona tool

Visualization of community composition, interactive plot

GraPhlAn tool

Cladogram visualisation

Colourful cladogram which begins from the center and expands outward with the lineage of the samples. Each sector of the chart is coloured differently for each group of genus and species. E.g. streptococcus streptococcaceae has three different leaves of the cladogram tree.

Genus Abundance

Tutorial: one timepoint
Over multiple timepoints:

stacked bar chart with timepoints along the x axis and genus abundance as a percentage along the y axis. Each of the 7 time samples consists mostly of Coprothermobacter and Clostridium.

Functional Analysis

Pathways
Gene Ontology
- Biological process
- Molecular function
- Cellular component
Gene Family

Workflow

functional analysis workflow schematic

Speaker Notes HUMAnN2

next generation
HMP Unified Metabolic Analysis Network
developed by Huttenhower lab
itself a workflow/pipeline
basically answers the question: what is the microbial community capable of?

HUMAnN2 tool

Profiles presence/absence and abundance of microbial community
Efficiently characterizes microbial metabolic pathways
Input
- Interlaced non-rRNA reads
- Taxonomic profile (MetaPhlAn2 tool)
Output
- Gene families and their abundance
- Pathways and their coverage
- Pathways and their abundance

Speaker Notes

contains 5 parts -> non-rRNA reads, MetaPhlAn taxonomy, NCBI nucleotide db, UniRef50/90 protein db, MetaCyc/UniPathway.
Show the Galaxy wrapper

HUMAnN2 Tiered Search

class: top

.left-column70[

Meta-omic sequences (DNA/RNA)

]

.right-column30[ .image-60[ Cartoon of several reads coloured into four groups, Species 1, Species 2, Unclassified, Novel. ] ]

–

.left-column70[

Initial screen through MetaPhlAn2 tool: known microbial species
- Database: merging pangenomes of identified species

]

.right-column30[ .image-60[ Four bins labelled 1 (red), 2 (blue), 3, 4 with reads from the top cartoon show piles of 1 and 2 with NO signs over 3 and 4. ] ]

–

.left-column70[

Nucleotide-level mapping against database

]

.right-column30[ .image-60[ Regions x 1 y in red and x 2 y in blue are shown, the pangenomes of each of the red and blue species are shown. Reads map to most segments of the pangenome. ] ]

–

.left-column70[

Unaligned reads searched against a protein database (UniRef) through accelerated translated search

]

.right-column30[ .image-75[ Reads are shown matching against portions of protein sequences of X, Y, Z ] ]

Speaker Notes Takes the non-rRNA reads plus the list of abundant organisms from MetaPhlAn2, then performs nucleotide-level pangenome mapping with Bowtie2 against the ChocoPhlAn db, giving organism-specific gene hits and unmapped reads. The unmapped reads are further searched against a protein database using accelerated translated search. The protein hits are then combined with the gene hits and MetaCyc to give the output.

Result: Gene family and pathway abundances

.image-40[![A table with two columns, Feature on left and RPK on right. GeneX has an RPK of 8. GeneX|Species1 has an RPK of 2, species2 and unclassified are listed with an RPK of 3.](../../images/metatranscriptomics/humann2_tiered5.png)]
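The table pairs a community total with per-taxon rows for the same feature. As a minimal sketch, assuming the `Feature|taxon` naming that HUMAnN2 uses for stratified rows, the check below confirms that the per-taxon values add up to the total (2 + 3 + 3 = 8 for the values in the figure). The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def split_feature(feature: str) -> Tuple[str, Optional[str]]:
    """Split 'GeneX|Species1' into ('GeneX', 'Species1'); totals have no taxon."""
    name, _, taxon = feature.partition("|")
    return name, taxon or None


def strata_match_totals(table: Dict[str, float], tolerance: float = 1e-6) -> Dict[str, bool]:
    """For each feature, check that stratified rows sum to the community total."""
    totals: Dict[str, float] = {}
    strata: Dict[str, float] = {}
    for feature, rpk in table.items():
        name, taxon = split_feature(feature)
        if taxon is None:
            totals[name] = rpk
        else:
            strata[name] = strata.get(name, 0.0) + rpk
    return {name: abs(total - strata.get(name, 0.0)) <= tolerance
            for name, total in totals.items()}


# Values as described in the figure
table = {
    "GeneX": 8.0,
    "GeneX|Species1": 2.0,
    "GeneX|Species2": 3.0,
    "GeneX|unclassified": 3.0,
}
print(strata_match_totals(table))
```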

Gene Families Abundances

Screenshot of a table in Galaxy with Gene Family on left and humann2 abundance RPKs on right.

RPK (reads per kilobase) = sum of alignment scores

Speaker Notes

Gene families: groups of evolutionarily related proteins that perform similar functions
Pathway: sum over genes catalyzing the reaction
Pathway coverage: presence/absence
RPK (relative gene copy number): computed as the sum of all alignment scores over a particular gene family
UNMAPPED: total number of reads that remained unmapped even after both alignment steps
UNINTEGRATED: no pathway detected

Gene Families to Functional Annotation

Humann2 regroup table is the left node in a flow chart with UniRef50. Multiple lines are drawn to an unlabelled right node that lists metacyc, kegg, pfam, EC, GO, informative GO, slim GO.

Speaker Notes Depending on the complexity of the sample, the gene family table can be too large to interpret directly. To simplify it, users can regroup gene families with the regrouping tool, and can download the mapping files. HUMAnN2 regroups UniRef50/90 values to GO terms to give a broad overview.
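Regrouping amounts to summing gene family abundances over a mapping from families to broader groups. The sketch below assumes summation as the combining function and collects unmapped families under an `UNGROUPED` label; the family names, group labels, and mapping are placeholders and not real UniRef or GO identifiers.

```python
from collections import defaultdict
from typing import Dict, List


def regroup(gene_families: Dict[str, float],
            mapping: Dict[str, List[str]]) -> Dict[str, float]:
    """Sum abundances of gene families into the groups they map to."""
    grouped: Dict[str, float] = defaultdict(float)
    for family, abundance in gene_families.items():
        groups = mapping.get(family)
        if not groups:
            # Families without a mapping are kept rather than silently dropped
            grouped["UNGROUPED"] += abundance
            continue
        for group in groups:
            # A family that maps to several groups contributes to each of them
            grouped[group] += abundance
    return dict(grouped)


# Placeholder identifiers for illustration only
gene_families = {"family_A": 4.0, "family_B": 6.0, "family_C": 1.0}
mapping = {
    "family_A": ["GO_term_1"],
    "family_B": ["GO_term_1", "GO_term_2"],
}
print(regroup(gene_families, mapping))
```

Because one family can belong to several groups, the regrouped values do not necessarily sum to the original total.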

Group Abundances

humann2 regroup table, lines from uniref50 to GO. group humann2 to GO slim terms shows a similar graphic, lines from uniref50 to slim GO.

Speaker Notes Group abundances converts GO terms to GO slim terms (a subset of GO terms), split into molecular function, biological process and cellular component.

Gene Families to Functional Annotation

Table from Galaxy shown with gene family and RPK

group HUMAnN2 to GO slim terms with lines from uniref50 to slim GO and boxes of Molecular Function, biological process, and cellular component below Slim GO

Another galaxy table screenshot with GO id, GO name, and abundance.

class: top

Output

.left-column30[
Molecular Function

]

.right-column70[ .image-90[ Table in Galaxy with GO ID, name, abundance. ] ] –

.left-column70[ .image-90[ Basically the same table as above. ] ]

.right-column30[
Biological Process

]

–

.right-column70[ .image-90[ Again the same columns in a table. None of the specific data is legible or important. ] ]

.left-column30[

Cellular Component

]

Speaker Notes g is genus, s is species level

Unpack pathway abundances to show genes included

Renormalize the gene and pathway abundances in copies per million or relative abundance
This tool unpacks the pathways abundance by including gene families
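Renormalization rescales each value by the column total. A minimal sketch, assuming a simple sum-to-constant normalisation and ignoring any special handling of rows such as UNMAPPED; the function name and input values are illustrative.

```python
from typing import Dict


def renormalize(abundances: Dict[str, float], units: str = "cpm") -> Dict[str, float]:
    """Rescale so values sum to 1,000,000 ('cpm') or to 1 ('relab')."""
    scale = {"cpm": 1_000_000.0, "relab": 1.0}[units]
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("cannot renormalize an all-zero table")
    return {feature: value / total * scale for feature, value in abundances.items()}


# Hypothetical RPK values
rpk = {"GeneX": 8.0, "GeneY": 2.0}
print(renormalize(rpk, "cpm"))
print(renormalize(rpk, "relab"))
```

Expressing values relative to the total makes samples with different sequencing depths comparable.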

output file unpack pathway tool

Function: Cellulose Degradation

Quantitative analysis of gene family outputs from HUMAnN2 shows upregulation of cellulase

.image-75[ line chart showing cellulase abundance decreasing from 80 copies per million to 40 as time goes from 13 to 43. Cellulose 1,4-beta-cellobiosidase starts at 140 cpm and dips at hour 23 to 120 before increasing to 200 by the end of the graph ]

Speaker Notes Explain the datasets first. Cellulose 1,4-beta-cellobiosidase is responsible for the hydrolysis of cellulose. The gene encoding the cellulose-binding domain protein shows an initial decrease and subsequent increase during cellulose degradation.

Functions associated with a selected taxon

.image-75[ Stacked bar chart with a lot of organisms as left axis (abundance, copies per million) and time on the bottom. It is labelled Coprothermobacter: Functional Pathways ]

Speaker Notes In the genus abundance plot, Coprothermobacter and Clostridium were observed to be the most abundant. In this figure we are looking at Coprothermobacter only. Glycolysis is observed to be the most abundant functional pathway across time points in Coprothermobacter.

Taxa associated with a selected function

.image-75[ Bar chart titled Adenosine ribonucleotides de novo biosynthesis with time in hours as x axis, and Genus abundance (copies per million). Coprothermobacter and Clostridium decrease from ~2000 combined copies per million to ~800, in approximately equal amounts. ]

Speaker Notes This figure shows the contribution of genera to adenosine ribonucleotides de novo biosynthesis across time points. It shows that during ATP synthesis, Clostridium and Coprothermobacter are abundant.

Tabular Outputs from ASaiM Workflow

Taxonomy (Who?)
- Kingdom, phylum, class, order, family, genus, species, strain
Function (What?)
- Pathways
- Gene Ontology
  - Biological Process
  - Molecular Function
  - Cellular Component
- Gene Family

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.