name: inverse layout: true class: center, middle, inverse
---
# Introduction to Microbiome Analysis
Paul Zierep
Bérénice Batut
Saskia Hiltemann
last_modification
Updated:
purl
PURL
:
gxy.io/GTN:S00114
video-slides
Video slides
|
text-document
Plain-text slides
|
video
Recordings
video
Lecture (February 2021) - 15m
video
View All
Tip:
press
P
to view the presenter notes |
arrow-keys
Use arrow keys to move between slides
??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/training-material/topics/introduction) --- ## Microbiome Analysis ??? - Microbiome analysis is the examination of the collection of microorganisms (microbes) in a particular environment, such as the human body, soil, water, or any other habitat. - This analysis aims to understand the composition, diversity, and functions of these microbial communities. --- ## Overview - Why study the microbiome? - 2 different approaches - Amplicon sequencing - Shotgun sequencing - Analysis pipelines - Visualisation options ??? - Studying the microbiome is essential because it plays a crucial role in human health by influencing digestion, immunity, and disease susceptibility. - It also helps in understanding and improving agricultural productivity and environmental sustainability through insights into soil and plant microbiomes. - Additionally, microbiome research can lead to the development of new medical therapies, diagnostics, and biotechnological applications. - DNA-based microbiome analysis is mainly based on two techniques: Amplicon sequencing and Shotgun sequencing. - Microbiome analysis required sophisticated analysis pipelines for the processing and the interpretation of the sequence data. --- ## Why study the microbiome? .pull-left[ - Health care research - Humans are full of microorganisms - Skin, gut, oral cavity, nasal cavity, eyes, .. - Affects health, drug efficacy, etc ] .pull-right[ .image-100[ ![image of a human with lines leading from various points on the body to pie charts of bacteria distribution](../../images/human_microbiome.png) ] ] - Sometimes referred to as your **second genome** - ~10 times more cells than you - ~100 times more genes than you - ~1000s different species ??? - The term "second genome" refers to the vast collection of microbial genes present in and on the human body, which outnumber human genes by about 100 to 1. - This microbial genetic material plays a crucial role in various physiological processes, such as digestion, immunity, and even mental health, making it an integral part of our overall genetic makeup and health. --- ## Why study the microbiome? - Environmental studies - Microbes in the soil affect plants and animals - Improve agriculture .image-75[ ![Rhizodeposition: image of a tree converting sun and CO2 into fixed carbon used as food for soil microbes.](../../images/environmentalstudies.png) ] ??? - In environmental studies, the microbiome refers to the diverse communities of microorganisms living in various environments such as soil, water, and plants. - These microbial communities are critical for ecosystem functions like nutrient cycling, soil fertility, pollutant degradation, and plant health, thus playing a vital role in maintaining environmental balance and sustainability. --- ## Types of microbiome analysis .image-75[ ![Experimental-workflows-for-mainstream-omics](../../images/Experimental-workflows-for-mainstream-omics-analysis-types-Arrow-colours-indicate-omics.png) ] .footnote[Image source: [10.1039/D3MO00089C](https://doi.org/10.1039/D3MO00089C)] ??? - Meta-omics refers to a suite of high-throughput techniques used to study various aspects of microbial communities in an integrated way, providing a comprehensive view of their functions, structures, and interactions. - Metagenomics involves sequencing the entire genetic material from a microbial community, offering insights into the diversity, composition, and potential functions of the microorganisms. - Metatranscriptomics examines RNA transcripts to assess gene expression and microbial responses to environmental changes. - Metametabolomics analyzes the small molecules produced by microbial metabolism to understand their metabolic activities and interactions. - Metaproteomics focuses on studying microbial proteins to uncover their roles, expression levels, and interactions. - Amplicon sequencing, though more targeted, involves sequencing specific genetic regions to profile microbial diversity and abundance. --- ## Shotgun vs Amplicon .pull-left[ .image-75[ ![jigsaw puzzle image illustrating shotgun sequencing approach](../../images/puzzles-shotgun.png) ] **Shotgun Sequencing** - Sequence all DNA - More information - Higher complexity - Higher cost ] .pull-right[ .image-75[ ![jigsaw puzzle image illustrating amplicon sequencing approach](../../images/puzzles-amplicon.png) ] **Amplicon Sequencing** - Sequence only specific gene - No functional information - Less complex to analyse - Cheaper ] ??? - Shotgun sequencing analyzes the entire genetic material in a sample, providing comprehensive and detailed information about microbial community structure and function, but it is complex and costly. - Amplicon sequencing targets specific regions of microbial DNA, such as the 16S rRNA gene, offering a cost-effective and simpler method for profiling microbial communities, though it lacks detailed functional insights and is subject to amplification bias. - Shotgun sequencing is ideal for detailed metabolic and functional studies, while amplicon sequencing is suitable for assessing microbial composition and diversity. --- ## Amplicon - Targetted approach, e.g. 16S (bacteria), 18S (eukaryotes), ITS (fungi) - Amplify gene of specific taxonomic group, i.e. not host or environment ![left is picture of e. coli ribosome with "so much rna" written, right is the RNA structure in a diagram](../../images/16S_gene.png) ??? - Amplicon sequencing is a targeted approach that focuses on sequencing specific genetic regions to identify and analyze microbial communities. - For bacteria, the 16S rRNA gene is commonly targeted, while the 18S rRNA gene is used for eukaryotes, and the ITS region is targeted for fungi. - This method allows for efficient and cost-effective profiling of microbial composition and diversity by amplifying and sequencing these specific genetic markers. --- ## Amplicon - Highly conserved gene: easy to target across all bacteria - With variable regions: distinguish between genus ![line graph showing V regions as peaks along x axis of alignment position, and y index of shannon-index/alignment-index.](../../images/16S_variableregions.jpg) ??? - The 16S rRNA gene used in amplicon sequencing for bacteria consists of both conserved and variable regions. - The conserved regions are highly similar across different bacterial species, providing anchor points for primers used in sequencing. - The variable (V) regions, on the other hand, are more diverse and unique to different species, allowing for the differentiation and identification of bacteria within a sample. - Sequencing these V regions enables researchers to profile the bacterial community and understand its composition and diversity. --- ## Amplicon - Pros - Well-established - Inexpensive ($50-$100/sample) - Cons - V-region choice can bias results - Is based on a very well conserved gene, making it hard to resolve species and strains ??? - Amplicon sequencing is cost-effective and simpler than other methods, making it accessible for routine studies and straightforward to analyze due to its focus on specific genetic regions. - However, it has limitations, such as only providing information about the targeted regions, which may miss some microbial diversity and functional insights. - Additionally, the method can introduce amplification biases and may not differentiate closely related species or strains effective --- ## Shotgun metagenomics - Aims to sequence the "whole" metagenome - Pros: - Not biased by amplicon primer set - Not limited by conservation of the amplicon - Can also provide functional information - Cons: - Environmental contamination, including host - More expensive ($1000+/sample) - Complex data analysis - Requires high performance computing, high memory, high compute capacity ??? - Shotgun metagenomics is a comprehensive approach to analyzing microbial communities by sequencing all the genetic material present in a sample, rather than focusing on specific regions. - This technique involves fragmenting the entire DNA from a sample, sequencing these fragments, and then using computational tools to assemble and analyze the data. - By capturing the complete genetic makeup of all microorganisms, including bacteria, archaea, viruses, and eukaryotes, shotgun metagenomics provides detailed insights into the diversity, composition, and functional potential of microbial communities. - This method allows for a broad and deep understanding of microbial interactions and functions, though it is often more complex and costly compared to targeted sequencing approaches. - Additionally, shotgun metagenomics can be affected by host contamination, which may complicate the interpretation of microbial data. --- ## End-to-End ![graphi with an arrow going from sampling (nose picture) to extraction to amplification to ngs to bioinformatics (computer picture)](../../images/16S_end-to-end.png) Every step in this process can have serious impact on the results ??? - In end-to-end metagenomic workflows, every step—from sample collection to data analysis—can significantly impact the final results. - Sample collection methods can introduce biases or contaminants, affecting the accuracy of microbial representation. - During DNA extraction, variations in protocols or efficiencies can alter the quantity and quality of genetic material obtained. - Sequencing technologies and their associated errors can introduce noise or gaps in the data. - Data processing, including assembly and annotation, relies on computational tools and databases that may -influence the interpretation of microbial diversity and functions. - Finally, the integration and analysis of results can be affected by the algorithms used and the quality of the reference databases. - Each of these steps must be carefully controlled and standardized to ensure reliable and reproducible outcomes in metagenomic studies. --- ## Bioinformatics ![cartoon titled drowned in NGS data with a man covered in messy data.](../../images/so_much_data.png) ``` Roche 454 GS: ~ 100.000 reads Illumina MiSeq: ~ 25.000.000 reads Shotgun: ~ ? reads ``` ??? - In metagenomics, "drowning in NGS data" refers to the overwhelming volume of data generated by next-generation sequencing technologies. - This issue arises because NGS can produce vast amounts of sequence reads from complex microbial communities in a single experiment, creating challenges in data management, processing, and interpretation. - Managing and organizing this massive data set requires substantial computational resources, while analyzing it involves quality control, read assembly, taxonomic classification, and functional annotation, all of which add complexity and computational demands. - Extracting meaningful biological insights from such extensive data requires advanced bioinformatics expertise to navigate potential noise and errors and to ensure that findings are robust and reproducible. --- ## Analysis pipelines ![comparison of amplicon and shotgun pipelines. amplicon pipeline goes to community structure viz, shotgun pipeline goes to community structure in addition to functional assignment.](../../images/metagenomics_pipeline.png) Amplicon analysis variants: - OTU: operational taxonomic units - ASV: amplicon sequence variants ??? - Amplicon sequencing focuses on specific genetic regions, such as the 16S rRNA gene for bacteria, offering a cost-effective and simpler pipeline that provides insights into microbial community composition but may miss broader diversity and functional details. - In contrast, shotgun metagenomics sequences all genetic material, delivering a comprehensive view of microbial diversity and functional potential, though it requires more complex and expensive analysis. - Operational Taxonomic Units (OTUs) group sequences based on similarity, which can obscure finer taxonomic distinctions and is often used in amplicon pipelines. - Amplicon Sequence Variants (ASVs) represent exact sequences without grouping, offering higher resolution and more accurate taxonomic identification, and are increasingly used in modern amplicon pipelines for greater precision in microbial profiling. --- ## Pre-processing ![fastqc plot with regions labelled good, okay, and bad based on the plot's colour](../../images/preprocessing.png) - There are a lot of ways to filter and trim your data - Trade-off between quality and amount of information retained ??? - Preprocessing in metagenomics involves a series of steps to prepare raw sequencing data for analysis, including quality control, trimming of low-quality sequences, removal of contaminants, and filtering out redundant or erroneous data. - This stage is crucial for ensuring that the data used for downstream analysis is accurate and reliable. - The trade-off between quality and amount of information retained is a key consideration in preprocessing. - On one hand, stringent quality control measures improve the accuracy of the data by removing errors and contaminants, but this can also lead to the loss of some useful information, particularly from low-abundance or poorly sequenced organisms. - On the other hand, retaining more data might include lower-quality sequences or noise, which can complicate data analysis and interpretation. - Balancing these factors involves optimizing preprocessing steps to retain the most informative and accurate data while minimizing errors and maintaining sufficient coverage for meaningful analysis. --- ## OTU Clustering / Denoising .pull-left[ OTU clustering: - Cluster on 97% sequence similarity for genus-level differentiation - OTUs depend on the data set: no comparability of data sets .image-75[ ![tree with clusters around the leaves labelled OTUs and 97% identity threshold](../../images/otu.png) ] ] .pull-right[ ASV Denoising: - Discriminate sequence variations from sequencing errors - Based on an error model and the assumption that sequences containing errors are less likely observed repeatedly .image-90[ ![DNA sequences with hierarchical clustering shown](../../images/clustering.png) ] ] ??? - OTU clustering groups similar sequence reads into Operational Taxonomic Units based on a predefined similarity threshold, which can simplify the data but may mask finer taxonomic differences and reduce resolution by lumping closely related species together. - In contrast, ASV denoising involves identifying and retaining exact sequence variants without grouping, providing higher resolution and more precise taxonomic identification by distinguishing unique sequences and reducing errors, though it requires more sophisticated algorithms and computational resources. --- ## Chimera Removal - During PCR multiple sequences can combine to form a hybrid - Must be removed from your data for better results ![aborted extension leads to mispriming, which gets extended with the wrong sequence, and results in a chimera](../../images/chimeras.png) ??? - Chimera removal is a crucial step in microbiome data processing that involves identifying and eliminating chimeric sequences—artificial DNA sequences that arise from the incorrect joining of two or more distinct sequences during PCR amplification. - These chimeras can distort the representation of microbial communities by introducing erroneous sequences that do not reflect actual biological entities. - The removal of chimeric sequences is essential for ensuring accurate and reliable results in downstream analyses, such as taxonomic classification and diversity assessment. --- ## Search marker database and taxonomy assignment - Homology with reference databases - Amplicon ![graphic with two databases green genes (berkely, 2013, 200k entries) and silva (max plank, 2015, 172k entries)](../../images/silva-greengenes.png) - Shotgun: MetaPhlAn2 database - ~1M unique clade-specific marker genes - ~17,000 reference genomes (bacterial and archaeal, viral and eukaryotic) - Accuracy depends on quality and completeness of database used - Databases are inevitable incomplete ??? - In microbiome studies, searching a marker database involves comparing sequencing data to a reference database to identify microbial sequences. - For example, in amplicon sequencing, sequences might be compared to the SILVA database, which contains rRNA gene sequences, to classify microorganisms based on known genetic markers. - In shotgun metagenomics, the MetaPhlAn database can be used, which contains unique marker genes for various microorganisms, enabling precise taxonomy assignment and detailed profiling of the microbial community. --- ## Search functional database - Databases to identify gene families - UniRef50 - UniRef90 - Grouping in other functional categories - MetaCyc Reactions - KEGG Orthogroups (KOs) - Pfam domains - Level-4 enzyme commission (EC) categories - EggNOG (including COGs) - Gene Ontology (GO) - Informative GO - Pathway reconstruction ??? - For functional analysis, researchers can use databases like UniRef50 and UniRef90, which cluster sequences into representative sets to help identify proteins and their functions. - MetaCyc Reactions provides detailed information on metabolic pathways and enzymatic reactions, allowing the identification of metabolic capabilities within the microbiome. - EggNOG offers orthologous group annotations, helping to predict gene functions and evolutionary relationships. - These functional databases provide deeper insights into the biological roles and potential activities of the microbial community, complementing the taxonomic data. --- ## Results: OTU table ![screenshot of an excel table](../../images/otutable.png) ??? - The results of microbiome studies are typically compiled into a comprehensive summary such as an OTU table or a similar data structures. - In this table, each row represents an OTU, while each column represents a different sample analyzed. - The cells within the table contain values indicating the abundance of each OTU in each sample, often presented as counts or relative frequencies. - This format provides a detailed overview of the microbial community composition across multiple samples, allowing researchers to compare diversity and abundance patterns, and perform statistical analyses to assess differences between sample groups, microbial diversity, and correlations with environmental or health-related variables. - Comparable tables for other modality of microbiome studies are taxonomy tables, gene tables and pathway tables. --- ## Results: Visualizations - **Krona** - interactive exploration of sample taxonomy ![doughnut plot with bands for each kingdom, down to genus, species](../../images/krona_lacto.png) ??? - Visualizations of microbiome data are crucial for several reasons. - They transform complex, high-dimensional data into interpretable and accessible formats, allowing researchers to quickly grasp the composition and diversity of microbial communities. - Krona is a visualization tool used to explore microbiome data from an OTU table interactively. - First, the OTU table, with taxonomic classifications and abundance data, is formatted into a compatible file for Krona. - The software then processes this file to create an interactive, multi-layered pie chart where each layer represents different taxonomic levels, such as phylum, class, order, family, genus, and species. - Users can click on segments to drill down into lower taxonomic levels, with the segment sizes indicating the relative abundance of each taxon. - This allows for intuitive and interactive exploration of the microbial community structure, making it easy to identify patterns and differences across samples. --- ## Results: Visualizations - **Phinch** - BIOM file input - various visualizations - multi-sample data ![screenshot of phinch website showing multiple plot types](../../../../shared/images/phinch_overviewpage.png) ??? - Phinch is another powerful tool for visualizing microbiome data, offering interactive and customizable charts that help in exploring complex datasets. --- ## Differential analyses ![Maaslin2 Example](../../images/maaslin2.jpg) .footnote[*Figure taken from: https://doi.org/10.1101/2021.01.20.427420*] ??? - Differential microbiome analysis using tools like MaAsLin2 involves comparing microbial community profiles between different groups or conditions to identify significant differences in microbial abundance or diversity. - MaAsLin2 uses linear models to handle complex, high-dimensional data and adjust for potential confounders, providing robust statistical insights into which microbes are differentially abundant. - This analysis helps in understanding how microbial shifts correlate with various factors, such as disease states or environmental changes, thereby uncovering potential biomarkers and functional implications. --- ## Diversity ![Diversity Example](../../images/diversity_differences.png) ??? - Alpha diversity measures the diversity within a single sample, reflecting the number of species and their relative abundances. - Beta diversity compares microbial communities between samples, highlighting differences in species composition and abundance. - Together, these metrics provide insights into both the internal diversity of communities and how they vary across different conditions. --- ## Assembly ![Assembly Example](../../images/individual-assembly.png) ??? - Microbiome assembly involves reconstructing the complete genomes of microorganisms from sequencing data to understand their functional potential and diversity. --- ## Binning ![Binning Example](../../images/binning.png) ??? - After the initial assembly of contigs from sequencing data, binning helps to differentiate and categorize these contigs based on their genetic similarity, coverage, and other characteristics, effectively reconstructing individual microbial genomes from the mixed sample. --- ## Thank You! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
Authors:
Paul Zierep
Bérénice Batut
Saskia Hiltemann
Tutorial Content is licensed under
Creative Commons Attribution 4.0 International License
.