Marine Omics identifying biosynthetic gene clusters

Author(s)	Marie Josse
Reviewers

Overview
Questions:

Which biological questions are addressed by the tutorial?

Which bioinformatics techniques are important to know for this type of data?

Objectives:

Follow a marine omics analysis

Learn to conduct a secondary metabolite biosynthetic gene cluster (SMBGC) Annotation

Discover SanntiS a tool for identifying BGCs in genomic & metagenomic data

Manage fasta files

Requirements:

Introduction to Galaxy Analyses

Time estimation: 3 hours

Supporting Materials:

Workflows

FAQs

Published: Aug 19, 2024

Last modification: Jun 3, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00450

version Revision: 3

Introduction

Through this tutorial, you will learn in the first part how to produce a protein fasta file from a nucleotide fasta file using Prodigal (it is a tool that predicts protein-coding genes from DNA sequences).

Then, you’ll be using InterProscan to create a tabular. Interproscan is a batch tool to query the InterPro database. It helps identify and predict the functions of proteins by comparing them to known databases.

And finally, you will discover SanntiS both to build genbank and especially to detect and annotate biosynthetic gene clusters (BGCs).

Agenda

In this tutorial, we will cover:

Introduction

Get data

Import and launch the workflow

Prodigal Gene Predictor: generate a protein fasta file

Prodigal Gene Predictor

SanntiS for building a Genbank file

Regex Find And Replace

InterProScan

SanntiS for annotating biosynthetic gene clusters

Conclusion

Extra information

Get data

The FASTA file used in this example is, intentionally, a very small fraction of a genome (spanning exactly 1 BGC), and it’s main purpose is to quickly check that all the pieces of the workflow are working.

Hands On: Data Upload

Create a new history for this tutorial and give it a name (for example “Marine Omics: SMBGC annotation”) for you to find it again later if needed.

To create a new history simply click the new-history icon at the top of the history panel:

Import the file with this link https://figshare.com/ndownloader/files/48574534and name it BGC0001472.fna

Copy the link location

Click galaxy-upload Upload at the top of the activity panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

Rename the datasets BGC0001472.fna

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

Do you want to run the workflow or to discover the tools one by one ?

Tools Workflow

Import and launch the workflow

Hands On: Import the workflow

Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.

Option 1: use the URL - Click on galaxy-upload Import at the top-right of the screen - Paste the URL of the workflow into the box labelled “Archived Workflow URL” https://earth-system.usegalaxy.eu/u/marie.josse/w/marine-omics-identifying-biosynthetic-gene-clusters

Option 2: use the workflow name - Click on Public workflows at the top-right of the screen

Search for Marine Omics identifying biosynthetic gene clusters

In the workflow preview box click on galaxy-upload Import

Click the Import workflow button

Hands On: Run the workflow

Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.

Click on the workflow-run (Run workflow) button next to your workflow

/!\ Select Yes for Workflow semi automatic

Configure the workflow as needed with the 2 datasets you uploaded right before (BGC0001472.fna)

Click the Run Workflow button at the top-right of the screen

You may have to refresh your history to see the queued jobs

Now you don’t have to do anything else. You should see all the different steps of the workflow appear in your history. When the workflow is fully completed you should have the following history.

Image of the history with all the steps of the workflow. — **Figure 1**: History

Workflows are a powerful Galaxy feature that allows you to scale up your analysis by performing an end-to-end analysis with a single click of a button. In order to keep provenance of the workflow invocation (an invocation of a workflow means one run or execution of the workflow) it can be exported from Galaxy in the form of a Workflow Run Crate RO-Crate profile.

After the workflow has completed, we can export the RO-Crate. The crate does not appear in your history, but can be accessed from the User -> Workflow Invocations menu on the top bar.

Hands On: Extract a RO-Crate

In the top right of your history, go to galaxy-history-options -> Show Invocations

Open image in new tab

Figure 2: History options

Our latest workflow run should be listed at the top.

Click on it to expand it:

Open image in new tab

Figure 3: Workflow invocation

Click on the Export tab in the expanded view of the workflow invocation. You should see a page that contains three download options: - Research Object Crate (RO-Crate) - BioCompute Object - File

Click on the Generate galaxy-download option of the RO-Crate box (1st box)

Open image in new tab

Figure 4: RO-Crate

Great work! You have created a Workflow Run Crate. This makes it easy to track the provenance of the executed workflow.

Prodigal Gene Predictor: generate a protein fasta file

Prodigal Gene Predictor

Prodigal is a tool that predicts protein-coding genes from DNA sequences. It takes a nucleotide FASTA file as input and identifies regions that are likely to code for proteins. The output is a protein FASTA file where each sequence represents a predicted protein. Some of these sequences end with an asterisk (*), which marks the end of a complete protein sequence identified by Prodigal. This asterisk is added when Prodigal detects a full protein-coding region that ends with a stop codon. Sequences without an asterisk either represent partial proteins or do not end in a typical stop codon.

Hands On: Run prodigal

Prodigal Gene Predictor ( Galaxy version 2.6.3+galaxy0) with the following parameters:

param-file “Specify input file”: BGC0001472.fna (Input the nucleotide fasta file)

“Specify mode”: Meta : Anonymous sequences, analyze using preset training files, ideal for metagenomic data or single short sequences

You don’t need to change any other parameters leave them on the default input.

Click on Run Tool

You should have 4 new outputs appearing in your history. In these outputs you should have Prodigal Gene Predictor on data 1 : protein translations file. You can click on it and then click on the galaxy-eye (eye).

Image of the protein fasta file from prodigal and of the new history with all the new outputs. — **Figure 5**: Prodigal outputs

You can notice here that at each end of the sequence there’s a *. Later on we will need to remove this star. But, first we are going to use this protein file to build the Genbank that SanntiS need to make a SMBGC annotation.

SanntiS for building a Genbank file

This step combines the original nucleotide sequences with the cleaned protein sequences to create a GenBank format file. This format is widely used for storing and organising information about DNA sequences and their annotations. In this step, the DNA sequences and their corresponding coding regions are transformed into a format that is suitable for SanntiS.

Hands On: Build Genbank

SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:

“Do you want to build a genbank or to make a SMBGC Annotation?”: Build genbank

param-file “Input a nucleotide fasta file”: BGC0001472.fna (Input the nucleotide fasta file)

param-file “Input a protein fasta file”: Prodigal Gene Predictor on data 1 : protein translations file (output of Prodigal Gene Predictor tool)

Click on Run Tool

Image of the Genbank file produces by SanntiS. — **Figure 6**: Genbank file

Regex Find And Replace

Remember earlier we noticed the star * in the protein fasta file ?

Now is the time to remove it !

The asterisks at the end of some protein sequences are informative but can cause issues with some analysis tools. In this step, we remove these asterisks to produce a clean protein FASTA file, making it ready for further analysis.

Hands On: Remove *

Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:

param-file “Select lines from”: Prodigal Gene Predictor on data 1 : protein translations file (output of Prodigal Gene Predictor tool)

In “Check”:

param-repeat “Insert Check”

“Find Regex”: \*$

“Replacement”: `` (leave an empty box there)

Click on Run Tool

Check if the * were well removed.

InterProScan

InterProScan is a tool that helps identify and predict the functions of proteins by comparing them to known databases. This tool analyses the protein sequences and produces an output file containing detailed information about the possible functions, domains, and families associated with these proteins.

Hands On: Task description

InterProScan ( Galaxy version 5.59-91.0+galaxy3) with the following parameters:

param-file “Protein FASTA File”: Regex Find And Replace on data ** (output of Regex Find And Replace tool)

“Use applications with restricted license, only for non-commercial use?”: No You can leave all the other parameters on the default input.

Click on Run Tool

Image of InterProScan output in tabular (tsv) in the history. — **Figure 7**: InterProScan output

SanntiS for annotating biosynthetic gene clusters

SanntiS is a tool specifically designed to detect and annotate biosynthetic gene clusters (BGCs). It uses neural networks trained on InterPro signatures to achieve high accuracy in identifying BGCs in both genomic and metagenomic datasets 1. A significant benefit of using SanntiS is that your results will be comparable with a large number of datasets, including over 5,000 marine metagenomic assemblies archived in the MGnify resource. This tool provides valuable insights into the biosynthetic potential of organisms or environmental samples. The final output is a GFF file containing detailed BGC annotations, which can be used for further analyses or applications.

Hands On: Identify biosynthetic gene clusters

SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:

“Do you want to build a genbank or to make a SMBGC Annotation?”: Run SanntiS

param-file “Input the tabular file from InterProScan”: InterProScan on data ** (output of InterProScan tool)

param-file “Input a Genbank file”: SanntiS output data genbank (output of SanntiS biosynthetic gene clusters tool)

Click on Run Tool

Finally, you should have one gff3 file in your history under SanntiS output data

Image of SanntiS output. — **Figure 8**: SanntiS output

Conclusion

Here you now know how to conduct a Marine Omics analysis

Extra information

Coming up soon even more tutorials on and other Earth-System related trainings. Keep an galaxy-eye open if you are interested!

You've Finished the Tutorial

Key points

SMBGC annotation

Marine omics analysis

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Marie Josse, Marine Omics identifying biosynthetic gene clusters (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/ecology/tutorials/marine_omics_bgc/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{ecology-marine_omics_bgc,
author = "Marie Josse",
	title = "Marine Omics identifying biosynthetic gene clusters (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/ecology/tutorials/marine_omics_bgc/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

Fair-Ease

FAIR-EASE is a RIA project funded under HORIZON-INFRA-2021-EOSC-01-04, and it involves a consortium of 25 partners from all over Europe.

EuroScienceGateway

EuroScienceGateway was funded by the European Union programme Horizon Europe (HORIZON-INFRA-2021-EOSC-01-04) under grant agreement number 101057388 and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee grant number 10038963.

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/ecology/tutorials/marine_omics_bgc/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: interproscan
  owner: bgruening
  revisions: 74810db257cc
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: sanntis_marine
  owner: ecology
  revisions: 9d689f8c9ce4
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: regex_find_replace
  owner: galaxyp
  revisions: 503bcd6ebe4b
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: prodigal
  owner: iuc
  revisions: 91791b6ecd38
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.