GO Enrichment Analysis on Single-Cell RNA-Seq Data

Author(s)	Menna Gamal
Editor(s)	Wendi Bacon Björn Grüning Pablo Moreno Nicola Soranzo

Overview
Questions:

What is Gene Ontology (GO) enrichment analysis, and why should I perform it on my marker genes?

How can I use GO enrichment analysis to better understand the biological functions of the genes in my clusters?

Can GO enrichment analysis help me confirm that my clusters represent distinct cell types or states?

How can I visualize my GO enrichment results to make them easier to understand and interpret?

Objectives:

Understand the role of GO Enrichment in Single-Cell Analysis.

Use marker genes from different cell clusters or conditions for GO enrichment analysis.

Compare enrichment across experimental conditions (e.g., wild type vs. knockout) to uncover functional changes associated with genetic or environmental perturbations.

Link GO enrichment results with previously annotated cell clusters, providing a clearer picture of the functional roles of different cell populations.

Requirements:

Introduction to Galaxy Analyses

Hands-on: Filter, plot and explore single-cell RNA-seq data with Scanpy

Time estimation: 3 hours

Supporting Materials:

Slides

Datasets

Workflows

FAQs

Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.cz ✅

UseGalaxy.no ✅

Published: Sep 17, 2024

Last modification: Apr 8, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00456

Revision: 8

In the tutorial Filter, plot and explore single-cell RNA-seq data with Scanpy, we took an important step in our single-cell RNA sequencing analysis by identifying marker genes for each of the clusters in our dataset. These marker genes are crucial, as they help us distinguish between different cell types and states, giving us a clearer picture of the cellular diversity within our samples. However, identifying marker genes is just the beginning. To understand what makes each cluster unique, we need to look at the biological functions these genes are involved in. This is where Gene Ontology (GO) enrichment analysis comes into play.

We will perform GO enrichment analysis as a type of over-representation analysis (ORA). ORA is a statistical method that determines whether genes from pre-defined sets (e.g. genes belonging to a specific GO term) are present in a subset of your data, such as a list of marker genes, more often than would be expected by chance. The most commonly used statistical tests are Fisher’s exact test and the hypergeometric test; more details about them are given in the tutorial slides.

Agenda

In this tutorial, we will cover:

Data description

[A] Marker Genes:

[B] GO Enrichment Files:

Get data

Important tips for easier analysis

Data processing

GO Analysis using GOEnrichment tool

GO Analysis using gProfiler GOSt tool

Conclusion

Data description

In this tutorial, we will use the following datasets:

[A] Marker Genes:

We’ll start with two input datasets of marker genes (Study sets):

Marker genes per cell cluster: This dataset lists the genes that are differentially enriched in each cell cluster.
Marker genes per condition (wt and ko): This dataset lists the genes that are differentially enriched between the wild-type (wt) and knockout (ko) conditions.

Note: Marker genes were obtained using the Scanpy FindMarkers tool, which selects marker genes based on their log2 fold change and p-values. The top 50 marker genes were included in the downstream GO enrichment analysis. Focusing on the top-ranked genes filters out less relevant genes and so reduces the risk of false positives.
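The ranking step described in the note can be sketched in Python. This is only an illustration: the column names (group, gene, log2fc, pval_adj), the toy values, and the exact ranking rule (adjusted p-value first, then log2 fold change) are assumptions, not the documented behaviour of the Scanpy FindMarkers tool.

```python
from collections import defaultdict

def top_markers(rows, n=50):
    """Return up to n genes per group, ranked by adjusted p-value
    (ascending) and then by log2 fold change (descending)."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row)
    top = {}
    for group, genes in by_group.items():
        ranked = sorted(genes, key=lambda r: (r["pval_adj"], -r["log2fc"]))
        top[group] = [r["gene"] for r in ranked[:n]]
    return top

# Toy table with hypothetical column names and made-up values.
rows = [
    {"group": "1", "gene": "GeneA", "log2fc": 3.2, "pval_adj": 1e-10},
    {"group": "1", "gene": "GeneB", "log2fc": 1.1, "pval_adj": 1e-4},
    {"group": "1", "gene": "GeneC", "log2fc": 2.5, "pval_adj": 1e-10},
    {"group": "2", "gene": "GeneD", "log2fc": 4.0, "pval_adj": 1e-6},
]
print(top_markers(rows, n=2))
```

In the tutorial data this selection has already been made, so the sketch only shows what "top-ranked" could mean in practice.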

[B] GO Enrichment Files:

We’ll also use three additional files for GO enrichment analysis.

Gene Ontology file: This file contains information about Gene Ontology terms.
GO Annotations file: This file maps genes to their corresponding GO terms.
Population set file: This file provides a list of all genes involved in the experiment and is used as a background gene set for the analysis.

Note: There are several online databases available for downloading GO and GO Annotations files, including the Gene Ontology website, Ensembl, and the UCSC Genome Browser.

Comment: Concept behind GO Enrichment Analysis

The goal of GO enrichment analysis is to interpret the biological significance of long lists of marker genes by summarizing these genes into a shorter list of enriched GO terms. For each GO term, the analysis compares how often the term is annotated to genes in your list of marker genes with how often it is annotated to genes in a background gene set. A statistical test then yields a p-value that indicates whether the GO term is significantly enriched in the marker gene list compared to the background.
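The comparison described above can be made concrete with a minimal Python sketch of the hypergeometric test for a single GO term. The function name and the counts are illustrative assumptions; the tools used in this tutorial may differ in details such as multiple-testing correction.

```python
from math import comb

def ora_p_value(population_size, term_in_population, study_size, term_in_study):
    """Upper-tail hypergeometric probability P(X >= term_in_study).

    X is the number of term-annotated genes obtained when study_size genes
    are drawn at random, without replacement, from the background.
    """
    total = comb(population_size, study_size)
    max_hits = min(term_in_population, study_size)
    tail = sum(
        comb(term_in_population, k)
        * comb(population_size - term_in_population, study_size - k)
        for k in range(term_in_study, max_hits + 1)
    )
    return tail / total

# Made-up counts: 10,000 background genes, 200 of them annotated with the
# term; 50 marker genes, 8 of them annotated with the term.
p = ora_p_value(10_000, 200, 50, 8)
print(f"p-value: {p:.3g}")
```

A small p-value means that seeing this many annotated genes in the marker list would be unlikely if the markers were a random sample of the background.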

Get data

You can access the data for this tutorial in multiple ways:

Importing from a history - You can import this history
1. Open the link to the shared history
2. Click on the Import this history button on the top left
3. Enter a title for the new history
4. Click on Copy History
Uploading from Zenodo (see below)

Hands On: Data Upload from Zenodo
Create a new history for this tutorial

To create a new history, simply click the new-history icon at the top of the history panel.
Import the files from Zenodo
https://zenodo.org/records/13461890/files/Galaxy3-[GO].obo
https://zenodo.org/records/13461890/files/Galaxy2-[GO_annotations_Mus_musculus].tabular
https://zenodo.org/records/13461890/files/Galaxy5-[Markers_-_clusters].tabular
https://zenodo.org/records/13461890/files/Galaxy4-[Background_gene_set].tabular
https://zenodo.org/records/13461890/files/Galaxy1-[Markers_-_genotype_].tabular
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Libraries (left panel)

Navigate to the correct folder as indicated by your instructor.

On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
Rename the datasets

Check that the datatype is tabular

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click the Datatypes tab on the top

In Assign Datatype, select tabular from the “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

Open your Galaxy server

Click on the curriculum icon on the top menu; this will open the GTN inside Galaxy.

Navigate to your tutorial

Tool names in tutorials will be blue buttons that open the correct tool for you

Note: this does not work for all tutorials (yet)

You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface

Warning: Not all browsers work!

We’ve had some issues with Tutorial mode on Safari for Mac users.

Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.

Try it out!

Europe: Single Cell Omics Lab

USA: Single Cell Omics Lab

Australia: Single Cell Omics Lab

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

Expand the red history dataset by clicking on it.

Sometimes you can already see an error message here

View the error message by clicking on the bug icon

Check the logs. Output (stdout) and error logs (stderr) of the tool are available:

Expand the history item

Click on the details icon

Scroll down to the Job Information section to view the 2 logs:

Tool Standard Output

Tool Standard Error

For more information about specific tool errors, please see the Troubleshooting section

Submit a bug report if you are still unsure what the problem is.

Click on the bug icon

Write down any information you think might help solve the problem

See this FAQ on how to write good bug reports

Click the Report button

Ask for help!

Where?

In the User community chatspace in Slack in our #single-cell-users channel

In the GTN Matrix Channel

In the Galaxy Matrix Channel

Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).

When asking for help, it is useful to share a link to your history

Data processing

To perform GO enrichment analysis on each cell cluster individually, we need to separate our “Markers_clusters Dataset” into seven files, one for each cluster, and the “Markers_genotype Dataset” into two files, one for each condition. We’ll use the “Split file” tool for this step.

Hands On: File splitting

Split file (Galaxy version 0.4) with the following parameters:

“File to select”: Markers_cluster (Input dataset)

“on column”: c1

“Include the header in all splitted files?”: Yes

Comment: Input Dataset

We have two datasets: one with the marker genes for all seven clusters and one with the marker genes for the knockout (KO) and wild-type (WT) conditions. Make sure to repeat the analysis for both datasets. Alternatively, you can run this workflow to analyse the datasets in parallel: under Marker genes, choose the second icon to select multiple datasets, as shown in the image below.

Figure 1: Multiple input datasets workflow
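For readers who want to see what the splitting step does to the table, here is a minimal Python sketch. It is not the implementation of the Split file tool; the function name, the toy column names, and the IDs are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def split_on_column(text, column=1, delimiter="\t"):
    """Split a tabular text into one table per distinct value in a column.

    `column` is 1-based, like Galaxy's c1, c2, ... notation. The header row
    is repeated at the top of every output table.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    groups = defaultdict(list)
    for row in reader:
        if row:  # skip blank lines
            groups[row[column - 1]].append(row)
    return {key: [header] + rows for key, rows in groups.items()}

# Toy marker table; the column names and IDs are hypothetical.
table = (
    "cluster\tgene\tscore\tensembl_id\n"
    "1\tGeneA\t9.1\tID_0001\n"
    "2\tGeneB\t7.4\tID_0002\n"
    "1\tGeneC\t6.8\tID_0003\n"
)
parts = split_on_column(table, column=1)
for cluster, rows in parts.items():
    print(cluster, rows)
```

Each key of the returned dictionary corresponds to one output file of the split: one per cluster, or one per condition for the genotype dataset.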

Next, we need to isolate the Ensembl gene IDs column from each file. We’ll use the “Cut Columns” tool to achieve this.

Hands On: Extract Ensembl IDs

Cut with the following parameters:

“Cut columns”: c4

“From”: split_output (output of Split file tool)

Comment: The gene format to use

In this example we extract column 4 because it contains the Ensembl gene IDs, which is the format the subsequent steps work with. Other gene formats exist, such as gene symbols and Entrez gene IDs, so make sure to check the specific format accepted by the tool you are using. There are also tools available to convert between different gene formats if needed.
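The column extraction can be sketched in Python as follows. The function name and the toy table are hypothetical, and dropping the header row is a choice made in this sketch so that only IDs remain; it is not a description of how the Cut tool treats headers.

```python
def cut_column(text, column=4, delimiter="\t", has_header=True):
    """Return the values of one 1-based column as a list of strings."""
    lines = [line for line in text.splitlines() if line.strip()]
    if has_header:
        lines = lines[1:]  # sketch-specific choice: keep only the IDs
    return [line.split(delimiter)[column - 1] for line in lines]

# Toy split output; the column names and IDs are made up for illustration.
split_output = (
    "cluster\tgene\tscore\tensembl_id\n"
    "1\tGeneA\t9.1\tID_0001\n"
    "1\tGeneC\t6.8\tID_0003\n"
)
study_set = cut_column(split_output, column=4)
print("\n".join(study_set))
```

The resulting list, one identifier per line, is the shape of input that a study set for enrichment analysis takes.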

GO Analysis using GOEnrichment tool

Now we will perform the GO enrichment analysis on the list of Ensembl gene IDs.

Hands On: GOEnrichment

GOEnrichment (Galaxy version 2.0.1) with the following parameters:

“Gene Ontology File”: GO (Input dataset)

“Gene Product Annotation File”: GO annotations Mus musculus (Input dataset)

“Study Set File”: out_file1 (output of Cut tool)

“Population Set File (Optional)”: Background gene set (Input dataset)

Comment: Population Set File Selection for GO Enrichment

When choosing a background for GO enrichment analysis (Population Set File), it’s important to consider the context of your data. While using a broad background (like all genes in the organism) is common, it might be more informative to limit the background to genes expressed in the specific tissue or cell type being profiled. In this tutorial, the background consists of all genes involved in the experiment, that is, the gene list before the marker genes were selected.
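To illustrate why the background matters, the sketch below tests the same study set against a broad and a narrow background using a hypergeometric upper tail. All counts are made up, and the sketch assumes that every gene annotated with the term is present in both backgrounds.

```python
from math import comb

def upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    hits = range(k, min(K, n) + 1)
    return sum(comb(K, i) * comb(N - K, n - i) for i in hits) / comb(N, n)

# Same study set (50 genes, 5 annotated with the term), two backgrounds.
# Assumption: all 200 term-annotated genes are in both backgrounds.
broad = upper_tail(N=20_000, K=200, n=50, k=5)   # term covers 1% of background
narrow = upper_tail(N=8_000, K=200, n=50, k=5)   # term covers 2.5% of background
print(f"broad background:  p = {broad:.3g}")
print(f"narrow background: p = {narrow:.3g}")
```

With the broad background the term looks rarer, so the same five hits give a smaller p-value; a background that is too broad can therefore make terms typical of the tissue appear enriched.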

Question

Take a look at the enriched terms for the different clusters. Can you find any GO terms that are specific to cluster 7?

Can we perform manual annotation of cluster 7 based on GO enrichment results?

Cluster 7 is enriched for terms like “regulation of cell death”, “T cell-mediated cytotoxicity”, and “peptidase activator activity involved in the apoptotic process”.

By looking at the most enriched functions and using our biological knowledge, we can figure out the cell types for many clusters. For example, since the data comes from thymus tissue, we already have an idea of the cell types we might find. The enriched terms in cluster 7 are consistent with macrophages, which support thymocyte maturation by cleaning up dead cells and debris.

GO Analysis using gProfiler GOSt tool

The gProfiler GOSt (Gene Ontology Sequential Testing) is another popular tool used to perform gene ontology (GO) enrichment analysis. In addition to providing enrichment results for the standard GO categories of Biological Process (BP), Cellular Component (CC), and Molecular Function (MF), the tool also analyzes enrichment across several other functional annotation databases, including KEGG Pathways, Reactome Pathways, WikiPathways and TF Targets. It also gives a plot to better visualize the results.

Hands On: gProfiler GOSt

gProfiler GOSt (Galaxy version 0.1.7+galaxy11) with the following parameters:

“Input is whitespace-separated list of genes, proteins, probes, term IDs or chromosomal regions.”: out_file1 (output of Cut tool)

“Organism”: Common organisms

“Common organisms”: Mus musculus (Mouse)

In “Tool settings”:

“Export plot”: Yes

Comment: Picking the right species matters

The species you select should match the species your genes come from. If you choose the wrong species, the tool might use incorrect information, leading to inaccurate results. For example, gene identifiers and GO annotations differ between human and mouse, so selecting the correct species ensures the analysis is relevant to your data.

Question

Can you find enriched GO terms in the KO results file that are in line with the published study findings?

What might be happening to the stem cells in the KO mice compared to the WT mice?

In the KO g:GOSt result file, enrichment for the GO term “Negative regulation of stem cell differentiation” is found.

This suggests that the KO condition delays the differentiation of stem cells into mature T cells in the thymus, which is in line with the study findings.

Conclusion

In this tutorial, we have performed GO enrichment analysis on the differentially expressed genes between two conditions and between different cell types. This analysis provided insights into the biological processes, molecular functions, and cellular components associated with the gene sets, improving our understanding of the mechanisms underlying the studied conditions.

You've Finished the Tutorial

Key points

GO enrichment helps make sense of your data and understand what makes each cell cluster/condition unique.

GO enrichment analysis is used to discover new insights about how cells work, which can lead to better understanding of biological processes and diseases.

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Menna Gamal, GO Enrichment Analysis on Single-Cell RNA-Seq Data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/GO-enrichment/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-GO-enrichment,
author = "Menna Gamal",
	title = "GO Enrichment Analysis on Single-Cell RNA-Seq Data (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/GO-enrichment/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

ELIXIR Europe

de.NBI

UFR

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

Hands-on: Pseudobulk Analysis with Decoupler and EdgeR

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/GO-enrichment/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: split_file_on_column
  owner: bgruening
  revisions: 37a53100b67e
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: goenrichment
  owner: iuc
  revisions: 2c7c9646ccf0
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: gprofiler_gost
  owner: iuc
  revisions: bf39cdd007f5
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
