GO Enrichment Analysis on Single-Cell RNA-Seq Data

Overview
Questions:
  • What is Gene Ontology (GO) enrichment analysis, and why should I perform it on my marker genes?

  • How can I use GO enrichment analysis to better understand the biological functions of the genes in my clusters?

  • Can GO enrichment analysis help me confirm that my clusters represent distinct cell types or states?

  • How can I visualize my GO enrichment results to make them easier to understand and interpret?

Objectives:
  • Understand the Role of GO Enrichment in Single-Cell Analysis.

  • Use marker genes from different cell clusters or conditions for GO enrichment analysis.

  • Compare enrichment across experimental conditions (e.g., wild type vs. knockout) to uncover functional changes associated with genetic or environmental perturbations.

  • Link GO enrichment results with previously annotated cell clusters, providing a clearer picture of the functional roles of different cell populations.

Time estimation: 3 hours
Published: Sep 17, 2024
Last modification: Sep 17, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1
Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Data description
    1. [A] Marker Genes:
    2. [B] GO Enrichment Files:
  3. Get data
    1. Important tips for easier analysis
  4. Data processing
  5. GO Analysis using GOEnrichment tool
  6. GO Analysis using gProfiler GOSt tool
  7. Conclusion

Introduction

In the tutorial Filter, plot and explore single-cell RNA-seq data with Scanpy, we took an important step in our single-cell RNA sequencing analysis by identifying marker genes for each of the clusters in our dataset. These marker genes are crucial, as they help us distinguish between different cell types and states, giving us a clearer picture of the cellular diversity within our samples. However, identifying marker genes is just the beginning. To understand what makes each cluster unique, we need to look at the biological functions these genes are involved in. This is where Gene Ontology (GO) enrichment analysis comes into play.

We will perform GO enrichment analysis as a type of over-representation analysis (ORA). ORA is a statistical method that determines whether genes from pre-defined sets (e.g. genes belonging to a specific GO term) occur more often in a subset of your data, such as a list of marker genes, than would be expected by chance. The most commonly used statistical tests are Fisher’s exact test and the hypergeometric test; more details about them are given in the tutorial slides.

Data description

In this tutorial, we will use the following datasets:

[A] Marker Genes:

We’ll start with two input datasets of marker genes (Study sets):

  • Marker genes per cell cluster: This dataset lists the genes that are significantly different in each cell cluster.
  • Marker genes per condition (wt and ko): This dataset lists the genes that are significantly different between the wild-type (wt) and knockout (ko) conditions.

Note: Marker genes were obtained using the Scanpy FindMarkers tool, which selects marker genes based on their log2 fold change and p-values. The top 50 marker genes were included in the downstream GO enrichment analysis. Focusing on the top-ranked genes helps to filter out less relevant genes, thereby addressing the concern of high false positives that can arise from traditional methods.

[B] GO Enrichment Files:

We’ll also use three additional files for GO enrichment analysis.

  • Gene Ontology file: This file contains information about Gene Ontology terms.
  • GO Annotations file: This file maps genes to their corresponding GO terms.
  • Population set file: This file provides a list of all genes involved in the experiment and is used as a background gene set for the analysis.

Note: There are several online databases available for downloading GO and GO Annotations files, including the Gene Ontology website, Ensembl, and the UCSC Genome Browser.

Comment: Concept behind GO Enrichment Analysis

The goal of GO enrichment analysis is to interpret the biological significance of long lists of marker genes by summarizing them into a shorter list of enriched GO terms. For each GO term, the analysis compares the proportion of genes annotated with that term in your list of marker genes against the proportion in a background gene set. A statistical test then yields a p-value that indicates whether the term is significantly enriched in the marker gene list compared to the background.
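To make the comparison against the background concrete, here is a minimal Python sketch of an over-representation test based on the hypergeometric distribution. It is an illustration only, not the implementation used by the tools in this tutorial: the function names, the gene names and the term names are hypothetical, and steps that real tools may perform (such as propagating annotations through the ontology or correcting for multiple testing) are left out.

```python
from math import comb


def hypergeom_tail(hits, study_size, annotated, population_size):
    """P(X >= hits) when study_size genes are drawn without replacement
    from population_size genes, of which `annotated` carry the term."""
    max_hits = min(study_size, annotated)
    favourable = sum(
        comb(annotated, k) * comb(population_size - annotated, study_size - k)
        for k in range(hits, max_hits + 1)
    )
    return favourable / comb(population_size, study_size)


def enrich(study_genes, population_genes, gene_to_terms):
    """Return {term: p-value} for every term seen in the study set."""
    population = set(population_genes)
    study = set(study_genes) & population  # ignore genes outside the background
    terms = {t for g in study for t in gene_to_terms.get(g, ())}
    results = {}
    for term in terms:
        annotated = sum(term in gene_to_terms.get(g, ()) for g in population)
        hits = sum(term in gene_to_terms.get(g, ()) for g in study)
        results[term] = hypergeom_tail(hits, len(study), annotated, len(population))
    return results


# Hypothetical toy data: 10 background genes, 5 marker genes.
population_genes = [f"g{i}" for i in range(1, 11)]
study_genes = ["g1", "g2", "g3", "g4", "g5"]
gene_to_terms = {g: {"TERM_B"} for g in population_genes}  # TERM_B annotates every gene
for g in ("g1", "g2", "g3"):
    gene_to_terms[g] = {"TERM_A", "TERM_B"}  # TERM_A annotates only three genes

results = enrich(study_genes, population_genes, gene_to_terms)
for term, p_value in sorted(results.items()):
    print(term, f"{p_value:.4f}")
```

In this toy example, TERM_A is carried by three background genes and all three are in the study set, so it receives a small p-value. TERM_B is carried by every gene, so finding it in the study set is uninformative.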

Get data

You can access the data for this tutorial in multiple ways:

  1. Importing from a history - You can import this history

    1. Open the link to the shared history
    2. Click on the Import this history button on the top left
    3. Enter a title for the new history
    4. Click on Copy History

  2. Uploading from Zenodo (see below)

Hands-on: Data Upload from Zenodo
  1. Create a new history for this tutorial
  2. Import the files from Zenodo

    https://zenodo.org/records/13461890/files/Galaxy3-[GO].obo
    https://zenodo.org/records/13461890/files/Galaxy2-[GO_annotations_Mus_musculus].tabular
    https://zenodo.org/records/13461890/files/Galaxy5-[Markers_-_clusters].tabular
    https://zenodo.org/records/13461890/files/Galaxy4-[Background_gene_set].tabular
    https://zenodo.org/records/13461890/files/Galaxy1-[Markers_-_genotype_].tabular
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype is tabular

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu; this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet)
  • You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power; it’s just organised better for users of particular analyses…like single cell!

Try it out! All your histories/workflows/logins from the general European Galaxy server will be there!

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

  1. Expand the red history dataset by clicking on it.
    • Sometimes you can already see an error message here
  2. View the error message by clicking on the bug icon galaxy-bug

  3. Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
    • Expand the history item
    • Click on the details icon
    • Scroll down to the Job Information section to view the 2 logs:
      • Tool Standard Output
      • Tool Standard Error
    • For more information about specific tool errors, please see the Troubleshooting section
  4. Submit a bug report if you are still unsure what the problem is.
    • Click on the bug icon galaxy-bug
    • Write down any information you think might help solve the problem
      • See this FAQ on how to write good bug reports
    • Click galaxy-bug Report button
  5. Ask for help!

Data processing

To perform GO enrichment analysis on each cell cluster individually, we need to separate the “Markers_clusters” dataset into seven files, one for each cluster, and the “Markers_genotype” dataset into two files, one for each condition. We’ll use the “Split file” tool for this step.

Hands-on: File splitting
  1. Split file (Galaxy version 0.4) with the following parameters:
    • param-file “File to select”: Markers_cluster (Input dataset)
    • “on column”: c1
    • “Include the header in all splitted files?”: Yes
    Comment: Input Dataset

    We have two datasets: one with the marker genes for all seven clusters and one with the marker genes for the knockout (KO) and wild-type (WT) conditions. Make sure to repeat the analysis for each of the two datasets. Alternatively, you can run this workflow for parallel analysis of the datasets: under Marker genes, choose the second icon to select multiple datasets, as shown in the image below.

Figure 1: Multiple input datasets workflow
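For readers who want to see what the splitting step does to the data, here is a minimal Python sketch. It is not the implementation of the Split file tool: it groups rows by the value in the first column (c1) and repeats the header in every group. The function name, column names and gene rows are hypothetical placeholders.

```python
from collections import defaultdict


def split_by_column(rows, column_index=0):
    """Group data rows by one column and repeat the header in each group."""
    header, *data = rows
    groups = defaultdict(list)
    for row in data:
        groups[row[column_index]].append(row)
    return {key: [header] + members for key, members in groups.items()}


# Hypothetical marker table: the cluster label is assumed to be in column 1.
rows = [
    ["cluster", "rank", "symbol", "ensembl_id"],
    ["1", "0", "GeneA", "ENSMUSG00000000001"],
    ["2", "0", "GeneB", "ENSMUSG00000000002"],
    ["1", "1", "GeneC", "ENSMUSG00000000003"],
]

groups = split_by_column(rows)
for key, table in sorted(groups.items()):
    print(f"cluster {key}: {len(table) - 1} marker gene(s)")
```

Each entry of `groups` corresponds to one of the per-cluster (or per-condition) files produced by the split.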

Next, we need to isolate the Ensembl gene IDs column from each file. We’ll use the “Cut Columns” tool to achieve this.

Hands-on: Extract Ensembl IDs
  1. Cut with the following parameters:
    • “Cut columns”: c4
    • param-file “From”: split_output (output of Split file tool)
    Comment: The gene format to use

    In this example, we extract column 4 because it contains the Ensembl gene IDs, which the subsequent steps take as input. Other gene identifier formats exist, such as gene symbols and Entrez gene IDs, so make sure to check which format is accepted by the tool you are using. There are also tools available to convert between different gene formats if needed.
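The column extraction can be pictured with a few lines of Python. This is a sketch under assumptions rather than a description of the Cut tool: only the position of the Ensembl ID column (c4) comes from the tutorial, while the function name, the other columns and the choice to drop the header line are illustrative.

```python
def cut_column(rows, column_number, has_header=True):
    """Return the values of one 1-based column (c4 -> column_number=4)."""
    # Dropping the header is a choice made in this sketch only.
    data_rows = rows[1:] if has_header else rows
    return [row[column_number - 1] for row in data_rows]


# Hypothetical table for a single cluster.
cluster_rows = [
    ["cluster", "rank", "symbol", "ensembl_id"],
    ["1", "0", "GeneA", "ENSMUSG00000000001"],
    ["1", "1", "GeneC", "ENSMUSG00000000003"],
]

ensembl_ids = cut_column(cluster_rows, column_number=4)
print("\n".join(ensembl_ids))
```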

GO Analysis using GOEnrichment tool

Now we will perform the GO enrichment analysis on the list of Ensembl gene IDs.

Hands-on: GOEnrichment
  1. GOEnrichment (Galaxy version 2.0.1) with the following parameters:
    • param-file “Gene Ontology File”: GO (Input dataset)
    • param-file “Gene Product Annotation File”: GO annotations Mus musculus (Input dataset)
    • param-file “Study Set File”: out_file1 (output of Cut tool)
    • param-file “Population Set File (Optional)”: Background gene set (Input dataset)
    Comment: Population Set File Selection for GO Enrichment

    When choosing a background for GO enrichment analysis (Population Set File), it’s important to consider the context of your data. While using a broad background (like all genes in the organism) is common, it might be more informative to limit the background to genes expressed in the specific tissue or cell type being profiled. In this tutorial, the background consists of the genes that were present in the experiment before the marker genes were selected.
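Because the marker genes were selected from the genes in the experiment, every gene in a study set would be expected to appear in the population set as well. The following Python sketch shows one way to check this before running the enrichment. The function name and the gene IDs are hypothetical, and how a particular tool treats study genes that are missing from the background is not covered here.

```python
def compare_study_to_population(study_ids, population_ids):
    """Summarise set sizes and list study genes absent from the background."""
    study = set(study_ids)
    population = set(population_ids)
    return {
        "study_size": len(study),
        "population_size": len(population),
        "missing_from_population": sorted(study - population),
    }


# Hypothetical IDs: the last study gene is not part of the background.
report = compare_study_to_population(
    ["ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000099"],
    ["ENSMUSG00000000001", "ENSMUSG00000000002", "ENSMUSG00000000003"],
)
print(report)
```

A non-empty `missing_from_population` list is a hint that the study set and the background were derived from different gene lists or use different identifier formats.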

Question
  1. Take a look at the enriched terms for the different clusters. Can you find any GO terms that are specific to cluster 7?
  2. Can we perform manual annotation of cluster 7 based on GO enrichment results?
Solution
  1. Cluster 7 is enriched for terms like “regulation of cell death”, “T cell-mediated cytotoxicity”, and “peptidase activator activity involved in the apoptotic process”.
  2. By looking at the most enriched functions and using our biological knowledge, we can figure out the cell types for many clusters. For example, since the data comes from thymus tissue, we already have an idea of the cell types we might find. The enriched terms in cluster 7 confirm that the cell type is macrophages, which support thymocyte maturation by cleaning up dead cells and debris.

GO Analysis using gProfiler GOSt tool

The gProfiler GOSt (Gene Ontology Sequential Testing) is another popular tool used to perform gene ontology (GO) enrichment analysis. In addition to providing enrichment results for the standard GO categories of Biological Process (BP), Cellular Component (CC), and Molecular Function (MF), the tool also analyzes enrichment across several other functional annotation databases, including KEGG Pathways, Reactome Pathways, WikiPathways and TF Targets. It also gives a plot to better visualize the results.

Hands-on: gProfiler GOSt
  1. gProfiler GOSt (Galaxy version 0.1.7+galaxy11) with the following parameters:
    • param-file “Input is whitespace-separated list of genes, proteins, probes, term IDs or chromosomal regions.”: out_file1 (output of Cut tool)
    • “Organism”: Common organisms
      • “Common organisms”: Mus musculus (Mouse)
    • In “Tool settings”:
      • “Export plot”: Yes
    Comment: Picking the right species matters

    The species you select should match the species your genes come from. If you choose the wrong species, the tool might use incorrect information, leading to inaccurate results. For example, human and mouse genes have different identifiers and annotations, so selecting the correct species ensures the analysis is relevant to your data.
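One quick sanity check is to confirm that the gene IDs look like they belong to the selected organism. Ensembl gene IDs carry a species-specific prefix (ENSMUSG for mouse, ENSG for human) followed by 11 digits. The Python sketch below uses this to flag unexpected IDs. The function name and the example query are hypothetical, and only these two species are covered.

```python
import re

# Ensembl gene ID patterns; an optional ".version" suffix is tolerated.
ENSEMBL_GENE_PATTERNS = {
    "Mus musculus": re.compile(r"ENSMUSG\d{11}(\.\d+)?"),
    "Homo sapiens": re.compile(r"ENSG\d{11}(\.\d+)?"),
}


def ids_not_matching_species(gene_ids, species):
    """Return the IDs that do not look like Ensembl gene IDs of `species`."""
    pattern = ENSEMBL_GENE_PATTERNS[species]
    return [gene_id for gene_id in gene_ids if not pattern.fullmatch(gene_id)]


# Hypothetical query containing one human-style ID among mouse-style IDs.
query = ["ENSMUSG00000000001", "ENSG00000141510", "ENSMUSG00000000003"]
unexpected = ids_not_matching_species(query, "Mus musculus")
print("IDs that do not match the selected organism:", unexpected)
```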

Question
  1. Can you find enriched GO terms in the KO results file that are in line with the published study findings?
  2. What might be happening to the stem cells in the KO mice compared to the WT mice?
Solution
  1. In the KO g:GOSt result file, the GO term “Negative regulation of stem cell differentiation” is enriched.
  2. This suggests that the KO condition is causing a delay in the differentiation of stem cells into mature T cells in the thymus, which is in line with the study findings.

Conclusion

In this tutorial, we have performed GO enrichment analysis on the marker genes that distinguish two different conditions and those that distinguish different cell clusters. This analysis provided valuable insights into the biological processes, molecular functions, and cellular components associated with the gene sets, enhancing our understanding of the underlying mechanisms involved in the studied conditions.