Inferring single cell trajectories with Scanpy
Author(s) | Marisa Loach Wendi Bacon Julia Jakiela Mehmet Tekman |
Editor(s) | Pavankumar Videm Helena Rasche |
OverviewQuestions:Objectives:
How can I infer lineage relationships between single cells based on their RNA, without a time series?
Requirements:
Execute multiple plotting methods designed to identify lineage relationships between cells
Interpret these plots
- Introduction to Galaxy Analyses
- tutorial Hands-on: Generating a single cell matrix using Alevin
- tutorial Hands-on: Combining single cell datasets after pre-processing
- tutorial Hands-on: Filter, plot and explore single-cell RNA-seq data with Scanpy
Time estimation: 3 hoursSupporting Materials:Published: Dec 8, 2023Last modification: Oct 28, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00379rating Rating: 3.7 (3 recent ratings, 3 all time)version Revision: 9
Introduction
You’ve done all the hard work of preparing a single-cell matrix, processing it, plotting it, interpreting it, and finding lots of lovely genes. Now you want to infer trajectories, or relationships between cells… you can do that here, using the Galaxy interface, or head over to the Jupyter notebook version of this tutorial to learn how to perform the same analysis using Python.
Traditionally, we thought that differentiating or changing cells jumped between discrete states, so ‘Cell A’ became ‘Cell B’ as part of its maturation. However, most data shows otherwise. Generally, there is a spectrum (a ‘trajectory’, if you will…) of small, subtle changes along a pathway of that differentiation. Trying to analyse cells every 10 seconds can be pretty tricky, so ‘pseudotime’ analysis takes a single sample and assumes that those cells are all on slightly different points along a path of differentiation. Some cells might be slightly more mature and others slightly less, all captured at the same ‘time’. These cells are sorted accordingly along these pseudotime paths of differentiation to build a continuum of cells from one state to the next. We therefore ‘assume’ or ‘infer’ relationships from this continuum of cells.
We will use the same sample from the previous three tutorials, which contains largely T-cells in the thymus. We know T-cells differentiate in the thymus, so we would assume that we would capture cells at slightly different time points within the same sample. Furthermore, our cluster analysis alone showed different states of T-cells. Now it’s time to look further!
Comment: Tutorial from ScanpyPlease note, this tutorial is largely based on the trajectories tutorial found on the Scanpy site itself (Alex Wolf 2023).
AgendaIn this tutorial, we will cover:
Important tips for easier analysis
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
- Open your Galaxy server
- Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
- Navigate to your tutorial
- Tool names in tutorials will be blue buttons that open the correct tool for you
- Note: this does not work for all tutorials (yet)
- You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
- We’ve had some issues with Tutorial mode on Safari for Mac users.
- Try a different browser if you aren’t seeing the button.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab currently uses the main European Galaxy infrastructure and power, it’s just organised better for users of particular analyses…like single cell!
Try it out!
- subdomain Europe: Single Cell Omics Lab
- subdomain USA: Single Cell Omics Lab
- subdomain Australia: Single Cell Omics Lab
When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.
- Expand the red history dataset by clicking on it.
- Sometimes you can already see an error message here
View the error message by clicking on the bug icon galaxy-bug
- Check the logs. Output (stdout) and error logs (stderr) of the tool are available:
- Expand the history item
- Click on the details icon
- Scroll down to the Job Information section to view the 2 logs:
- Tool Standard Output
- Tool Standard Error
- For more information about specific tool errors, please see the Troubleshooting section
- Submit a bug report! If you are still unsure what the problem is.
- Click on the bug icon galaxy-bug
- Write down any information you think might help solve the problem
- See this FAQ on how to write good bug reports
- Click galaxy-bug Report button
- Ask for help!
- Where?
- In the User community chatspace in Slack in our #single-cell-users channel
- In the GTN Matrix Channel
- In the Galaxy Matrix Channel
- Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).
- When asking for help, it is useful to share a link to your history
Prepare datasets
Get data
We’ve provided you with experimental data to analyse from a mouse dataset of fetal growth restriction Bacon et al. 2018. This is the full dataset generated from this tutorial (see the study in Single Cell Expression Atlas and the project submission). You can find the final dataset in this input history or download from Zenodo below.
Hands-on: Choose Your Own TutorialThis is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial
Importing via History is quickest. Works only on Galaxy EU for now.
Hands-on: Option 1: Data upload - Import history
Import history from: input history
- Open the link to the shared history
- Click on the Import this history button on the top left
- Enter a title for the new history
- Click on Copy History
Rename galaxy-pencil the the history to your name of choice.
Hands-on: Option 2: Data upload - Add to history
- Create a new history for this tutorial
Import the AnnData object from Zenodo
https://zenodo.org/record/7075718/files/Final_cell_annotated_object.h5ad
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
Rename galaxy-pencil the .h5ad object as
Final cell annotated object
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field to
Final cell annotated object
- Click the Save button
Check that the datatype is
h5ad
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, click galaxy-chart-select-data Datatypes tab on the top
- In the galaxy-chart-select-data Assign Datatype, select
h5ad
from “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Filtering for T-cells
One problem with our current dataset is that it’s not just T-cells: we found in the previous tutorial that it also contains macrophages. This is a problem, because trajectory analysis will generally try to find relationships between all the cells in the sample. We need to remove those cell types to analyse the trajectory.
Hands-on: Removing macrophages
- Manipulate AnnData ( Galaxy version 0.7.5+galaxy1) with the following parameters:
- param-file “Annotated data matrix”:
Final cell annotated object
(Input dataset)- “Function to manipulate the object”:
Filter observations or variables
- “What to filter?”:
Observations (obs)
- “Type of filtering?”:
By key (column) values
- “Key to filter”:
cell_type
- “Type of value to filter”:
Text
- “Filter”:
not equal to
- “Value”:
Macrophages
-- Rename galaxy-pencil output h5ad
T-cell_object.h5ad
You should now have 8569
cells, as opposed to the 8605
you started with. You’ve only removed a few cells (the contaminants!), but it makes a big difference in the next steps.
Force-directed graph
First, we will calculate a force-directed graph (FDG), as an alternate to tSNE, which will likely work better for trajectory analysis.
Calculate force-directed graph
Hands-on: Draw FDG
- Scanpy RunFDG ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
T-cell_object.h5ad
(output of Manipulate AnnData tool)- “Use programme defaults”: galaxy-toggle
No
- “Graph layout”:
fa
Comment: Graph LayoutWe’re using the fa or ForceAtlas2 layout for our FDGs. It is the same layout used in the Jupyter notebook version of this tutorial and works well for our data. As well as choosing the fa layout when we create the FDGs, we will also specify the
draw_graph_fa
embedding when drawing the plots.
Plot the FDG
And now time to plot it!
Hands-on: Plot the FDG
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
cell_type
- “Use raw attributes”: galaxy-toggle
No
- “Location of legend”:
On data
Question
- What has the FDG done to our clusters of T-cells and what might this suggest about the relationships between these groups?
- Well now this is exciting! Our DP-late is more clearly separating, and we might also suppose that DP-M1, DP-M2, and DP-M3 are actually earlier on in the differentiation towards mature T-cells. And we’re only just getting started!
Diffusion maps
We’ll now perform an optional step, that basically takes the place of the standard Principle Component Analysis (PCA). Instead of using PCs, we can use diffusion maps.
Draw diffusion map
Hands-on: Draw the Diffusion Map
- Scanpy DiffusionMap ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “Number of diffusion components to calculate”:
15
Comment: Choosing the number of diffusion componentsWe could change the number of diffusion components and end up with a slightly different plot - a bit like if we changed the number of principal components used in the PCA we ran in the Filter, Plot and Explore tutorial. 15 seems to work well for this dataset and matches the number used in the Jupyter version of this tutorial, so we’ll stick with that.
Re-calculate Nearest Neighbours
Now that we have our diffusion map, we need to re-calculate neighbors using the diffusion map instead of the PCs.
Hands-on: Compute neighbours using diffusion map
- Scanpy ComputeGraph ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
DiffusionMap Anndata
(output of Scanpy DiffusionMap tool)- “Use programme defaults”: param-toggle
No
- “Use the indicated representation”:
X_diffmap
CommentIf you’re using the latest versions of these tools (e.g. Scanpy ComputeGraph ( Galaxy version 1.9.3+galaxy0), rather than the ones suggested in the tutorial (e.g. Scanpy ComputeGraph ( Galaxy version 1.8.1+galaxy9) then you may need to change one more parameter here to set the
Number of PCs to use
to 15. These are the 15 diffusion components we just calculated, rather than actual PCs.
Re-draw the FDG
Now that we’ve re-calculated the nearest neighbours, we can use these new neighbours to re-draw the FDG to see how this changes the plot.
Hands-on: Plot a new FDG
- Scanpy RunFDG ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
Graph object Anndata
(output of Scanpy ComputeGraph tool)- “Use programme defaults”: param-toggle
No
- “Graph layout”:
fa
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
cell_type
- “Use raw attributes”:
No
- “Location of legend”:
On data
Question
- Does this plot seem better or worse than before? Remember that we’re trying to understand the relationships between our groups of cells in time.
- Oh dear! This doesn’t look great. Maybe the DP-M4 cells are a whole other trajectory? That doesn’t seem right. Saying that, this does spread out our T-mature cells, which makes a lot more sense when it comes to T-cell biology (we expect T-cells to differentiate into two types of T-cells, Cd8+Cd4- and Cd4+Cd8-). If you wanted to, you could also re-cluster your cells (since you’ve changed the neighborhood graph on which the clusterisation depends). However, we tried that, and it called far too many clusters given the depth of sequencing in this dataset. Let’s stick with our known cell types and move from there.
If you are working in a group, you can now divide up a decision here with one control and the rest can vary in numbers so that you can compare results throughout the tutorials.
- Control
- Go straight to the Partition-based Graph Abstraction (PAGA) section
- Everyone else:
- you could recluster your cells using Scanpy FindCluster ( Galaxy version 1.8.1+galaxy0) at a different resolution, perhaps lower than the 0.6 we used before (Take a look at the Cell clusters step in the Filter, Plot and Explore tutorial if you need help with this.) Please note that in this case, you will want to change the PAGA step Scanpy PAGA to group by
louvain
rather thancell_type
. You can certainly still plot both, we only didn’t because with using our old Louvain calls, thecell_type
andlouvain
categories are identical.- you could undo the optional diffusion map step by recalculating the neighbours again using
X_pca
instead ofX_diffmap
- you could also try changing the number of neighbors used in that step when running Scanpy ComputeGraph ( Galaxy version 1.8.1+galaxy9)
- Everyone else: You will want to compare FREQUENTLY with your control team member.
Partition-based Graph Abstraction (PAGA)
PAGA is used to generalise relationships between groups, or likely clusters, in this case. It will make it much easier to see the trajectories between our clusters of T-cells.
Hands-on: Plot PAGA
- Scanpy PAGA ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “Name of the clustering”:
cell_type
Comment: Plotting gene expressionWe can now draw our PAGA plot and we might also be interested in colouring our plot by genes as well. In this case, remembering that we are dutifully counting our genes by their EnsemblIDs rather than Symbols (which do not exist for all EnsemblIDs), we have to look up our genes of interest (CD4, CD8a) and plot the corresponding IDs in the next step.
- Scanpy PlotTrajectory ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
PAGA object Anndata
(output of Scanpy PAGA tool)- “Layout functions”:
ForceAtlas2
- “Location of legend”:
On data
- “Use programme defaults”: param-toggle
No
- “Name of cell annotation or gene that is used to color the nodes”:
cell_type,ENSMUSG00000023274,ENSMUSG00000053977
Comment: Choosing the layoutWe’re going to pick the
ForceAtlas2
layout function for our PAGA plots as this is the same type of layout that we used for our FDG.
Question
- How have the relationships between our cell clusters changed now?
- Which clusters are expressing our genes of interest, Cd4 and Cd8, at the highest levels?
- The way the clusters are arranged has changed a bit now. The M4 cluster is right in the middle of the M1-3 clusters, rather than heading off on its own. The M1 cluster is looking like it is driving towards differentiation, which is not something we had necessarily been able to specify before by just looking at our cluster graphs or applying our biological knowledge.
- Cd4 and Cd8 expression appear highest in the DP-L cluster. The expression of both Cd4 and Cd8 also appears higher than we might expect in the DP-M4 cluster - perhaps this is a sign that it is closer to the DP-L cluster than it seems in this simple plot.
Re-draw force-directed graph (again!)
Force-directed graphs can be initialised randomly, or we can prod it in the right direction. We’ll prod it with our PAGA calculations. Note that you could also try prodding it with tSNE or UMAP. A lot of these tools can be used on top of each other or with each other in different ways, this tutorial is just one example. Similarly, you could be using any obs information for grouping, so could do this for louvain or cell_type for instance.
Hands-on: Initialise FDG using PAGA
- Scanpy RunFDG ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
Plotted PAGA Anndata
(output of Scanpy PlotTrajectory tool)- “Use programme defaults”: param-toggle
No
- “Method to initialise embedding, any key for adata.obsm or choose from the preset methods”:
paga
- “Graph layout”:
fa
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
cell_type
- “Use raw attributes”:
No
- “Location of legend”:
On data
Question
- How has basing our FDG on the PAGA plot changed the relationships between our cells?
- The interesting change here occurs between the Double Negative (DN) and Double Positive Module 4 (DP-M4) cells. Our DP-M4 cells are now heading on a clear trajectory towards differentiation. It looks like we’ve got the correct ordering of cells from DN through the DP groups and on towards T-mature. We didn’t see this in our previous plots.
Exploring the results
The experiment that produced this data used two different groups of mice - the control or wildtype group and the knockout mice that were missing a gene involved in placental development, which impacted thymus development. Since we know the genotype of the mice from which each sample was collected, we can colour in our plots to see if there are any differences in the cells present in wildtype and knockout mice.
Plotting by Genotype
The easiest way to do this is just to rerun galaxy-refresh the previous step, but change the attribute we want to use to colour the FDG plot.
Hands-on: Plot by genotype
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
genotype
- “Use raw attributes”:
No
- “Location of legend”:
Right margin
Question
- Are there any differences in the distribution of the wildtype and knockout cells?
- We’re seeing a clear trajectory issue whereby the knockout cells are not found along the trajectory into T-mature (which, well, we kind of already figured out with just the cluster analysis, but we can feel even more confident about our results!)
Plotting Gene Expression
We’re also interested in the expression of the two genes that are known to be markers of the two different types of mature T-cells: Cd4 and Cd8. We can colour in our plot to show which cells are expressing these genes.
Hands-on: Plot for gene expression
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
ENSMUSG00000023274,ENSMUSG00000053977
- “Use raw attributes”:
No
- “Location of legend”:
On data
Comment: Gene SymbolsWe’re using the EnsemblIDs during this tutorial, as discussed above. If you like, you could change the names of these plots to the gene symbols by filling in the optional “Figure title” field with
Cd4,Cd8
. Make sure that the order of your figure titles matches the order of the EnsemblIDs in the colour by field. ENSMUSG00000023274 is Cd4 and ENSMUSG00000053977 is Cd8.
Question
- Does the expression pattern of these genes tell us anything about our cells?
- It’s clear that both Cd4 and Cd8 are being expressed mainly in the later stages of T-cell development, as they head towards the DP-L cluster - although Cd4 expression is a bit more widespread. This is what we would expect to see in genes that are associated with mature T-cells. You might also be able to spot some differences in the expression of the two genes in our mature T-cell group, but there doesn’t seem to be a very clear division between the cells that express Cd4 and those that express Cd8. This is a bit disappointing, as we know that there are two types of mature T-cells, which each express a different gene.
Diffusion pseudotime
Now that we have a reasonable FDG plot for our cells, based on the diffusion map (if used) and PAGA plot, we can place our cells into pseudotime. Pseudotime lets us imagine that instead of looking at a sample of cells taken at a single timepoint, we are looking at cells moving through time. Our sample included cells at different stages of their development, but we can use pseudotime to think of these as different timepoints in the journey of individual cells.
We know that our cells are initialising at DN. We can feed that information into our algorithms by naming DN as the root cell type to then calculate a trajectory starting from these cells.
If you called new clusters using Scanpy FindCluster ( Galaxy version 1.8.1+galaxy0), you might want to choose one of those clusters to be your root cell instead, so change the
cell_type
forlouvain
and then name the cluster number. Use the plots you created to help you pick the number!
On to the diffusion pseudotime, where we infer multiple time points within the same piece of data!
Hands-on: DPT Plot
- Scanpy DPT ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
FDG object Anndata
(output of Scanpy RunFDG tool)- “Name of attribute that defines clustering”:
cell_type
- “Name of the clustering that defines the root cell type”:
DN
- Scanpy PlotEmbed ( Galaxy version 1.8.1+galaxy9) with the following parameters:
- param-file “Input object in AnnData/Loom format”:
Diffusion pseudotime inference Anndata
(output of Scanpy DPT tool)- “name of the embedding to plot”:
draw_graph_fa
- “color by attributes, comma separated texts”:
cell_type,dpt_pseudotime
- “Use raw attributes”:
No
- “Location of legend”:
On data
Question
- Does the pseudotime plot match with your expectations of which cells represent earlier or later stages of T-cell development?
- When we look at the cell type and DPT plots, we can see that there’s a clear progression from our root DN cells, through the various groups of DP cells, into DP-L and then T-mat. This matches with our expectations that the DP-L and DP-mat clusters represent later stages in T-cell development. The DPT plot also confirms that the DP-M4 cluster is heading towards differentiation, which makes sense given its position on our FDG plot.
This is nice, as it supports our conclusions thus far on the trajectory of the T-cell differentiation. With single-cell, the more ways you can prove to yourself what you’re seeing is real, the better! If we did not find consistent results, we would need to delve in further to see if the cause is the algorithm (not all algorithms fit all data!) or the biology.
Where might we go from here? We might consider playing with our louvain resolutions, to see if we can get the two groups of Cd4+ and Cd8+ cells to be called as different clusters, and then comparing them to each other for gene differences or genotype differences. We might also use different objects (for instance, what if we regressed out cell cycle genes?) and see if that changes the results. What would you do?
Look at each others images! How do yours differ, what decisions were made? Previously, when calling clusters in the Filter, Plot and Explore Single-cell RNA-seq Data tutorial, the interpretation at the end is largely consistent, no matter what decisions are made throughout (mostly!). Is this the case with your trajectory analyses? You may find that it is not, which is why pseudotime analysis even more crucially depends on your understanding of the underlying biology (we have to choose the root cells, for instance, or recognise that DN cells should not be found in the middle of the DPs) as well as choosing the right analysis. That’s why it is a huge field! With analysing scRNA-seq data, it’s almost like you need to know about 75% of your data and make sure your analysis shows that, for you to then identify the 25% new information.
Conclusion
Congratulations! You’ve made it to the end! You might be interested in the workflowworkflow for this tutorial or this galaxy-historyExample History which shows the results you should expect to see if you follow this tutorial.
In this tutorial, you moved from called clusters to inferred relationships and trajectories using pseudotime analysis. You found an alternative to PCA (diffusion map), an alternative to tSNE (force-directed graph), a means of identifying cluster relationships (PAGA), and a metric for pseudotime (diffusion pseudotime) to identify early and late cells. If you were working in a group, you found that such analysis is slightly more sensitive to your decisions than the simpler filtering/plotting/clustering is. We are inferring and assuming relationships and time, so that makes sense!