Filter, plot and explore single-cell RNA-seq data with Scanpy

Author(s)	Wendi Bacon
Editor(s)	Helena Rasche Julia Jakiela Wendi Bacon
Tester(s)	Julia Jakiela
Reviewers
Infrastructure	Pavankumar Videm Amirhossein Naghsh Nilchi

Overview
Questions:

How can I assess my dataset quality?

How do I pick thresholds and parameters in my analysis? What’s a “reasonable” number, and will the world collapse if I pick the wrong one?

How do I generate and annotate cell clusters?

Objectives:

Interpret quality control plots to inform parameter-decision making

Repeat analysis from matrix to clustering

Identify decision-making points

Appraise data outputs and make informed decisions

Explain why single cell analysis is an iterative (i.e. the first plots you generate are not final, but rather you go back and re-analyse your data repeatedly) process

Requirements:

Introduction to Galaxy Analyses

tutorial Hands-on: Generating a single cell matrix using Alevin

tutorial Hands-on: Combining single cell datasets after pre-processing

Time estimation: 3 hours

Supporting Materials:

Datasets

Workflows

galaxy-history-input Input Histories

UseGalaxy.eu

help How to Use This

galaxy-history-answer Answer Histories

UseGalaxy.eu_Answer_Key_History
2024-03-26

UseGalaxy.eu-ARCHIVEDv2
2023-10-10

UseGalaxy.eu-ARCHIVEDv1
2024-12-10

Older_version
2024-02-28

help How to Use This

FAQs

video Recordings

video Tutorial (February 2021) - 30m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

Published: Mar 24, 2021

Last modification: Apr 23, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00247

rating Rating: 4.3 (3 recent ratings, 8 all time)

version Revision: 24

You’ve done all the work to make a single cell matrix, with mitochondrial genes flagge and buckets of cell metadata from all your variables of interest. Now it’s time to fully process our data, to remove low quality cells, to reduce the many dimensions of the data that make it difficult to work with, and ultimately to try to define our clusters and to find our biological meaning and insights! There are many packages for analysing single cell data - Seurat Satija et al. 2015, Scanpy Wolf et al. 2018, Monocle Trapnell et al. 2014, Scater McCarthy et al. 2017, and so forth. We’re working with Scanpy, although Galaxy has training using other packages, which you can explore on our level Single-cell training topic.

You are in one of the four tutorials associated with a Case Study, which replicates and expands on the analysis performed in a manuscript Bacon et al. 2018.

Open image in new tab

Figure 1: Overview of the four parts of the case study, represented by boxes connected with noodles. There is a signpost specifying that you are currently in the first part.

Agenda

In this tutorial, we will cover:

Get data

Important tips for easier analysis

Quality control

Calculate QC Metrics

Inspect the AnnData Object

Generate QC Plots

Interpret the plots

Apply the thresholds

Processing

Preparing coordinates

Principal components

Neighborhood graph

Dimensionality reduction for visualisation

Cell clusters & gene markers

Find Gene Markers

Plotting!

Insights into the beyond

Biological Interpretation

Technical Assessment

Interactive visualisations

Conclusion

Get data

We’ve provided you with experimental data to analyse from a mouse dataset of fetal growth restriction Bacon et al. 2018.

We are particularly keen for learners to be able to go from raw FASTQ files all the way through analysis. We aren’t handing you a curated dataset that we specially modified in order for this tutorial to work. Instead, this tutorial’s input dataset is the full dataset generated from the level previous tutorial in this case study.

The only difference is that in that previous tutorial, we downsampled the datasets in so the tutorial would run faster. Our input data, however, now uses the full dataset, analysed exactly the same was as in the tutorial.

You can also find this dataset in the study in Single Cell Expression Atlas and the project submission.

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

If you’re on the EU server, (if your usegalaxy has an EU anywhere in the URL), then the quickest way to Get the Data for this tutorial is via importing a history. Otherwise, you can also import from Zenodo - it just might take a moment longer if you’re in a live course and everyone is importing the same dataset at the same time! The SCXA is specifically for learners who are focusing on Reusing public data, so are not beginners.

Import History on EU server Zenodo SCXA

Hands On: Import History from EU server

Import the galaxy-history-input Input history by following the link below

UseGalaxy.eu Input History

Open the link to the shared history

Click on the Import this history button on the top left

Enter a title for the new history

Click on Copy History

Hands On: Import from Zenodo
Create a new history for this tutorial
Import the AnnData object from Zenodo
https://zenodo.org/record/15092468/files/Batched_Object
Copy the link location

Click galaxy-upload Upload Data at the top of the tool panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename galaxy-pencil the datasets QC_Object

Check that the datatype is h5ad

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select h5ad from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

You can also pull the data from publicly available Single Cell Expression Atlas.

Hands On: Import from the EBI Single Cell Expression Atlas

EBI SCXA Data Retrieval ( Galaxy version v0.0.2+galaxy2) with the following parameters:

SC-Atlas experiment accession: E-MTAB-6945

levelFollow tutorial to reformat dataset: This short tutorial will show you how to use this tool and modify the output so that it’s compatible with this tutorial and its workflow.

Important tips for easier analysis

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

Open your Galaxy server

Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.

Navigate to your tutorial

Tool names in tutorials will be blue buttons that open the correct tool for you

Note: this does not work for all tutorials (yet)

You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface

Warning: Not all browsers work!

We’ve had some issues with Tutorial mode on Safari for Mac users.

Try a different browser if you aren’t seeing the button.

Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.

The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.

Try it out!

subdomain Europe: Single Cell Omics Lab

subdomain USA: Single Cell Omics Lab

subdomain Australia: Single Cell Omics Lab

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

Expand the red history dataset by clicking on it.

Sometimes you can already see an error message here

View the error message by clicking on the bug icon galaxy-bug

Check the logs. Output (stdout) and error logs (stderr) of the tool are available:

Expand the history item

Click on the details icon

Scroll down to the Job Information section to view the 2 logs:

Tool Standard Output

Tool Standard Error

For more information about specific tool errors, please see the Troubleshooting section

Submit a bug report! If you are still unsure what the problem is.

Click on the bug icon galaxy-bug

Write down any information you think might help solve the problem

See this FAQ on how to write good bug reports

Click galaxy-bug Report button

Ask for help!

Where?

In the User community chatspace in Slack in our #single-cell-users channel

In the GTN Matrix Channel

In the Galaxy Matrix Channel

Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).

When asking for help, it is useful to share a link to your history

If you get stuck, you can first check your history against an galaxy-history-answer Answer Key history found in the header of (some) tutorials.

First, import the target history.

Open the link to the shared history

Click on the Import this history button on the top left

Enter a title for the new history

Click on Copy History

Next, compare the answer key history with your own history.

You can view multiple Galaxy histories at once. This allows to better understand your analyses and also makes it possible to drag datasets between histories. This is called “History multiview”. The multiview can be enabled either view History menu or via the Activity Bar:

Enabling Multiview via History menu is done by first clicking on the galaxy-history-options “History options” drop-down and selecting galaxy-multihistory “Show Histories Side-by-Side option”:

Clicking the galaxy-multihistory “History Multiview” button within the Activity Bar:

You can compare there, or if you’re really stuck, you can also click and drag a given dataset to your history to continue the tutorial from there.

There 3 ways to copy datasets between histories

From the original history

Click on the galaxy-gear icon which is on the top of the list of datasets in the history panel

Click on Copy Datasets

Select the desired files

Give a relevant name to the “New history”

Validate by ‘Copy History Items’

Click on the new history name in the green box that have just appear to switch to this history

Using the galaxy-columns Show Histories Side-by-Side

Click on the galaxy-dropdown dropdown arrow top right of the history panel (History options)

Click on galaxy-columns Show Histories Side-by-Side

If your target history is not present

Click on ‘Select histories’

Click on your target history

Validate by ‘Change Selected’

Drag the dataset to copy from its original history

Drop it in the target history

From the target history

Click on User in the top bar

Click on Datasets

Search for the dataset to copy

Click on its name

Click on Copy to current History

You can also use our handy troubleshooting guide.

When something goes wrong in Galaxy, there are a number of things you can do to find out what it was. Error messages can help you figure out whether it was a problem with one of the settings of the tool, or with the input data, or maybe there is a bug in the tool itself and the problem should be reported. Below are the steps you can follow to troubleshoot your Galaxy errors.

Expand the red history dataset by clicking on it.

Sometimes you can already see an error message here

View the error message by clicking on the bug icon galaxy-bug

Check the logs. Output (stdout) and error logs (stderr) of the tool are available:

Expand the history item

Click on the details icon

Scroll down to the Job Information section to view the 2 logs:

Tool Standard Output

Tool Standard Error

For more information about specific tool errors, please see the Troubleshooting section

Submit a bug report! If you are still unsure what the problem is.

Click on the bug icon galaxy-bug

Write down any information you think might help solve the problem

See this FAQ on how to write good bug reports

Click galaxy-bug Report button

Ask for help!

Where?

In the GTN Matrix Channel

In the Galaxy Matrix Channel

Browse the Galaxy Help Forum to see if others have encountered the same problem before (or post your question).

When asking for help, it is useful to share a link to your history

Quality control

You have generated an annotated AnnData object from your raw scRNA-seq fastq files. However, you have only completed a ‘rough’ filter of your dataset - there will still be a number of ‘cells’ that are actually just background from empty droplets or simply low-quality. There will also be genes that could be sequencing artifacts or that appear with such low frequency that statistical tools will fail to analyse them. This background garbage of both cells and genes not only makes it harder to distinguish real biological information from the noise, but also makes it computationally heavy to analyse. These spurious reads take a lot of computational power to analyse! First on our agenda is to filter this matrix to give us cleaner data to extract meaningful insight from, and to allow faster analysis.

Calculate QC Metrics

To filter the object, we need to calculate some metrics for each cell and gene.

Hands On: Compute QC metrics

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Batched_Object

“Method used for inspecting”: Calculate quality control metrics, using 'pp.calculate_qc_metrics'

“Name of kind of values in X”: counts

“The kind of thing the variables are”: genes

“Keys for boolean columns of .var which identify variables you could want to control for”: mito

Rename the generated file QC_Object

Inspect the AnnData Object

What has this tool calculated?

Question

What information is stored in your AnnData object? For example, the last tool to generate this object counted the mitochondrial associated genes in your matrix. Where is that data stored?

While you are figuring that out, how many genes and cells are in your object?

You want to use the same tool you used in the previous tutorial to examine your AnnData. Sometimes you can get the answers from peeking at your param-file AnnData object in your galaxy-history history, but sometimes it’s not quite that simple!

Hands On: Inspecting AnnData Objects

Inspect AnnData ( Galaxy version 0.10.9+galaxy1) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“What to inspect?”: Key-indexed observations annotation (obs)

Inspect AnnData ( Galaxy version 0.10.9+galaxy1) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“What to inspect?”: Key-indexed annotation of variables/features (var)

If you examine your AnnData object, you’ll find a number of different quality control metrics for:

cells, found in the param-file Key-index observations annotation (obs) output dataset

For example, you can find both discrete and log-based metrics for n_genes (how many genes are counted in a given cell), and n_counts (how many UMIs are counted in a given cell). This distinction between counts/UMIs or genes is because you might count multiple GAPDHs in a single cell. This would be 1 gene but multiple counts, therefore your n_counts should be higher than n_genes for an individual cell.

But what about the mitochondria?? You can also find total_counts_mito, log1p_total_counts_mito, and pct_counts_mito, which has been calculated for each cell. - and genes, found in the param-file Key-index observations variables/features (var) output dataset.

For example, you can find n_cells_by_counts (number of cells that gene appears in).

There are 31670 cells and 35734 genes in the matrix.

You can peek at your param-file Anndata Object in your galaxy-history history by selecting it to reveal a drop-down window that has this same information in it.

The matrix is 31670 x 35734. This is n_obs x n_vars, or rather, cells x genes.

Generate QC Plots

We want to filter our cells, but first we need to know what our data looks like. There are a number of subjective choices to make within scRNA-seq analysis, for instance we now need to make our best informed decisions about where to set our thresholds (more on that soon!). We’re going to plot our data a few different ways. Different bioinformaticians might prefer to see the data in different ways, and here we are only generating some of the myriad of plots you can use. Ultimately you need to go with what makes the most sense to you.

Hands On: Making QC visualisations - Violin Plots

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: genotype

Rename galaxy-pencil output Violin_log_genotype

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: sex

Rename galaxy-pencil output Violin_log_sex

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: batch

Rename galaxy-pencil output Violin_log_batch

Hands On: Making QC visualisations - Scatterplots

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Scatter plot along observations or variables axes, using 'pl.scatter'

“Plotting tool that computed coordinates”: Using coordinates

“x coordinate”: log1p_total_counts

“y coordinate”: pct_counts_mito

Rename galaxy-pencil output Scatter_UMIxMito

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Scatter plot along observations or variables axes, using 'pl.scatter'

“Plotting tool that computed coordinates”: Using coordinates

“x coordinate”: log1p_n_genes_by_counts

“y coordinate”: pct_counts_mito

Rename galaxy-pencil output Scatter_GenesxMito

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for plotting”: Generic: Scatter plot along observations or variables axes, using 'pl.scatter'

“Plotting tool that computed coordinates”: Using coordinates

“x coordinate”: log1p_n_genes_by_counts

“y coordinate”: log1p_total_counts

“Color by”: pct_counts_mito

Rename galaxy-pencil output Scatter_GenesxUMI

Interpret the plots

That’s a lot of information! Let’s attack this in sections and see what questions these plots can help us answer.

If you would like to view two or more datasets at once, you can use the Window Manager feature in Galaxy:

Click on the Window Manager icon galaxy-scratchbook on the top menu bar.

You should see a little checkmark on the icon now

View galaxy-eye a dataset by clicking on the eye icon galaxy-eye to view the output

You should see the output in a window overlayed over Galaxy

You can resize this window by dragging the bottom-right corner

Click outside the file to exit the Window Manager

View galaxy-eye a second dataset from your history

You should now see a second window with the new dataset

This makes it easier to compare the two outputs

Repeat this for as many files as you would like to compare

You can turn off the Window Manager galaxy-scratchbook by clicking on the icon again

Question: Batch Variation

Are there differences in sequencing depth across the samples?

Which plot(s) addresses this?

How do you interpret it?

The plot Violin_Log_Batch will have what you’re looking for!
Open image in new tab

Figure 2: Violin_Log_Batch (Raw)

Keeping in mind that this is a log scale - which means that small differences can mean large differences - the violin plots probably look pretty similar.

N703 and N707 might be a bit lower on genes and counts (or UMIs), but the differences aren’t catastrophic.

The pct_counts_mito looks pretty similar across the batches, so this also looks good.

Nothing here would cause us to eliminate a sample from our analysis, but if you see a sample looking completely different from the rest, you would need to question why that is and consider eliminating it from your experiment!

Question: Biological Variables

Are there differences in sequencing depth across sex? Genotype?

Which plot(s) addresses this?

How do you interpret the sex differences?

How do you interpret the genotype differences?

Similar to above, the plots violin - sex - log and violin - genotype - log will have what you’re looking for!
Open image in new tab

Figure 3: Violin_Log_Sex (Raw)

Open image in new tab

Figure 4: Violin_Log_Genotype (Raw)

There isn’t a major difference in sequencing depth across sex, I would say - though you are welcome to disagree!

It is clear there are far fewer female cells, which makes sense given that only one sample was female. Note - that was an unfortunate discovery made long after generating libraries. It’s quite hard to identify the sex of a neonate in the lab! In practice, try hard to not let such a confounding factor into your data! You could consider re-running all the following analysis without that female sample, if you wish.

In Violin - genotype - log, however, we can see there is a difference. The knockout samples clearly have fewer genes and counts. From an experimental point of view, we can consider, does this make sense?

Would we biologically expect that those cells would be smaller or having fewer transcripts? Possibly, in this case, given that these cells were generated by growth restricted neonatal mice, and in which case we don’t need to worry about our good data, but rather keep this in mind when generating clusters, as we don’t want depth to define clusters, we want biology to!

On the other hand, it may be that those cells didn’t survive dissociation as well as the healthy ones (in which case we’d expect higher mitochondrial-associated genes, which we don’t see, so we can rule that out!).

Maybe we unluckily poorly prepared libraries for specifically those knockout samples. There are only three, so maybe those samples are under-sequenced.

So what do we do about all of this?

Ideally, we consider re-sequencing all the samples but with a higher concentration of the knockout samples in the library. Any bioinformatician will tell you that the best way to get clean data is in the lab, not the computer! Sadly, absolute best practice isn’t necessarily always a realistic option in the lab - for instance, that mouse line was long gone! - so sometimes, we have to make the best of it. There are options to try and address such discrepancy in sequencing depth. Thus, we’re going to take these samples forward and see if we can find biological insight despite the technical differences.

Now that we’ve assessed the differences in our samples, we will look at the libraries overall to identify appropriate thresholds for our analysis.

Question: Filter Thresholds: genes

What threshold should you set for log1p_n_genes_by_counts?

Which plot(s) addresses this?

What number would you pick?

Any plot with log1p_n_genes_by_counts would do here, actually! Some people prefer scatterplots to violins.

Open image in new tab

Figure 5: Scatter_GenesxMito (Raw)

In Scatter - mito x genes you can see how cells with log1p_n_genes_by_counts up to around, perhaps, 5.7 (around 300 genes) often have high pct_counts_mito.

You can plot this as just n_counts and see this same trend at around 300 genes, but with this data the log format is clearer so that’s how we’re presenting it.

You could also use the violin plots to come up with the threshold, and thus also take batch into account. It’s good to look at the violins as well, because you don’t want to accidentally cut out an entire sample (i.e. N703 and N707).

Some bioinformaticians would recommend filtering each sample individually, but this is difficult in larger scale and in this case (you’re welcome to give it a go! You’d have to filter separately and then concatenate), it won’t make a notable difference in the final interpretation.

Question: Filter Thresholds: UMIs

What threshold should you set for log1p_total_counts?

Which plot(s) addresses this?

What number would you pick?

As before, any plot with log1p_total_counts will do! Again, we’ll use a scatterplot here, but you can use a violin plot if you wish!

Open image in new tab

Figure 6: Scatter_UMIxMito (Raw)

We can see that we will need to set a higher threshold (which makes sense, as you’d expect more UMI’s per cell rather than unique genes!). Again, perhaps being a bit aggressive in our threshold, we might choose 6.3, for instance (which amounts to around 500 counts/cell).

In an ideal world, you’ll see a clear population of real cells separated from a clear population of debris. Many samples, like this one, are under-sequenced, and such separation would likely be seen after deeper sequencing!

Question: Filter Thresholds: mito

What threshold should you set for pct_counts_mito?

Which plot(s) addresses this?

What number would you pick?

Any plot with pct_counts_mito would do here, however the scatterplots are likely the easiest to interpret. We’ll use the same as last time.

Open image in new tab

Figure 7: Scatter_UMIxMito (Raw)

We can see a clear trend wherein cells that have around 5% mito counts or higher also have far fewer total counts. These cells are low quality, will muddy our data, and are likely stressed or ruptured prior to encapsulation in a droplet. While 5% is quite a common cut-off, this is quite messy data, so just for kicks we’ll go more aggressive with a 4.5%.

In general, you must adapt all cut-offs to your data - metabolically active cells might have higher mitochondrial RNA in general, and you don’t want to lose a cell population because of a cut-off.

Apply the thresholds

It’s now time to apply these thresholds to our data! First, a reminder of how many cells and genes are in your object: 31670 cells and 35734 genes. Let’s see how that changes each time!

If you are working in a group, you can now divide up a decision here with one control and the rest varied numbers so that you can compare results throughout the tutorials.

Control

log1p_n_genes_by_counts > 5.7

log1p_total_counts > 6.3

pct_counts_mito < 4.5%

Everyone else: Choose your own thresholds and compare results!

Genes/cell

Hands On: Filter cells by log1p_n_genes_by_counts

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: QC_Object

“Method used for filtering”: Filter on any column of observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: log1p_n_genes_by_counts

“Type of value to filter”: Number

“Filter”: greater than

“Value”: 5.7

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy filter tool)

“Method used for filtering”: Filter on any column of observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: log1p_n_genes_by_counts

“Type of value to filter”: Number

“Filter”: less than

“Value”: 20.0

Rename galaxy-pencil output as Genes_Filtered_Object

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Genes_Filtered_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: genotype

Rename galaxy-pencil output Violin_log_genotype-Genes

Question

Interpret the violin plot

How many genes & cells do you have in your object now?

Open image in new tab

Figure 8: Raw

Open image in new tab

Figure 9: 1st filter - genes/cell

The only part that seems to change is the log1p_n_genes_by_counts. You can see a flatter bottom to the violin plot - this is the lower threshold set. Ideally, this would create a beautiful violin plot because there would be a clear population of low-gene number cells. Sadly not the case here, but still a reasonable filter.

If you peek at the AnnData object in your galaxy-history, you will find that you now have 17,104 cells x 35,734 genes.

UMIs/cell

Hands On: Filter cells by log1p_total_counts

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Genes_Filtered_Object

“Method used for filtering”: Filter on any column of observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: log1p_total_counts

“Type of value to filter”: Number

“Filter”: greater than

“Value”: 6.3

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy filter tool)

“Method used for filtering”: Filter on any column of observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: log1p_total_counts

“Type of value to filter”: Number

“Filter”: less than

“Value”: 20.0

Rename galaxy-pencil output as UMIs_Filtered_Object

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: UMIs_Filtered_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: genotype

Rename galaxy-pencil output Violin_log_genotype-UMIs

Question

Interpret the violin plot

How many genes & cells do you have in your object now?

Open image in new tab

Figure 10: 1st filter - genes/cell

Open image in new tab

Figure 11: 2nd filter - UMIs/cell

We will focus on the log1p_total_counts as that shows the biggest change. Similar to above, the bottom of the violin shape has flattered due to the threshold.

You now have 8,677 cells x 35,734 genes in the AnnData object.

% Mito/cell

Hands On: Filter cells by pct_counts_mito

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: UMIs_Filtered_Object

“Method used for filtering”: Filter on any column of observations or variables

“What to filter?”: Observations (obs)

“Type of filtering?”: By key (column) values

“Key to filter”: pct_counts_mito

“Type of value to filter”: Number

“Filter”: less than

“Value”: 4.5

Rename galaxy-pencil output as Mito_Filtered_Object

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Mito_Filtered_Object

“Method used for plotting”: Generic: Violin plot, using 'pl.violin'

“Keys for accessing variables”: Subset of variables in 'adata.var_names' or fields of '.obs'

“Keys for accessing variables”: log1p_total_counts,log1p_n_genes_by_counts,pct_counts_mito

“The key of the observation grouping to consider”: genotype

Rename galaxy-pencil output Violin_log_genotype-Mito

Question

Interpret the violin plot

How many genes & cells do you have in your object now?

Open image in new tab

Figure 12: 2nd filter - UMIs/cell

Open image in new tab

Figure 13: 3rd filter - % count Mito/cell

If we carefully check the axes, we can see that the pct_counts_mito has shrunk.

Your object now has 8,604 cells x 35,734 genes.

Open image in new tab

Figure 14: Raw

Open image in new tab

Figure 15: 1st filter - genes/cell

Open image in new tab

Figure 16: 2nd filter - UMIs/cell

Open image in new tab

Figure 17: 3rd filter - % count Mito/cell

Fantastic work! However, you’ve now removed a whole heap of cells, and since the captured genes are sporadic (i.e. a small percentage of the overall transcriptome per cell) this means there are a number of genes in your matrix that are currently not in any of the remaining cells. Genes that do not appear in any cell, or even in only 1 or 2 cells, will make some analytical tools break and overall will not be biologically informative. So let’s remove them! Note that 3 is not necessarily the best number, rather it is a fairly conservative threshold. You could go as high as 10 or more.

If you are working in a group, you can now divide up a decision here with one control and the rest varied numbers so that you can compare results throughout the tutorials.

Variable: n_cells

Control > 3

Everyone else: Choose your own thresholds and compare results! Note if you go less than 3 (or even remove this step entirely), future tools are likely to fail due to empty gene data.

Cells/gene

Hands On: Filter genes

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: Mito-filtered Object

“Method used for filtering”: Filter on any column of observations or variables

“Type of filtering?”: By key (column) values

“Key to filter”: n_cells_by_counts

“Type of value to filter”: Number

“Filter”: greater than

“Value”: 3.0

Rename galaxy-pencil output as Cells_Filtered_Object

In practice, you’ll likely choose your thresholds then set up all these filters to run without checking plots in between each one. But it’s nice to see how they work!

We can summarise the results of our filtering:

	Cells	Genes
Raw	31670	35734
Filter genes/cell	17104	35734
Filter UMIs/cell	8677	35734
Filter mito/cell	8604	35734
Filter cells/gene	8604	15950

congratulations Congratulations! You have filtered your object! Now it should be a lot faster to analyse and easier to interpret.

Processing

So currently, you have a matrix that is 8604 cells by 15950 genes. This is still quite big data. We have two issues here - firstly, you already know there are differences in how many transcripts and genes have been counted per cell. This technical variable can obscure biological differences. Secondly, we like to plot things on x/y plots, so for instance Gapdh could be on one axis, and Actin can be on another, and you plot cells on that 2-dimensional axis based on how many of each transcript they possess. While that would be fine, adding in a 3rd dimension (or, indeed, in this case, 15950 more dimensions), is a bit trickier! So our next steps are to transform our big data object into something that is easy to analyse and easy to visualise.

Hands On: Normalisation

Scanpy normalize ( Galaxy version 1.10.2+galaxy0) with the following parameters:

param-file “Annotated data matrix”: Cells_Filtered_Object

“Method used for normalization”: Normalize counts per cell, using 'pp.normalize_total'

“Target sum”: 10000.0

“Exclude (very) highly expressed genes for the computation of the normalization factor (size factor) for each cell”: No

“Name of the field in ‘adata.obs’ where the normalization factor is stored”: norm

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy normalize tool)

“Method used for inspecting”: Logarithmize the data matrix, using 'pp.log1p'

Manipulate AnnData ( Galaxy version 0.10.9+galaxy1) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy Inspect and manipulate tool)

“Function to manipulate the object”: Freeze the current state into the 'raw' attribute

Normalisation helps reduce the differences between gene and UMI counts by fitting total counts to 10,000 per cell. The subsequent log-transform (by log(count+1)) aligns the gene expression level better with a normal distribution. This is fairly standard to prepare for any future dimensionality reduction. Finally, we freeze this information in the ‘raw’ attribute before we further manipulate the values.

We next need to look at reducing our gene dimensions. We have loads of genes, but not all of them are different from cell to cell. For instance, housekeeping genes are defined as not changing much from cell to cell, so we could remove these from our data to simplify the dataset. We will flag genes that vary across the cells for future analysis.

Hands On: Find variable genes

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)

“Method used for filtering”: Annotate (and filter) highly variable genes, using 'pp.highly_variable_genes'

“Choose the flavor for identifying highly variable genes”: Seurat

Would you like to know how many genes were flagged as Highly variable genes?

Hands On: Find the number of variable genes

Select galaxy-refresh Run Job Again on the anndata_out (output of Scanpy filter tool) in your galaxy-history history

Scanpy filter ( Galaxy version 1.10.2+galaxy3) with the following parameters:

param-file “Annotated data matrix”: anndata (output of Manipulate AnnData tool)

“Method used for filtering”: Annotate (and filter) highly variable genes, using 'pp.highly_variable_genes'

“Choose the flavor for identifying highly variable genes”: Seurat

“Inplace subset to highly-variable genes?”: galaxy-toggle Yes

If you peek at the output, you will see that the number of genes in your AnnData object has drastically reduced to around 3216 - this dataset has only the highly variable genes! Some people prefer to only perform analysis on this dataset, however I have found that sometimes (for various reasons) important biological marker genes get excluded. For this reason, I personally will flag highly variable genes for use in the next analytical steps, however I keep all the genes in my AnnData object so that I can check for key ones in the future.

warning For this tutorial, you must keep all genes in your AnnData object. Therefore, delete the output that contains only the highly variable genes from your galaxy-history history now.

Next up, we’re going to scale our data so that all genes have the same variance and a zero mean. This is important to set up our data for further dimensionality reduction. It also helps negate sequencing depth differences between samples, since the gene levels across the cells become comparable. Note, that the differences from scaling etc. are not the values you have at the end - i.e. if your cell has average GAPDH levels, it will not appear as a ‘0’ when you calculate gene differences between clusters.

Hands On: Scale data

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy filter tool)

“Method used for inspecting”: Scale data to unit variance and zero mean, using 'pp.scale'

“Maximum value”: 10.0

Rename galaxy-pencil output Scaled_Object

congratulations Congratulations! You have processed your object!

At this point, you might want to remove or regress out the effects of unwanted variation on our data. A common example of this is the cell cycle, which can affect which genes are expressed and how much material is present in our cells. If you’re interested in learning how to do this, then you can move over to the level Removing the Effects of the Cell Cycle tutorial now and return here to complete your analysis.

warning If you are in a live course, the time to do this bonus tutorial will not be factored into the schedule. Please instead return to this after your course is finished, or if you finish early!

Preparing coordinates

We still have too many dimensions. Transcript changes are not usually singular - which is to say, genes were in pathways and in groups. It would be easier to analyse our data if we could more easily group these changes.

Principal components

Principal components are calculated from highly dimensional data to find the most spread in the dataset. Given that our object has around 3216 highly variable genes, that’s 3216 dimensions. There will, however, be one line/axis/dimension that yields the most spread and variation across the cells. That will be our first principal component. We can calculate the first x principal components in our data to drastically reduce the number of dimensions.

Warning: Check the size of your AnnData object!

Your AnnData object should have far more than 3216 genes in it (if you followed our settings and tool versions, you’ll have a matrix around 8604 × 15950 (cells x genes). If you followed the More details on the Highly Variable Genes above, you may have created an object with *only the highly variable genes. Please delete this object and do not use it! Carry forward the object with around 8604 × 15950 (cells x genes)!

Hands On: Calculate Principal Components

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Scaled_Object

“Method used”: Computes PCA (principal component analysis) coordinates, loadings and variance decomposition, using 'pp.pca'

“Type of PCA?”: Full PCA

“Change to use different initial states for the optimization”: 1

Rename galaxy-pencil output PCA_Object

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: PCA_Object

“Method used for plotting”: PCA: Scatter plot in PCA coordinates, using 'pl.pca_variance_ratio'

“Number of PCs to show”: 50

Rename galaxy-pencil plot output PCA_Variance_Plot

Why 50 principal components you ask? Well, we’re pretty confident 50 is an over-estimate. Examine PCA Variance.

We can see that there is really not much variation explained past component 19. So we might save ourselves a great deal of time and muddied data by focusing on the top 20 PCs. (You could probably even go as low as 10!)

Neighborhood graph

We’re still looking at around 20 dimensions at this point in our analysis. We need to identify how similar a cell is to another cell, across every cell across these dimensions. For this, we will use the k-nearest neighbor (kNN) graph, to identify which cells are close together and which are not. The kNN graph plots connections between cells if their distance (when plotted in this 20 dimensional space!) is amongst the k-th smallest distances from that cell to other cells. This will be crucial for identifying clusters, and is necessary for plotting a UMAP. From UMAP developers: “Larger neighbor values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default”.

If you are working in a group, you can now divide up a decision here with one control and the rest varied numbers so that you can compare results throughout the tutorials.

Control

Number of PCs to use = 20

Maximum number of neighbours used = 15

Everyone else: Use the PC variance plot to pick your own PC number, and choose your own neighbour maximum as well!b

Hands On: ComputeGraph

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: PCA_Object

“Method used for inspecting”: Compute a neighborhood graph of observations, using 'pp.neighbors'

“Number of PCs to use”: 20

“Use the indicated representation”: X_pca

“Numpy random seed”: 1

Rename galaxy-pencil output Neighbours_Object

Dimensionality reduction for visualisation

Two major visualisations for this data are tSNE and UMAP. We must calculate the coordinates for both prior to visualisation. For tSNE, the parameter perplexity can be changed to best represent the data, while for UMAP the main change would be to change the kNN graph above itself, by changing the neighbours.

If you are working in a group, you can now divide up a decision here with one control and the rest varied numbers so that you can compare results throughout the tutorials.

Control

Perplexity = 30

Everyone else: Choose your own perplexity, between 5 and 50!

Hands On: Calculating tSNE & UMAP

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Neighbours_Object

“Method used”: t-distributed stochastic neighborhood embedding (tSNE), using 'tl.tsne'

“Number of PCs to use”: 20

“Use the indicated representation”: X_pca

“Random state”: 1

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: anndata_out (output of Scanpy cluster, embed tool)

“Method used”: Embed the neighborhood graph using UMAP, using 'tl.umap'

“Seed used by the random number generator”: 1

Rename galaxy-pencil output UMAP_Object

congratulations Congratulations! You have prepared your object and created neighborhood coordinates. We can now use those to call some clusters!

Cell clusters & gene markers

Question

Let’s take a step back here. What is it, exactly, that you are trying to get from your data? What do you want to visualise, and what information do you need from your data to gain insight?

Really we need two things - firstly, we need to make sure our experiment was set up well. This is to say, our biological replicates should overlap and our variables should, ideally, show some difference. Secondly, we want insight - we want to know which cell types are in our data, which genes drive those cell types, and in this case, how they might be affected by our biological variable of growth restriction. How does this affect the developing cells, and what genes drive this? So let’s add in information about cell clusters and gene markers!

Finally, let’s identify clusters! Unfortunately, it’s not as majestic as biologists often think - the maths doesn’t necessarily identify true cell clusters. Every algorithm for identifying cell clusters falls short of a biologist knowing their data, knowing what cells should be there, and proving it in the lab. Sigh. So, we’re going to make the best of it as a starting point and see what happens! We will define clusters from the kNN graph, based on how many connections cells have with one another. Roughly, this will depend on a resolution parameter for how granular you want to be.

Oh yes, yet another decision! Single cell analysis is sadly not straight forward.

Control

Resolution, high value for more and smaller clusters = 0.5

Clustering algorithm = Louvain

Everyone else: Pick your own number. If it helps, this sample should have a lot of very similar cells in it. It contains developing T-cells, so you aren’t expecting massive differences between cells, like you would in, say, an entire embryo, with all sorts of unrelated cell types.

Everyone else: Consider the newer Leiden clustering method. Note that in future parameters, you will likely need to specify ‘leiden’ rather than ‘louvain’, which is the default, if you choose this clustering method.

Hands On: FindClusters

Scanpy cluster, embed ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: UMAP_Object

“Method used”: Cluster cells into subgroups, using 'tl.louvain'

“Flavor for the clustering”: vtraag (much more powerful than igraph)

“Resolution”: 0.5

“Random state”: 1

Rename galaxy-pencil output Clustered_Object

Nearly plotting time! But one final piece is to add in SOME gene information. Let’s focus on genes that distinguish the clusters.

Find Gene Markers

Hands On: Find Gene markers for each cluster

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: Clustered_Object

“Method used for inspecting”: Rank genes for characterizing groups, using 'tl.rank_genes_groups'

“Get ranked genes as a Tabular file?”: True

“Column name in [.var] DataFrame that stores gene symbols.”: Symbol

“The key of the observations grouping to consider”: louvain

“Use ‘raw’ attribute of input if present”: Yes

“Comparison”: Compare each group to the union of the rest of the group

“The number of genes that appear in the returned tables”: 100

“Method”: t-test with overestimate of variance of each group

Rename galaxy-pencil output table (not h5ad) Ranked_Genes-by_cluster

Rename galaxy-pencil output h5ad file DEG_Object

Given that we are also interested in differences across genotype, we can also use the find markers function to check that (or any other Obs metadata)… roughly.

Hands On: Comparing across genotypes

Scanpy Inspect and manipulate ( Galaxy version 1.10.2+galaxy1) with the following parameters:

param-file “Annotated data matrix”: Clustered_Object

“Method used for inspecting”: Rank genes for characterizing groups, using 'tl.rank_genes_groups'

“Get ranked genes as a Tabular file?”: True

“Column name in [.var] DataFrame that stores gene symbols.”: Symbol

“The key of the observations grouping to consider”: genotype

“Use ‘raw’ attribute of input if present”: Yes

“Comparison”: Compare each group to the union of the rest of the group

“The number of genes that appear in the returned tables”: 100

“Method”: t-test with overestimate of variance of each group

Rename galaxy-pencil output table (not h5ad) Ranked_Genes-by_Genotype

Do not Rename the output AnnData object (in fact, you can delete it!). You have the genotype marker table to enjoy, but we want to keep the cluster comparisons, rather than gene comparisons, stored in the AnnData object for later.

However, this analysis only give you a rough idea. It is more statistically accurate to convert each cluster into a pseudobulk sample, and analyse those. You can find more details about that in our level pseudobulk tutorial

warning If you are in a live course, the time to do this bonus tutorial will not be factored into the schedule. Please instead return to this after your course is finished, or if you finish early!

congratulations Well done! You have cool tables of genes. It’s now time for the best bit, the plotting!

Plotting!

It’s time! Let’s plot it all! But first, let’s pick some known marker genes that can distinguish different cell types. I’ll be honest, in practice, you’d now be spending a lot of time looking up what each gene does (thank you google!). There are burgeoning automated-annotation tools, however, so long as you have a good reference (a well annotated dataset that you’ll use as the ideal). In the mean time, let’s do this the old-fashioned way, and just copy a bunch of the markers in the original paper.

Hands On: Plot the cells!

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: DEG_Object

“Method used for plotting”: Embeddings: Scatter plot in tSNE basis, using 'pl.tsne'

“Keys for annotations of observations/cells or variables/genes”: louvain,sex,batch,genotype,Il2ra,Cd8b1,Cd8a,Cd4,Itm2a,Aif1,log1p_total_counts

“Key for field in ‘.var’ that stores gene symbols”: Symbol

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data atrix”: DEG_Object

“Method used for plotting”: PCA: Scatter plot in PCA coordinates, using 'pl.pca'

“Keys for annotations of observations/cells or variables/genes”: louvain,sex,batch,genotype,Il2ra,Cd8b1,Cd8a,Cd4,Itm2a,Aif1,log1p_total_counts

“Key for field in ‘.var’ that stores gene symbols”: Symbol

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: DEG_Object

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: louvain,sex,batch,genotype,Il2ra,Cd8b1,Cd8a,Cd4,Itm2a,Aif1,log1p_total_counts

“Key for field in ‘.var’ that stores gene symbols”: Symbol

congratulations Congratulations! You now have plots galore!

Insights into the beyond

Now it’s the fun bit! We can see where genes are expressed, and start considering and interpreting the biology of it. At this point, it’s really about what information you want to get from your data - the following is only the tip of the iceberg. However, a brief exploration is good, because it may help give you ideas going forward with for your own data. Let us start interrogating our data!

Warning: Your results may look different!

These tools rely on machine learning, which involves randomisation. While we have used the options to ‘set random seed’ to 1 where we can, it’s not perfect at ensuring every analysis is identical.

Your results may look different.

Your clusters may be in different orders.

You will have to adjust your annotation and interpretation accordingly…which is exactly what scientists have to do!

Biological Interpretation

Question: Appearance is everything

Which visualisation is the most useful for getting an overview of our data, pca, tsne, or umap?

Open image in new tab

Figure 19: PCA Plots

Open image in new tab

Figure 20: tSNE Plots

Open image in new tab

Figure 21: UMAP Plots

You can see why a PCA is generally not enough to see clusters in samples - keep in mind, you’re only seeing components 1 and 2! - and therefore why the tSNE and UMAP visualisation dimensionality reductions are so useful. But there is not necessarily a clear winner between tSNE and UMAP, but I think UMAP is slightly clearer with its clusters, so we’ll stick with that for the rest of the analysis.

Note that the cluster numbering is based on size alone - clusters 0 and 1 are not necessarily related, they are just the clusters containing the most cells. It would be nice to know what exactly these cells are. This analysis (googling all of the marker genes, both checking where the ones you know are as well as going through the marker tables you generated!) is a fun task for any individual experiment, so we’re going to speed past that and nab the assessment from the original paper!

Clusters	Marker	Cell type
3	Il2ra	Double negative (early T-cell)
0,1,4	Cd8b1, Cd8a, Cd4	Double positive (middle T-cell)
5	Cd8b1, Cd8a, Cd4 - high	Double positive (late middle T-cell)
2	Itm2a	Mature T-cell
6	Aif1	Macrophages

warning Remember, your clusters may be in a different order! Look for the expression of the marker genes in order to annotate your clusters.

Question

Let’s consider how you might handle a different output. I personally re-ran the same workflow from this tutorial five times and got two different results. Here’s one of the other outputs I got.

Open image in new tab

Figure 23: Another UMAP

What is different about that plot?

How would you adjust your annotation?

While the cells are in the same places (which may not always be the case!), the clustering is different. The large Double positive (middle T-cell) cluster has more evenly divided into three clusters, which has therefore changed the ordering of cluster size.

The cluster annotation would be different: | Clusters | Marker | Cell type | |———-|——————————|———————————-| | 2 | Il2ra | Double negative (early T-cell) | | 0,1,5 | Cd8b1, Cd8a, Cd4 | Double positive (middle T-cell)| | 4 | Cd8b1, Cd8a, Cd4 - high | Double positive (late middle T-cell) | 3 | Itm2a | Mature T-cell | 6 | Aif1 | Macrophages |

The authors weren’t interested in further annotation of the DP cells, so neither are we. Sometimes that just happens. The maths tries to call similar (ish) sized clusters, whether it is biologically relevant or not. Or, the question being asked doesn’t really require such granularity of clusters.

If you have deviated from any of the original parameters in this tutorial, you will likely have a different number of clusters. You will, therefore, need to change the upcoming ‘Annotating clusters’ “Comma-separated list of new categories” accordingly. Best of luck!

Annotating Clusters

To annotate the clusters, we write a list of new cluster names in order from Cluster 0 onwards. In this case, that list is: DP-M3,DP-M1,DN,T-mat,DP-L,DP-M2,Macrophages

Question

Imagine you had that second version of an analysis shared above. The cluster annotation was different: | Clusters | Marker | Cell type | |———-|——————————|———————————-| | 3 | Il2ra | Double negative (early T-cell) | | 0,1,5 | Cd8b1, Cd8a, Cd4 | Double positive (middle T-cell)| | 4 | Cd8b1, Cd8a, Cd4 - high | Double positive (late middle T-cell) | 3 | Itm2a | Mature T-cell | 6 | Aif1 | Macrophages |

What would your cluster names list look like?

Given this new order, your list would be: DP-M3,DP-M1,DN,T-mat,DP-L,DP-M2,Macrophages

Adjust your list according to the expression of the gene markers.

Hands On: Annotating clusters

Manipulate AnnData ( Galaxy version 0.10.9+galaxy1) with the following parameters:

param-file “Annotated data matrix”: DEG_Object

“Function to manipulate the object”: Rename categories of annotation

“Key for observations or variables annotation”: louvain

“Comma-separated list of new categories”: DP-M3,DP-M1,T-mat,DN,DP-M2,DP-L,Macrophages

“Add categories to a new key?”: Yes

“Key name”: cell_type

Rename galaxy-pencil output h5ad Annotated_Object

Now, it’s time to re-plot with these annotations!

Scanpy plot ( Galaxy version 1.10.2+galaxy2) with the following parameters:

param-file “Annotated data matrix”: Annotated_Object

“Method used for plotting”: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'

“Keys for annotations of observations/cells or variables/genes”: batch,Il2ra,Itm2a,sex,Cd8b1,Cd8a,Cd4,genotype,Aif1,Hba-a1,log1p_total_counts,cell_type

“Key for field in ‘.var’ that stores gene symbols”: Symbol

Rename galaxy-pencil output plot Annotated_Plots

Annotated cell types. — **Figure 24**: Our annotated UMAP

Now that we know what we’re dealing with, let’s examine the effect of our variable, proper science!

Question: Genotype

Are there any differences in genotype? Or in biological terms, is there an impact of growth restriction on T-cell development in the thymus?

Open image in new tab

Figure 25: Genotype differences

We can see that DP-L, which seems to be extending away from the DP-M bunch, as well as the mature T-cells (or particularly the top half) are missing some knockout cells. Perhaps there is some sort of inhibition here? INTERESTING! What next? We might look further at the transcripts present in both those populations, and perhaps also look at the genotype marker table… So much to investigate! But before we set you off to explore to your heart’s delight, let’s also look at this a bit more technically.

Technical Assessment

Is our analysis real? Is it right? Well, we can assess that a little bit.

Question: Batch effect

Is there a batch effect?

Open image in new tab

Figure 26: Batch effect?

While some shifts are expected and nothing to be concerned about, DP-L looks to be mainly comprised of N705. There might be a bit of batch effect, so you could consider using batch correction on this dataset. However, if we focus our attention on the other cluster - mature T-cells - where there is batch mixing, we can still assess this biologically even without batch correction. Additionally, we will also look at the confounding effect of sex.

Open image in new tab

Figure 27: Sex differences

We note that the one female sample - unfortunately one of the mere three knockout samples - seems to be distributed in the same areas as the knockout samples at large, so luckily, this doesn’t seem to be a confounding factor and we can still learn from our data. Ideally, this experiment would be re-run with either more female samples all around or swapping out this female from the male sample.

Question: Depth effect

Are there any clusters or differences being driven by sequencing depth, a technical and random factor?

Open image in new tab

Figure 28: Sequencing depth differences

Eureka! This explains the odd DP shift between wildtype and knockout cells - the left side of the DP cells simply have a higher sequencing depth (UMIs/cell) than the ones on the right side. Well, that explains some of the sub-cluster that we’re seeing in that splurge. Importantly, we don’t see that the DP-L or (mostly) the mature T-cell clusters are similarly affected. So, whilst again, this variable of sequencing depth might be something to regress out somehow, it doesn’t seem to be impacting our dataset. The less you can regress/modify your data, in general, the better - you want to stay as true as you can to the raw data, and only use maths to correct your data when you really need to (and not to create insights where there are none!).

Question: Sample purity

Do you think we processed these samples well enough?

Open image in new tab

Figure 29: Hemoglobin across clusters

We have seen in the previous images that these clusters are not very tight or distinct, so we could consider stronger filtering. Additionally, hemoglobin - a red blood cell marker that should NOT be found in T-cells - appears throughout the entire sample in low numbers. This suggests some background in the media the cells were in, and we might consider in the wet lab trying to get a purer, happier sample, or in the dry lab, techniques such as SoupX or others to remove this background. Playing with filtering settings (increasing minimum counts/cell, etc.) is often the place to start in these scenarios.

Question: Clustering resolution

Do you think the clustering is appropriate? i.e. are there single clusters that you think should be separate, and multiple clusters that could be combined?

Open image in new tab

Figure 30: Itm2a across clusters

Important to note, lest all bioinformaticians combine forces to attack the biologists: just because a cluster doesn’t look like a cluster by eye is NOT enough to say it’s not a cluster! But looking at the biology here, we struggled to find marker genes to distinguish the DP population, which we know is also affected by depth of sequencing. That’s a reasonable argument that DP-M1, DP-M2, and DP-M3 might not be all that different. Maybe we need more depth of sequencing across all the DP cells, or to compare these explicitly to each other (consider variations on FindMarkers!). However, DP-L is both seemingly leaving the DP cluster and also has fewer knockout cells, so we might go and look at what DP-L is expressing in the marker genes. If we look at T-mat further, we can see that its marker gene - Itm2a - is only expressed in half of the cluster. You might consider sub-clustering this to investigate further, either through changing the resolution or through analysing this cluster alone. If we look at the differences between genotypes alone (so the pseudo-bulk), we can see that most of the genes in that list are actually ribosomal. This might be a housekeeping background, this might be cell cycle related, this might be biological, or all three. You might consider investigating the cycling status of the cells, or even regressing this out (which is what the authors did).

Ultimately, there are quite a lot ways to analyse the data, both within the confines of this tutorial (the many parameters that could be changed throughout) and outside of it (batch correction, sub-clustering, cell-cycle scoring, inferred trajectories, etc.) Most analyses will still yield the same general output, though: there are fewer knockout cells in the mature T-cell population.

congratulations Congratulations! You have interpreted your plots in several important ways!

Interactive visualisations

Before we leave you to explore the unknown, you might have noticed that the above interpretations are only a few of the possible options. Plus you might have had fun trying to figure out which sample is which genotype is which sex and flicking back and forth between plots repeatedly. Figuring out which plots will be your final publishable plots takes a lot of time and testing. Luckily, there is a helpful interactive viewer Cakir et al. 2020 export tool that can help you explore without having to produce new plots over and over!

Hands On: Cellxgene

Interactive CELLxGENE VIP Environment with the following parameters:

param-file “Concatenate dataset”: Annotated_Object

“Var field for gene symbols”: Symbol

“Make specified var field unique”: galaxy-toggle Yes

When ready, you will see a message

details There is an InteractiveTool result view available, click here to display <—- Click there!

Sometimes this link can aggravate a firewall or something similar. It should be fine to go to the site.

You will be asked to name your annotation. Do so, then you can start playing around!

You will need to STOP this active environment in Galaxy by going to User, Interactive Tools, selecting the environment, and selecting Stop. You may also want to delete the dataset in the history, because otherwise it continues appearing as if it’s processing.

Be warned - this visualisation tool is a powerful option for exploring your data, but it takes some time to get used to. Consider exploring it as your own tutorial for another day!

Conclusion

Hopefully, no matter which pathway of analysis you took, you found the same general interpretations. If not, this is a good time to discuss and consider with your group why that might be - what decision was ‘wrong’ or ‘ill-advised’, and how would you go about ensuring you correctly interpreted your data in the future? Top tip - trial and error is a good idea, believe it or not, and the more ways you find the same insight, the more confident you can be! But nothing beats experimental validation… For those that did not take the ‘control’ options, please

Rename your history (by clicking on the history title) as DECISION-Filtering and Plotting Single-cell RNA-seq Data

Add a history annotation history-annotate that includes which parameters you changed/steps you changed from the control

Sharing your history allows others to import and access the datasets, parameters, and steps of your history.

Access the history sharing menu via the History Options dropdown (galaxy-history-options), and clicking “history-share Share or Publish”

Share via link

Open the History Options galaxy-history-options menu at the top of your history panel and select “history-share Share or Publish”

galaxy-toggle Make History accessible

A Share Link will appear that you give to others

Anybody who has this link can view and copy your history

Publish your history

galaxy-toggle Make History publicly available in Published Histories

Anybody on this Galaxy server will see your history listed under the Published Histories tab opened via the galaxy-histories-activity Histories activity

Share only with another user.

Enter an email address for the user you want to share with in the Please specify user email input below Share History with Individual Users

Your history will be shared only with this user.

Finding histories others have shared with me

Click on the galaxy-histories-activity Histories activity in the activity bar on the left

Click the Shared with me tab

Here you will see all the histories others have shared with you directly

Note: If you want to make changes to your history without affecting the shared version, make a copy by going to History Options galaxy-history-options icon in your history and clicking Copy this History

Feel free to explore any other similar histories

congratulations Congratulations! You’ve made it to the end!

You might find the galaxy-history-answer Answer Key Histories helpful to check or compare with:

UseGalaxy.eu_Answer_Key_History

UseGalaxy.eu-ARCHIVEDv2

UseGalaxy.eu-ARCHIVEDv1

Older_version

You can also run this entire tutorial via a galaxy-workflows-activity Workflow, after performing the Get data step initially.

Tutorial Workflow

In this tutorial, you moved from technical processing to biological exploration. By analysing real data - both the exciting and the messy! - you have, hopefully, experienced what it’s like to analyse and question a dataset, potentially without clear cut-offs or clear answers. If you were working in a group, you each analysed the data in different ways, and most likely found similar insights. One of the biggest problems in analysing scRNA-seq is the lack of a clearly defined pathway or parameters. You have to make the best call you can as you move through your analysis, and ultimately, when in doubt, try it multiple ways and see what happens!

feedback To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and discuss with fellow users of Galaxy single cell analysis tools on #single-cell-users

We also post new tutorials / workflows there from time to time, as well as any other news.

point-right If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

tool You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet

You've Finished the Tutorial

Key points

Single-cell data is huge. To analyse single-cell data, you must reduce its many dimensions (cells x genes) via numerous methods

Analysis is more subjective than we think. We need both prior, biological knowledge of the samples as well as many iterations of analysis, in order to derive novel, real biological insights

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li et al., 2014 The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology 32: 381–386. 10.1038/nbt.2859
Satija, R., J. A. Farrell, D. Gennert, A. F. Schier, and A. Regev, 2015 Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33: 495–502. 10.1038/nbt.3192
McCarthy, D. J., K. R. Campbell, A. T. L. Lun, and Q. F. Wills, 2017 Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33: 10.1093/bioinformatics/btw777
Bacon, W. A., R. S. Hamilton, Z. Yu, J. Kieckbusch, D. Hawkes et al., 2018 Single-Cell Analysis Identifies Thymic Maturation Delay in Growth-Restricted Neonatal Mice. Frontiers in Immunology 9: 10.3389/fimmu.2018.02523
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19: 10.1186/s13059-017-1382-0
Cakir, B., M. Prete, N. Huang, S. van Dongen, P. Pir et al., 2020 Comparison of visualization tools for single-cell RNAseq data. NAR Genomics and Bioinformatics 2: 10.1093/nargab/lqaa052

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Wendi Bacon, Filter, plot and explore single-cell RNA-seq data with Scanpy (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{single-cell-scrna-case_basic-pipeline,
author = "Wendi Bacon",
	title = "Filter, plot and explore single-cell RNA-seq data with Scanpy (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings:

tutorial Hands-on: Inferring single cell trajectories with Scanpy (Python)

tutorial Hands-on: Inferring single cell trajectories with Monocle3

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: anndata_ops
  owner: ebi-gxa
  revisions: 890fb06a2893
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_ops
  owner: ebi-gxa
  revisions: 7ebc22f77d86
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: retrieve_scxa
  owner: ebi-gxa
  revisions: 72b6e1747e37
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_compute_graph
  owner: ebi-gxa
  revisions: bc9e8de21efb
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_compute_graph
  owner: ebi-gxa
  revisions: 6417cccad133
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter_cells
  owner: ebi-gxa
  revisions: 55ccc0a121bc
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter_cells
  owner: ebi-gxa
  revisions: e9283529cba1
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter_genes
  owner: ebi-gxa
  revisions: da611a744112
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter_genes
  owner: ebi-gxa
  revisions: 474a69c47184
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_cluster
  owner: ebi-gxa
  revisions: ad6995afd2e6
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_cluster
  owner: ebi-gxa
  revisions: 2ccd9f9e2cd0
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_markers
  owner: ebi-gxa
  revisions: 49653e680ea4
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_markers
  owner: ebi-gxa
  revisions: 81c34e67f262
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_variable_genes
  owner: ebi-gxa
  revisions: 4667fe07bd68
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_find_variable_genes
  owner: ebi-gxa
  revisions: 2861ca3bfe4d
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_normalise_data
  owner: ebi-gxa
  revisions: 8d9c22419c56
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_normalise_data
  owner: ebi-gxa
  revisions: 6b97ffba31da
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot_embed
  owner: ebi-gxa
  revisions: 56d22aa6cb3b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot_embed
  owner: ebi-gxa
  revisions: 35f6611fb8c3
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_pca
  owner: ebi-gxa
  revisions: 3ee74b58c72d
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_pca
  owner: ebi-gxa
  revisions: 59028945c857
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_tsne
  owner: ebi-gxa
  revisions: 8b0d6f6f44c8
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_tsne
  owner: ebi-gxa
  revisions: 181d61d2f3cd
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_umap
  owner: ebi-gxa
  revisions: fe562424a0a6
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_run_umap
  owner: ebi-gxa
  revisions: bd4a244faeb8
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_scale_data
  owner: ebi-gxa
  revisions: 5605f8452eb3
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_scale_data
  owner: ebi-gxa
  revisions: f86b2c3bc459
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_inspect
  owner: iuc
  revisions: afe044e6bc93
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_inspect
  owner: iuc
  revisions: ee98d611afc6
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_manipulate
  owner: iuc
  revisions: be1fdce914c6
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: anndata_manipulate
  owner: iuc
  revisions: 3d748954434b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_cluster_reduce_dimension
  owner: iuc
  revisions: a84a73471be7
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_filter
  owner: iuc
  revisions: 5813f9920877
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_inspect
  owner: iuc
  revisions: d86cb2ce311b
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_inspect
  owner: iuc
  revisions: 3b5ef3b00cdd
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_normalize
  owner: iuc
  revisions: 56c0d4acef02
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot
  owner: iuc
  revisions: 5a5468c0d26f
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: scanpy_plot
  owner: iuc
  revisions: aa0c474463c2
  tool_panel_section_label: Single-cell
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 5

t{ hist[0] | to_stars }} 2

t{ hist[0] | to_stars }} 1

November 2024

4 stars: Liked: Easy and very tutorial-like Disliked: More ways to look into this data. I think that it was necessary to know this type of pipeline before, as many of this data, like the PCA, didn't tell much for me.

September 2024

4 stars: Liked: How we were taken through the process step by step - explaining what each step achieved. This was also easy to visualise as certain steps (e.g. filtering) had been broken down into substeps so we could see the impact of each subsequent filter.

5 stars: Liked: colours and visual display Disliked: all good

November 2023

5 stars: Liked: Questions and answers

September 2022

5 stars: Liked: Clear, straight forward, but also information rich.

March 2022

5 stars: Liked: Everything because all was new for me Disliked: The tutorial video would be more detailed in some parts.

July 2021

2 stars: Disliked: From find markers on I couldn't produce the desired output. If a folow the link in galaxy to Scanpy findMarkers it produces an empty table; if a seach this tool in galaxy, it produces complete tables but in the end I can't produce the plots with scanpy plot embebed (gives error). The video is to quick and does not explain what is being doing, it is just a lot of clicks. Taking in consideration the international audience, jargon should not be used. This is a shame, because it is such an interesting topic.