Importing files from public atlases

Author(s): Julia Jakiela, Wendi Bacon
Reviewers: Björn Grüning, Julia Jakiela, Wendi Bacon, Helena Rasche, Pablo Moreno, Pavankumar Videm, Mehmet Tekman, Saskia Hiltemann
Overview
Questions:
  • How do I use the EBI Single Cell Expression Atlas and Human Cell Atlas?

  • How can I reformat and manipulate the downloaded files to create the correct input for downstream analysis?

Objectives:
  • You will retrieve raw data from the EBI Single Cell Expression Atlas and Human Cell Atlas.

  • You will manipulate the metadata and matrix files.

  • You will combine the metadata and matrix files into an AnnData or Seurat object for downstream analysis.

Requirements:
Time estimation: 15 minutes
Supporting Materials:
Published: Nov 14, 2023
Last modification: Sep 13, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00374
Revision: 5

Introduction

Public single cell datasets seem to accumulate by the second. Well-annotated, high-quality datasets are slightly trickier to find, which is why projects like the Single Cell Expression Atlas (SCXA) exist: to curate datasets for public use. Here, we will guide you through transforming data imported from the SCXA repository into the input file required for the Filter, Plot, Explore tutorial, and we will also show how to use the public atlases for your own research.

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Getting data from the Single Cell Expression Atlas
    1. Examine the imported files
    2. Metadata manipulation
    3. Check mitochondrial gene name format
    4. Creating the AnnData object
    5. AnnData manipulation
    6. Conclusion
    7. Creating the Seurat Object
    8. Conclusion
  3. Human Cell Atlas Matrix Downloader

Getting data from the Single Cell Expression Atlas

Galaxy has a specific tool for importing data from the SCXA (Moreno et al. 2020), which combines all the preprocessing steps shown in the corresponding tutorial into one! The dataset used in this tutorial is available at the EBI under the experiment ID E-MTAB-6945.

You can search for datasets according to various criteria, either by using the search box in the Home tab or by choosing the kingdom, experiment collection, technology type (and others) in the Browse experiments tab. When you find the experiment you are interested in, just click on it and the experiment ID will be displayed in the website URL, as shown below.

Arrow pointing to the website URL where you can find experiment ID.

Figure 1: Where to find experiment ID on the EBI Single Cell Expression Atlas website.

Once you know the experiment ID, you can use the EBI SCXA Data Retrieval tool in Galaxy!

Hands-on: Retrieving data from Single Cell Expression Atlas
  1. EBI SCXA Data Retrieval (Galaxy version v0.0.2+galaxy2) with the following parameters:
    • “SC-Atlas experiment accession”: E-MTAB-6945
    • “Choose the type of matrix to download”: Raw filtered counts

It’s important to note that this matrix is processed somewhat through the SCXA pipeline, which is quite similar to the pre-processing that has been shown in this case study tutorial series. The resultant datasets contain any and all metadata provided by the SCXA pipeline as well as the metadata contributed by the original authors (for instance, more cell or gene annotations). So while the AnnData object generated at the end of this tutorial will be similar to that generated using the Alevin workflows on the original FASTQ files, some of the metadata will be slightly different. Relevant results and interpretation will not change, however!

Examine the imported files

Question
  1. What format has this tool imported?

Selecting the title of each resultant dataset will expand the dataset in the Galaxy history.

Matrix Market Format! We can tell this because our first file helpfully says MatrixMarket in the first line.

Green box containing first output, the matrix.mtx file. Columns are labelled 14458, 5218, and 5308559.

Figure 2: Matrix Market Output

This matrix.mtx file, in Matrix Market format, contains one column referring to each gene (column 1), one referring to each cell (column 2), and the expression values themselves in the final column. To be useful, then, we need to know which genes and cells the numbers refer to. That’s why this format comes with two more files.
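To make the three-column layout concrete, here is a minimal Python sketch that parses a coordinate-style Matrix Market stream: a banner line, optional `%` comment lines, a size line, and then one "row column value" triplet per stored entry. The function name `read_matrix_market` and the tiny 3 x 2 example are made up for illustration; this is not how the Galaxy tools read the file.

```python
import io

def read_matrix_market(handle):
    """Parse a coordinate-format Matrix Market stream into (shape, entries)."""
    banner = handle.readline()
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    line = handle.readline()
    # Skip optional comment lines, which start with '%'
    while line.startswith("%"):
        line = handle.readline()
    # The size line holds: number of rows, number of columns, stored entries
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = []
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        # Row and column indices are 1-based in this format
        entries.append((int(row), int(col), float(value)))
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries

# Made-up example: 3 genes x 2 cells with 3 non-zero counts
example = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
shape, entries = read_matrix_market(example)
print(shape)
print(entries)
```

Each triplet only carries index numbers, which is why the gene and barcode tables are needed to translate row 3 or column 2 back into a gene name or a cell barcode.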

Green box containing second output, the genes.tsv file. The first column contains EnsemblIDs such as ENSMUSG######, while the second column contains gene names. There are 14,457 lines.

Figure 3: Genes Output

The genes.tsv file lists each EnsemblID and its gene name. The number of lines (14,457) corresponds with the 14458 in the Matrix file…but the 14458 includes a header, so that’s why it has one more than the genes file!

Green box containing third output, the barcodes.tsv file. The file consists of 5,217 lines and a single column containing the cell barcode, variations of ERR2704656-AAAACACTCTGA.

Figure 4: Cells Output

The barcodes.tsv file lists each barcode. The number of lines (5,217) again corresponds with the 5,218 in the Matrix file…which adds in the header again!

Green box containing fourth output, the exp_design.tsv file. The file consists of 5,218 lines and numerous columns starting with 'Assay' and 'Sample Characteristic'.

Figure 5: Experimental Design

Finally, and helpfully, the tool also includes cell metadata, where the Assay column corresponds with the barcodes in the barcodes.tsv file. While this file is not required to create an AnnData object from the three Matrix Market files, it is essential for actually interpreting the data. Imagine not knowing which barcodes came from which sample!
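If you ever work with these files outside Galaxy, a quick sanity check is to confirm that every barcode has a matching row in the Assay column. The sketch below uses two in-memory example tables; the second barcode and the metadata values are invented for illustration, and only the column name Assay comes from the file described above.

```python
import csv
import io

# Invented stand-ins for barcodes.tsv and exp_design.tsv
barcodes_tsv = "ERR2704656-AAAACACTCTGA\nERR2704656-AAAACGGTTGCC\n"
exp_design_tsv = (
    "Assay\tSample Characteristic[individual]\n"
    "ERR2704656-AAAACACTCTGA\t1\n"
    "ERR2704656-AAAACGGTTGCC\t1\n"
)

# One barcode per line, no header
barcodes = [line.strip() for line in io.StringIO(barcodes_tsv) if line.strip()]

# The metadata table has a header row, so DictReader keys rows by column name
rows = list(csv.DictReader(io.StringIO(exp_design_tsv), delimiter="\t"))
assays = {row["Assay"] for row in rows}

# Barcodes without metadata would be impossible to assign to a sample
missing = [barcode for barcode in barcodes if barcode not in assays]
print(missing)
```

An empty list means every cell in the matrix can be linked to its sample information.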

Metadata manipulation

At this point you might want to modify the files before downstream analysis. That can include reformatting the cell metadata or changing the names of the column headers; it all depends on your dataset and how you want to perform your analysis. It’s also fine to transform those files straight away. Here, we will show an extended version of metadata manipulation which allows us to create an input file consistent with the next tutorial workflow.

Before creating an AnnData object, we need to make a small modification to the experimental design table. The dataset contains information about the 7 experimental samples (N701 – N707). However, in the exp_design.tsv dataset, which contains the cell metadata, these samples are just numbered from 1 to 7.

You can preview this column in the exp_design.tsv dataset by selecting the eye icon in the Galaxy history. If you scroll to the right to the column Sample Characteristic[individual], you will find the batch information. Don’t worry, we’re about to rename and reformat this whole dataset with more useful titles. Make a note of the number of that column (number 12), as we will need it to change the batch number to a batch name shortly.

The plotting tool that we will use later will fail if the entries are integers and not categorical values, so we will change 1 to N01 and so on.

Hands-on: Change batch numbers into names
  1. Change the datatype of EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv to tabular:

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • In Assign Datatype, select tabular from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  2. Column Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:

    • “Select cells from”: EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv
    • “using column”: Column: 12
    • In “Check”:
      • “Insert Check”
        • “Find Regex”: 1
        • “Replacement”: N01
      • “Insert Check”
        • “Find Regex”: 2
        • “Replacement”: N02
      • “Insert Check”
        • “Find Regex”: 3
        • “Replacement”: N03
      • “Insert Check”
        • “Find Regex”: 4
        • “Replacement”: N04
      • “Insert Check”
        • “Find Regex”: 5
        • “Replacement”: N05
      • “Insert Check”
        • “Find Regex”: 6
        • “Replacement”: N06
      • “Insert Check”
        • “Find Regex”: 7
        • “Replacement”: N07
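For readers who prefer to see the transformation as code, the following Python sketch performs an equivalent renaming on a small in-memory table. It swaps the seven separate checks for a single anchored regular expression, which is a simplification chosen for this example and not what the Galaxy tool does internally. The function name `rename_batches` and the three-column table are illustrative; in the real file the batch column is number 12, which would be index 11 in zero-based Python.

```python
import re

def rename_batches(rows, column_index):
    """Turn a single-digit batch label such as '1' into 'N01' in one column."""
    renamed = []
    for row in rows:
        row = list(row)
        # Anchors (^ and $) keep the header and any non-numeric cell untouched
        row[column_index] = re.sub(r"^(\d)$", r"N0\1", row[column_index])
        renamed.append(row)
    return renamed

# Illustrative table: header plus two cells, batch label in the last column
table = [
    ["Assay", "Sample Characteristic[sex]", "Sample Characteristic[individual]"],
    ["ERR2704656-AAAACACTCTGA", "male", "1"],
    ["ERR2704656-AAAACGGTTGCC", "male", "7"],
]
result = rename_batches(table, column_index=2)
for row in result:
    print("\t".join(row))
```

Because the labels become text such as N01 rather than integers, downstream plotting tools can treat the batch as a categorical value.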

While we’re renaming things, let’s also fix our titles.

Hands-on: Change cell metadata titles
  1. Replace parts of text (Galaxy version 1.1.4) with the following parameters:
    • “Select lines from”: output from the Column Regex Find And Replace tool
    • In “Find and Replace”:
      • “Find pattern”: "Sample Characteristic[genotype]"
      • “Replace with”: genotype
    • “Insert Find and Replace”
      • “Find pattern”: "Sample Characteristic[individual]"
      • “Replace with”: batch
    • “Insert Find and Replace”
      • “Find pattern”: "Sample Characteristic[sex]"
      • “Replace with”: sex
    • “Insert Find and Replace”
      • “Find pattern”: "Sample Characteristic[cell type]"
      • “Replace with”: cell_type
  2. Rename the output to Cell metadata
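The same header clean-up can be sketched in Python as a small lookup table applied to the header line. The mapping mirrors the four replacements listed above; the function name `rename_header` and the example header are assumptions made for this illustration. Plain string replacement is used here so that the square brackets are treated literally rather than as regular expression syntax.

```python
# Old column titles mapped to the shorter names used in the next tutorial
RENAMES = {
    "Sample Characteristic[genotype]": "genotype",
    "Sample Characteristic[individual]": "batch",
    "Sample Characteristic[sex]": "sex",
    "Sample Characteristic[cell type]": "cell_type",
}

def rename_header(header_line):
    """Apply every rename to a tab-separated header line."""
    for old, new in RENAMES.items():
        # str.replace is literal, so '[' and ']' need no escaping
        header_line = header_line.replace(old, new)
    return header_line

# Illustrative header containing two of the four columns
header = "Assay\tSample Characteristic[genotype]\tSample Characteristic[individual]"
renamed = rename_header(header)
print(renamed)
```

Columns that are not in the mapping, such as Assay, pass through unchanged.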

Check mitochondrial gene name format

We might like to flag mitochondrial genes. They can be identified quite easily since - depending on the species and formatting convention - their names often start with mt. Since tools for flagging mitochondrial genes are often case-sensitive, it might be a good idea to check the formatting of the mitochondrial genes in our dataset.

Hands-on: Check the format of mitochondrial gene names
  1. Search in textfiles (Galaxy version 1.1.1) with the following parameters:
    • “Select lines from”: EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)
    • “that”: Match
    • “Regular Expression”: mt
    • “Match type”: case insensitive
    • “Output”: Highlighted HTML (for easier viewing)
  2. Rename the output to Mito genes check

If you click on that dataset, you will see all the genes containing mt in their name. We can now clearly see that mitochondrial genes in our dataset start with mt-. Keep that in mind, as we might use it in a moment!
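The check above amounts to a case-insensitive search followed by a look at how the matching names are written. A minimal Python sketch of that idea is shown below; the gene table is a made-up stand-in for genes.tsv with placeholder identifiers, and it deliberately includes a non-mitochondrial gene that also contains "mt".

```python
import re

# Made-up stand-in for genes.tsv: (placeholder identifier, gene name)
genes = [
    ("example_id_1", "mt-Nd1"),
    ("example_id_2", "Mtor"),
    ("example_id_3", "Gapdh"),
    ("example_id_4", "mt-Co1"),
]

# Case-insensitive search, like the "case insensitive" match type above
pattern = re.compile("mt", re.IGNORECASE)
hits = [name for _, name in genes if pattern.search(name)]

# Among the hits, collect the exact spelling of the 'mt-' style prefix
prefixes = sorted({name[:3] for name in hits if name.lower().startswith("mt-")})

print(hits)
print(prefixes)
```

The search alone also returns genes such as Mtor, so it is the exact prefix, including its case and the dash, that matters when flagging mitochondrial genes later.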

Now we can create our single cell object!

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths.

You can choose whether you want to create an AnnData object for Scanpy Analysis or an RDS object for Seurat Analysis. Galaxy has more resources for Scanpy analysis, but sometimes Seurat might have what you want. The two packages are constantly trying to outdo the other! It often depends on what is more 'standard' in your work environment!

Creating the AnnData object

We will do several modifications within the AnnData object so that you can follow the next tutorial.

Hands-on: Create the AnnData Object
  1. Scanpy Read10x (Galaxy version 1.8.1+galaxy9)
  2. Make sure you are using version 1.8.1+galaxy9 of the tool (change by clicking on the Versions button):

List of available tool versions shown when clicking on the 'Versions' button on the top of the page.

Figure 6: How to change the version of the tool
  3. Set the following parameters:
    • “Expression matrix in sparse matrix format (.mtx)”: EBI SCXA Data Retrieval on E-MTAB-6945 matrix.mtx (Raw filtered counts)
    • “Gene table”: EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)
    • “Barcode/cell table”: EBI SCXA Data Retrieval on E-MTAB-6945 barcodes.tsv (Raw filtered counts)
    • “Cell metadata table”: Cell metadata
  4. Rename the output to AnnData object

AnnData manipulation

We will now change the header of the column containing gene names from gene_symbols to Symbol. This edit is only needed to make our AnnData object compatible with this tutorial’s workflow. We will also flag the mitochondrial genes.

And the good news is that we can do both those steps using only one tool!

Hands-on: Modify AnnData object
  1. AnnData Operations (Galaxy version 1.8.1+galaxy92)
  2. Make sure you are using version 1.8.1+galaxy92 of the tool (change by clicking on the Versions button)
  3. Set the following parameters:
    • “Input object in hdf5 AnnData format”: AnnData object
    • In “Change field names in AnnData var”:
      • “Insert Change field names in AnnData var”
        • “Original name”: gene_symbols
        • “New name”: Symbol
    • “Gene symbols field in AnnData”: Symbol
    • In “Flag genes that start with these names”:
      • “Insert Flag genes that start with these names”
        • “Starts with”: mt-
        • “Var name”: mito
  4. Rename the output to Mito-counted AnnData for downstream analysis

And that’s all! What’s even more exciting about the AnnData Operations tool is that it automatically calculates a bunch of metrics, such as log1p_mean_counts, log1p_total_counts, mean_counts, n_cells, n_cells_by_counts, n_counts, pct_dropout_by_counts, and total_counts. Amazing, isn’t it?
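To give a feel for what flagging and a few of these per-gene metrics mean, here is a small pure-Python sketch on a dense toy matrix. The definitions follow the usual Scanpy-style meaning of the names (cells with a non-zero count, totals, means, their log1p values, and the percentage of cells with zero counts); whether the Galaxy tool computes every metric in exactly this way is an assumption. The function name `gene_qc` and the counts are invented for illustration.

```python
import math

def gene_qc(gene_symbols, counts, prefix="mt-"):
    """Per-gene flags and metrics; counts[g][c] is gene g's count in cell c."""
    metrics = []
    for symbol, row in zip(gene_symbols, counts):
        n_cells = len(row)
        expressing = sum(1 for value in row if value > 0)
        total = sum(row)
        mean = total / n_cells
        metrics.append({
            "Symbol": symbol,
            # Case-sensitive prefix test, mirroring the "Starts with" setting
            "mito": symbol.startswith(prefix),
            "n_cells_by_counts": expressing,
            "total_counts": total,
            "mean_counts": mean,
            "log1p_total_counts": math.log1p(total),
            "log1p_mean_counts": math.log1p(mean),
            "pct_dropout_by_counts": 100 * (1 - expressing / n_cells),
        })
    return metrics

# Toy data: 2 genes measured in 4 cells
symbols = ["mt-Nd1", "Gapdh"]
counts = [
    [4, 0, 0, 4],
    [1, 2, 3, 2],
]
qc = gene_qc(symbols, counts)
for record in qc:
    print(record)
```

In this toy example the first gene is flagged as mitochondrial and is absent from half of the cells, while the second gene is detected in every cell.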

Conclusion

Now you can use this object as input for the Filter, Plot, Explore tutorial and its associated workflow!

Even though this tutorial was designed specifically to modify the AnnData object to be compatible with the subsequent tutorial, it also shows useful tools that you can use for your own, independent data analysis. The workflow and the answer key history are available with this tutorial. However, if you want to use the workflow from this tutorial, keep in mind that different datasets may have different column names, so check them first and only then modify them.

Creating the Seurat Object

Hands-on: Create the Seurat Object
  1. Seurat Read10x (Galaxy version 4.0.4+galaxy0)
  2. Set the following parameters:
    • “Expression matrix in sparse matrix format (.mtx)”: EBI SCXA Data Retrieval on E-MTAB-6945 matrix.mtx (Raw filtered counts)
    • “Gene table”: EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)
    • “Barcode/cell table”: EBI SCXA Data Retrieval on E-MTAB-6945 barcodes.tsv (Raw filtered counts)
    • “Cell metadata”: Cell metadata
  3. Rename the output to Seurat object

You can also choose whether to create a Seurat object, a Loom file, or a Single Cell Experiment object by selecting your option in “Choose the format of the output”.

Conclusion

And you’re there! You now have a usable Seurat object for analysis with Seurat tools in your history! Congrats!

Human Cell Atlas Matrix Downloader

Another public atlas that you can use to access datasets is the Human Cell Atlas data portal. We will show you the Galaxy tool that allows you to retrieve expression matrices and metadata for any public experiment available in that repository.

To use it, simply set the project title, project label or project UUID, which can be found at the HCA data browser, and select the desired matrix format (Matrix Market or Loom).

Image showing project UUID as a final fragment of link address, project title (self-explanatory) and project label as an entry in the box on the right side of the page.

Figure 7: Where to find project title, project label and project UUID

For projects that have more than one organism, one needs to be specified. Otherwise, there is no need to set the species.

Let’s use the suggested example of the project Single cell transcriptome analysis of human pancreas. If you check this project in HCA, you’ll find out that it’s actually its label. But it should work well if you enter the title or UUID!

Hands-on: Create AnnData object

Human Cell Atlas Matrix Downloader (Galaxy version v0.0.4+galaxy0) with the following parameters:

  • “Human Cell Atlas project name/label/UUID”: Single cell transcriptome analysis of human pancreas
  • “Choose the format of matrix to download”: Matrix Market
Warning: Errors that you might encounter

If your dataset turns red, there might be several reasons for that, for example:

  • “There are too many connected users” - please be patient and re-run the step later, as advised
  • “Project identifier was not found in the database” - double check the spelling, try entering project title, project label or project UUID.

When “Matrix Market” is selected, outputs are in 10X-compatible Matrix Market format:

  • Matrix (txt): Contains the expression values for genes (rows) and cells (columns) in raw counts. This text file is formatted as a Matrix Market file, and as such it is accompanied by separate files for the gene identifiers and the cells identifiers.
  • Genes (tsv): Identifiers (column repeated) for the genes present in the matrix of expression, in the same order as the matrix rows.
  • Barcodes (tsv): Identifiers for the cells of the data matrix. The file is ordered to match the columns of the matrix.
  • Experiment Design file (tsv): Contains metadata for the different cells of the experiment.

When “Loom” is selected, output is a single Loom HDF5 file:

  • Loom (h5): Contains expression values for genes (rows) and cells (columns) in raw counts, cell metadata table and gene metadata table, in a single HDF5 file.

If you chose the Loom format and you need to convert your file to another datatype, you can use SCEasy (Galaxy version 0.0.7+galaxy1) (more details in the next section). If you chose the Matrix Market format, you can then transform the output to AnnData or Seurat, as shown in the EBI SCXA example above. Below, you will find an example of transforming the output to an AnnData object.

Hands-on: Create AnnData object

Scanpy Read10x (Galaxy version 1.8.1+galaxy9) with the following parameters. Make sure you are using version 1.8.1+galaxy9 of the tool (change by clicking on the Versions button).

  • “Expression matrix in sparse matrix format (.mtx)”: Human Cell Atlas Matrix Downloader on matrix.mtx
  • “Gene table”: Human Cell Atlas Matrix Downloader on genes.tsv
  • “Barcode/cell table”: Human Cell Atlas Matrix Downloader on barcodes.tsv
  • “Cell metadata table”: Human Cell Atlas Matrix Downloader on exp_design.tsv

After you create the AnnData file, you can additionally use the AnnData Operations (Galaxy version 1.8.1+galaxy92) tool (note the version 1.8.1+galaxy92) before downstream analysis. It’s quite a useful tool since it not only flags mitochondrial genes, but also automatically calculates a bunch of metrics, such as log1p_mean_counts, log1p_total_counts, mean_counts, n_cells, n_cells_by_counts, n_counts, pct_dropout_by_counts, and total_counts.

When you use it to flag mitochondrial genes, here are some formatting tips:

  • Remember to check the name of the column with gene symbols
  • This tool is case sensitive
  • No parentheses needed when typing in the values
  • Including a dash is important to identify mitochondrial genes (e.g. MT-)

You can have a look at the answer history of performing those three quick steps.