Reference Data with CVMFS without Ansible

Author(s): Simon Gladman, Helena Rasche

Overview
Objectives:

Have an understanding of what CVMFS is and how it works

Install and configure the CVMFS client on a Linux machine and mount the Galaxy reference data repository

Configure your Galaxy to use these reference genomes and indices

Time estimation: 1 hour

Published: Jun 17, 2020

Last modification: Jul 12, 2026

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00004

Revision: 11

Overview

CernVM-FS (CVMFS) is a distributed filesystem designed for sharing read-only data across the globe. The Galaxy Project uses it to share data that many Galaxy servers need, namely:

Reference Data
- Genome sequences for hundreds of useful species.
- Indices for the genome sequences
- Various bioinformatic tool indices for the available genomes
Tool containers
- Singularity containers of everything stored in Biocontainers (a bioinformatic tool container repository). You get these for free every time you build a Bioconda recipe/package for a tool.
Others too.

From the CERN website:

“The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.”

A slideshow presentation on this subject is available. More details are available about the reference data setup and CVMFS system of usegalaxy.org (Galaxy Main).

This exercise describes a manual process to install and configure CVMFS and Galaxy’s access to CVMFS. For a tutorial that uses Ansible to perform these tasks, see the Reference Data with CVMFS tutorial.

Agenda

Overview

CVMFS and Galaxy without Ansible

Configuring CVMFS

Testing it out

Look at the repository

CVMFS and Galaxy without Ansible

Comment: Manual version of Ansible Commands

If you wish to achieve the same result as the Ansible tutorial, but by running each step manually instead of using the Ansible playbook, follow these instructions. If you have already completed the Ansible tutorial, everything below is already done and you do not need to repeat it.

We are going to set up a CVMFS mount of the Galaxy reference data repository on our machines. To do this, we have to install and configure the CVMFS client and then mount the appropriate CVMFS repository using the publicly available keys.

Hands On: Installing the CVMFS Client
On your remote machine, first install the CERN software apt repository, and then the CVMFS client and config utility:
sudo apt install lsb-release
wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
sudo dpkg -i cvmfs-release-latest_all.deb
rm -f cvmfs-release-latest_all.deb
sudo apt-get update

sudo apt install cvmfs cvmfs-config
Now we need to run the CVMFS setup script.
sudo cvmfs_config setup
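Before moving on, it can help to confirm that the install steps above actually put the expected tools on the PATH. The following is a minimal sketch only; the helper name require_cmds is hypothetical and is not part of CVMFS.

```bash
# Hypothetical helper (not part of CVMFS): report which of the given
# commands are missing from PATH. Returns non-zero if any are missing.
require_cmds() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

After a successful install, running require_cmds cvmfs_config should exit with status 0 and print nothing.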

Configuring CVMFS

Configuring CVMFS is straightforward:

Hands On: Configuring CVMFS
Create a /etc/cvmfs/default.local file with the following contents:
CVMFS_REPOSITORIES="data.galaxyproject.org"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="500"
CVMFS_CACHE_BASE="/srv/cvmfs/cache"
CVMFS_USE_GEOAPI=yes
This tells CVMFS to mount the Galaxy reference data repository, to use a specific location for its cache (limited to 500 MB), and to use the instance’s geo-location to choose the best CVMFS repo server to connect to. CVMFS_QUOTA_LIMIT is given in megabytes; adjust it to control the cache size (the Ansible role exposes the same setting as the cvmfs_quota_limit variable).

If you also want to mount the BRC Analytics pathogen data and Vertebrate Genomes Project repositories, include them in CVMFS_REPOSITORIES:
CVMFS_REPOSITORIES="data.galaxyproject.org,brc.galaxyproject.org,vgp.galaxyproject.org"
In production, UseGalaxy.org.au uses a 100 GB cache. Different sites have different needs, and you can make your cache smaller depending on your usage. For example, if your users only use one dataset from the reference data (e.g. just hg38), then you may not need such a large cache.
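Because the repository list and the cache size are the two values most likely to differ between sites, the file can be generated from parameters rather than edited by hand. This is a sketch under that assumption; render_default_local is a hypothetical helper name, and the remaining settings are copied unchanged from the file above.

```bash
# Hypothetical helper: print the contents of default.local for the given
# comma-separated repository list, cache quota (in MB) and cache directory.
render_default_local() {
    local repos="$1" quota_mb="$2" cache_base="$3"
    # Reject a non-numeric quota so a typo is caught before the file is used.
    case "$quota_mb" in
        ''|*[!0-9]*) echo "quota must be a whole number of MB" >&2; return 1 ;;
    esac
    cat <<EOF
CVMFS_REPOSITORIES="$repos"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="$quota_mb"
CVMFS_CACHE_BASE="$cache_base"
CVMFS_USE_GEOAPI=yes
EOF
}

# Preview the file for a 500 MB cache; nothing is written to disk here.
render_default_local "data.galaxyproject.org" 500 /srv/cvmfs/cache
```

The function only prints to standard output. To install the result, the output could be piped into sudo tee /etc/cvmfs/default.local.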
Create a /etc/cvmfs/domain.d/galaxyproject.org.conf file with the following contents:
CVMFS_SERVER_URL="http://cvmfs1-psu0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-iu0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-tacc0.galaxyproject.org/cvmfs/@fqrn@;http://cvmfs1-mel0.gvl.org.au/cvmfs/@fqrn@;http://cvmfs1-ufr0.galaxyproject.eu/cvmfs/@fqrn@"
This is a list of the available stratum 1 servers that have this repo.
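The value above is a semicolon-separated list in which every entry follows the same pattern, so it can be assembled from a plain list of host names. The sketch below illustrates that structure; build_server_url is a hypothetical helper, and the @fqrn@ placeholder (which CVMFS substitutes with the fully qualified repository name) is deliberately left literal.

```bash
# Hypothetical helper: join stratum 1 host names into a CVMFS_SERVER_URL
# line. "@fqrn@" is kept as a literal placeholder for CVMFS to expand.
build_server_url() {
    local url="" host
    for host in "$@"; do
        # Separate entries with ';' but avoid a leading separator.
        url="${url:+$url;}http://${host}/cvmfs/@fqrn@"
    done
    printf 'CVMFS_SERVER_URL="%s"\n' "$url"
}

# Reproduce the line used in galaxyproject.org.conf above.
build_server_url \
    cvmfs1-psu0.galaxyproject.org \
    cvmfs1-iu0.galaxyproject.org \
    cvmfs1-tacc0.galaxyproject.org \
    cvmfs1-mel0.gvl.org.au \
    cvmfs1-ufr0.galaxyproject.eu
```

Keeping the hosts as separate arguments makes it easier to add or drop a stratum 1 server without hand-editing one very long line.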
Create the key directory:
sudo mkdir -p /etc/cvmfs/keys/galaxyproject.org
Create a /etc/cvmfs/keys/galaxyproject.org/data.galaxyproject.org.pub file with the following contents:
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5LHQuKWzcX5iBbCGsXGt
6CRi9+a9cKZG4UlX/lJukEJ+3dSxVDWJs88PSdLk+E25494oU56hB8YeVq+W8AQE
3LWx2K2ruRjEAI2o8sRgs/IbafjZ7cBuERzqj3Tn5qUIBFoKUMWMSIiWTQe2Sfnj
GzfDoswr5TTk7aH/FIXUjLnLGGCOzPtUC244IhHARzu86bWYxQJUw0/kZl5wVGcH
maSgr39h1xPst0Vx1keJ95AH0wqxPbCcyBGtF1L6HQlLidmoIDqcCQpLsGJJEoOs
NVNhhcb66OJHah5ppI1N3cZehdaKyr1XcF9eedwLFTvuiwTn6qMmttT/tHX7rcxT
owIDAQAB
-----END PUBLIC KEY-----
Create a /etc/cvmfs/keys/galaxyproject.org/galaxyproject.org.pub file with the following contents:
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAuJZTWTY3/dBfspFKifv8
TWuuT2Zzoo1cAskKpKu5gsUAyDFbZfYBEy91qbLPC3TuUm2zdPNsjCQbbq1Liufk
uNPZJ8Ubn5PR6kndwrdD13NVHZpXVml1+ooTSF5CL3x/KUkYiyRz94sAr9trVoSx
THW2buV7ADUYivX7ofCvBu5T6YngbPZNIxDB4mh7cEal/UDtxV683A/5RL4wIYvt
S5SVemmu6Yb8GkGwLGmMVLYXutuaHdMFyKzWm+qFlG5JRz4okUWERvtJ2QAJPOzL
mAG1ceyBFowj/r3iJTa+Jcif2uAmZxg+cHkZG5KzATykF82UH1ojUzREMMDcPJi2
dQIDAQAB
-----END PUBLIC KEY-----
The BRC Analytics and VGP repositories use this common Galaxy Project key. The singularity.galaxyproject.org repository also uses it, and the data repository is expected to transition to it eventually.
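Key files are easy to damage when copying and pasting, for example by losing the first or last line. The following is a rough structural check only, written as a sketch; looks_like_pem_pubkey is a hypothetical helper and it does not verify the key cryptographically.

```bash
# Hypothetical helper: rough structural check of a PEM public key file.
# It only inspects the first and last lines; it does not validate the key.
looks_like_pem_pubkey() {
    local file="$1"
    [ -r "$file" ] || return 1
    [ "$(head -n 1 "$file")" = "-----BEGIN PUBLIC KEY-----" ] || return 1
    [ "$(tail -n 1 "$file")" = "-----END PUBLIC KEY-----" ] || return 1
}
```

For example, looks_like_pem_pubkey /etc/cvmfs/keys/galaxyproject.org/data.galaxyproject.org.pub should succeed for the file created above. Note that a trailing blank line in the file would make this simple check fail.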
Make a directory for the cache files:
sudo mkdir -p /srv/cvmfs

Testing it out

Probe the connection.

Hands On: Testing it out
Run sudo cvmfs_config probe data.galaxyproject.org
Question

What does it output?
OK
If this doesn’t return OK then you may need to restart autofs: sudo systemctl restart autofs
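The probe-then-restart advice above can be wrapped into a single retry. This is a sketch, not an official tool: probe_with_retry is a hypothetical function, and the PROBE_CMD and RESTART_CMD variables are illustrative override points that exist only so the logic can be tried without root access.

```bash
# Hypothetical wrapper: probe a repository and, if that fails, restart
# autofs once and probe again. PROBE_CMD and RESTART_CMD are illustrative
# overrides; by default the commands from the steps above are used.
probe_with_retry() {
    local repo="$1"
    local probe="${PROBE_CMD:-sudo cvmfs_config probe}"
    local restart="${RESTART_CMD:-sudo systemctl restart autofs}"
    # The command strings are intentionally unquoted so they split into words.
    if $probe "$repo"; then
        return 0
    fi
    echo "probe failed; restarting autofs and retrying" >&2
    $restart || return 1
    $probe "$repo"
}
```

On a configured machine this would be called as probe_with_retry data.galaxyproject.org. The function's exit status is that of the last probe attempt.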
Change directory into /cvmfs/ and list the files in that folder

Question

What do you see?

You should see nothing, as CVMFS uses autofs in order to mount paths only upon request.
Change directory into /cvmfs/data.galaxyproject.org/.
Code In: Bash
cd /cvmfs/data.galaxyproject.org/
ls
ls byhand
ls managed
Question

What do you see now?

You’ll see .loc files, genomes and indices. autofs only mounts the repository when it is accessed, which is why /cvmfs/ appeared empty in the previous step.
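To get a quick overview of the .loc files without browsing by hand, the listing can be scripted. The sketch below assumes a find that supports -maxdepth (GNU and BSD find do); list_loc_files is a hypothetical helper, and the depth limit is there because walking a large repository can be slow.

```bash
# Hypothetical helper: list .loc files under a directory, sorted by path.
# Accessing the path is what causes autofs to mount the repository.
list_loc_files() {
    local root="$1" depth="${2:-2}"
    [ -d "$root" ] || { echo "not a directory: $root" >&2; return 1; }
    # Limit the depth so a very large tree is not walked in full.
    find "$root" -maxdepth "$depth" -type f -name '*.loc' | sort
}
```

For example, list_loc_files /cvmfs/data.galaxyproject.org/byhand 3 would search three levels below the byhand directory. The default depth of 2 is an arbitrary choice for illustration.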

And just like that we all have access to all the reference genomes and associated tool indices thanks to the Galaxy Project, IDC, and Nate’s hard work!

If you are developing a new tool and want to add a reference genome, we recommend you talk to us on Gitter. You can also look at one of the tools that uses reference data and copy its approach. If you are creating entirely new location files, you will need to write a data manager.

Look at the repository

To configure Galaxy to use the CVMFS reference data we have just mounted, see the Ansible-based Reference Data with CVMFS tutorial.

Citing this Tutorial

Simon Gladman, Helena Rasche, Reference Data with CVMFS without Ansible (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs-manual/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{admin-cvmfs-manual,
	author = "Simon Gladman and Helena Rasche",
	title = "Reference Data with CVMFS without Ansible (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs-manual/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

The University of Melbourne

Melbourne Bioinformatics

Australian BioCommons
