Automating Galaxy workflows using the command line

Overview
Questions:
  • How can I schedule and run tens or hundreds of Galaxy workflows easily?

  • How can I automate my analyses when large amounts of data are being produced daily?

Objectives:
  • Learn to use the planemo run subcommand to run workflows from the command line.

  • Be able to write simple shell scripts for running multiple workflows concurrently or sequentially.

  • Learn how to use Pangolin to assign annotated variants to lineages.

Time estimation: 2 hours
Published: Jun 8, 2021
Last modification: Nov 21, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00162
Rating: 5.0 (0 recent ratings, 1 all time)
Revision: 10

Galaxy is well-known as a web-based data analysis platform which provides a graphical interface for executing common bioinformatics tools in a reproducible manner. However, Galaxy is not just a user-friendly interface for executing one tool at a time. It provides two very useful features which allow scaling data analyses up to a high-throughput level: dataset collections and workflows.

  • Dataset collections represent groups of similar datasets. This doesn’t sound especially exciting, but it gets interesting when a tool is executed using a collection as input, where a single dataset would normally be created. In this case, Galaxy creates a new job for every dataset contained within the collection, and stores the tool outputs in a new collection (or collections, if there are multiple outputs) with the same size and structure as the input. This process is referred to as ‘mapping over’ the input collection.

  • Workflows are pipelines made up of multiple Galaxy tools executed in sequence. When a workflow is executed, Galaxy schedules all the jobs which need to be run to produce the workflow outputs. Each job remains scheduled (represented by the familiar grey color in the Galaxy history) until its required inputs become available; as upstream jobs complete and their outputs appear, the next set of jobs can begin.

Between them, collections and workflows make it possible to scale up from running a single tool on a single dataset to running multiple tools on multiple datasets.

Why use the command line?

All the functionality described so far is available through the graphical interface. So why use the command line? Let’s consider a couple of scenarios:

  1. You want to run some molecular dynamics simulations to perform free energy calculations for protein-ligand binding. You have a compound library of 1000 ligands and would like to run an ensemble of 10 simulations for each. You would like to have the analysis for each ligand in a separate history, with the ensembles represented as dataset collections, which means you need to invoke your workflow 1000 times - theoretically possible to do in the graphical interface, but you probably want to avoid it if possible.

  2. You are conducting research on a virus which is responsible for a deadly pandemic. New genomic data from the virus is being produced constantly, and you have a variant calling workflow which will tell you if a particular sample contains new mutations. You would like to run this workflow as soon as the data appears - every day perhaps, or even every hour. This will be quite tough if you have to click a button each time yourself.

If you are encountering problems like these in your research with Galaxy, this tutorial is for you! We will explain how to trigger workflow execution via the command line using Planemo, and provide some tips on how to write scripts to automate the process. The result will be a bot which takes care of the whole analysis process for you.

Agenda
  1. Why use the command line?
  2. A short guide to Planemo
  3. A quick tutorial with an existing workflow
    1. Get workflows and data
    2. Get the workflow and prepare the job file
    3. Running the workflow
  4. A longer tutorial: from scratch to the CLI
    1. Get the toy data
    2. Creating a new workflow
    3. Create a job parameter template
    4. Running the workflow on a remote external Galaxy instance
    5. Going further
  5. Automated runs of a workflow for SARS-CoV-2 lineage assignment
    1. Scientific background
    2. Setting up the bot
    3. More advanced solutions
  6. Conclusion

A short guide to Planemo

The main tool we will use in this tutorial is Planemo, a command-line tool with a very wide range of functionality. If you have ever developed a Galaxy tool, you have probably encountered the planemo test, planemo serve and planemo lint subcommands. In this tutorial we will be using a different subcommand: planemo run.

Comment: Note

Planemo provides a more detailed tutorial on the planemo run functionality here. The pages on ‘Best Practices for Maintaining Galaxy Workflows’ and ‘Test Format’ also contain a lot of useful information.

For the purposes of this tutorial, we assume you have a recent version of Planemo (0.74.4 or later) installed in a virtual environment. If you don’t, please follow the installation instructions.
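
If you still need to set Planemo up, the following is a minimal sketch of one common way to install it into a fresh virtual environment (assuming Python 3 and pip are available; the official installation instructions linked above remain the reference):

# create and activate a virtual environment, then install Planemo from PyPI
python3 -m venv planemo_venv
. planemo_venv/bin/activate
pip install "planemo>=0.74.4"
planemo --version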

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial.

A quick tutorial with an existing workflow

Get workflows and data

Workflows and data for this tutorial are hosted on GitHub.

Hands On: Download workflow and data

Download the workflows and data for this tutorial using git clone.

Code In: git clone
git clone https://github.com/usegalaxy-eu/workflow-automation-tutorial.git

Next, step into the cloned folder and take a look around.

Code In
cd workflow-automation-tutorial
ls
Code Out: Folder contents
example  LICENSE  pangolin  README.md

Of the two subfolders, example/ contains just a toy workflow, which will be used in the following section to guide you through the basics of running workflows from the command line.

The pangolin/ folder holds the workflow and other material you will use in the second part of the tutorial, in which you will be setting up an automated system for assigning batches of SARS-CoV-2 variant data to viral lineages.

Get the workflow and prepare the job file

For this section, we will use a very simple workflow consisting of two text manipulation tools chained together.

Hands On: Step into and explore the example folder
Code In
cd example
ls
Code Out: Folder contents
tutorial.ga

The tutorial.ga file defines the workflow in JSON format; if we are confident we have a functional workflow, we don’t need to worry about its contents or modify it. However, we need a second file, the so-called ‘job file’, which specifies the particular dataset and parameter inputs which should be used to execute the workflow. We can create a template for this file using the planemo workflow_job_init subcommand.

Hands On: Creating the job file
  1. Run the planemo workflow_job_init subcommand.

    Code In: workflow_job_init
    planemo workflow_job_init tutorial.ga -o tutorial-init-job.yml
    # Now let's view the contents
    cat tutorial-init-job.yml
    

    The planemo workflow_job_init command identifies the inputs of the workflow provided and creates a template job file with placeholder values for each.

    Code Out: File contents
    Dataset 1:
      class: File
      path: todo_test_data_path.ext
    Dataset 2:
      class: File
      path: todo_test_data_path.ext
    Number of lines: todo_param_value
    

    The job file contains three inputs: two dataset inputs and one integer (parameter input).

  2. Create two files which can be used as inputs:

    Code In: Creating the input files
    printf "hello\nworld" > dataset1.txt
    printf "hello\nuniverse!" > dataset2.txt
    ls
    
    Code Out
    dataset1.txt  dataset2.txt  tutorial.ga  tutorial-init-job.yml
    
  3. Replace the placeholder values in the job file, so that it looks like the following:

    Dataset 1:
      class: File
      path: dataset1.txt
    Dataset 2:
      class: File
      path: dataset2.txt
    Number of lines: 3
    

Now we are ready to execute the workflow with our chosen parameters!

Running the workflow

Now that we have a simple workflow, we can run it using planemo run. At this point you need to choose a Galaxy server on which you want the workflow to run. One of the big public servers would be a possible choice. You could also use a local Galaxy instance. Either way, once you’ve chosen a server, the next step is to get your API key.

  1. In your browser, open your Galaxy homepage
  2. Log in, or register a new account, if it’s the first time you’re logging in
  3. Go to User -> Preferences in the top menu bar, then click on Manage API key
  4. If there is no current API key available, click on Create a new key to generate it
  5. Copy your API key to somewhere convenient, you will need it throughout this tutorial
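
As a purely optional convenience, you can keep the server URL and API key in shell variables for the rest of your session, so you don't have to paste them into every command (a sketch; the variable names GALAXY_URL and GALAXY_API_KEY are just examples, and the rest of this tutorial keeps writing the <SERVER_URL> and <YOUR_API_KEY> placeholders in full):

# store the server URL and API key for the current shell session (example variable names)
export GALAXY_URL="https://usegalaxy.eu"      # replace with your chosen server
export GALAXY_API_KEY="<YOUR_API_KEY>"        # replace with the key you just copied
# later commands can then use: --galaxy_url "$GALAXY_URL" --galaxy_user_key "$GALAXY_API_KEY"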

Hands On: Running our workflow
  1. Run the planemo run subcommand.

    Code In: planemo run
    planemo run tutorial.ga tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF" --tags "planemo-tutorial"
    
  2. Switch to your web browser - you should see that a new history has been created with the chosen name and tag.
    One potential disadvantage of the previous command is that it waits until the invoked workflow has fully completed. For our very small example, this doesn’t matter, but for a workflow which takes hours or days to finish, it might be undesirable. Fortunately, planemo run provides a --no_wait flag which exits as soon as the workflow has been successfully scheduled.

  3. Run the planemo run subcommand with the --no_wait flag.

    Code In: planemo run
    planemo run tutorial.ga tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF with no_wait" --tags "planemo-tutorial" --no_wait
    

    This time you should see that the planemo run command exits as soon as the two datasets have been uploaded and the workflow has been scheduled.

A longer tutorial: from scratch to the CLI

Get the toy data

Hands On: Download some data

For this part of the tutorial, we will use the M. tuberculosis bioinformatics training dataset. Download the data using curl or wget.

Code In: Bash
wget https://zenodo.org/record/3960260/files/018-1_1.fastq.gz
wget https://zenodo.org/record/3960260/files/018-1_2.fastq.gz

Creating a new workflow

Hands On: Find a Galaxy instance

Log in to your favorite Galaxy instance (e.g. https://usegalaxy.eu/ or https://usegalaxy.fr/).

Hands On: Create the workflow in Galaxy
  1. Go to the workflow panel galaxy-workflows-activity
  2. Create a new workflow: Fastq cleaning and check

    1. Click Workflow on the top bar
    2. Click the new workflow galaxy-wf-new button
    3. Give it a clear and memorable name
    4. Clicking Save will take you directly into the workflow editor for that workflow
    5. Need more help? Please see the How to make a workflow subsection here

  3. Add from Inputs an Input Dataset Collection
    • Label: Fastq Inputs
    • Format(s): fastqsanger.gz
  4. Add Falco ( Galaxy version 1.2.4+galaxy0)
  5. Add fastp ( Galaxy version 1.0.1+galaxy3)
  6. Connect the Fastq Inputs to Falco and fastp
  7. Add MultiQC ( Galaxy version 1.27+galaxy4)
    • 1: Results -> Which tool was used to generate logs?: FastQC
    • + Insert Results
    • 2: Results -> Which tool was used to generate logs?: fastp
  8. Connect:
    • Falco / text_file (txt) to MultiQC Result 1
    • fastp / report_json (json) to MultiQC Result 2
  9. Select the desired outputs:
    • In fastp, tick out1 (input)
    • In MultiQC, tick html_report (html)
Figure 1: Preview of our workflow, with tools shown as boxes and connections drawn as noodles

Create a job parameter template

Comment

The following steps only need to be carried out once, when you first set up your new workflow; there is no need to repeat them for every update, for example.

Hands On: Download your workflow
  1. Click on galaxy-workflows-activity Workflows in the Galaxy activity bar (on the left side of the screen, or in the top menu bar of older Galaxy instances). You will see a list of all your workflows
  2. Click on the dropdown button of the workflow you would like to download
  3. Click download to save the workflow (.ga) file

Hands On: Generate a job parameter template

We will use Planemo to generate a template for our job parameters before adapting it.

There are many ways to install Planemo: https://planemo.readthedocs.io/en/latest/installation.html

An easy way to use Planemo without “installing” it for this tutorial is the Docker image provided by BioContainers:

Code In: Bash
docker run --rm -v $PWD:/data/ -w /data quay.io/biocontainers/planemo:0.75.31--pyhdfd78af_0 planemo --version
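
If you go the Docker route, you can run the Planemo commands used in the rest of this section by prefixing them with the same docker run invocation, for example (a sketch reusing the image above):

docker run --rm -v $PWD:/data -w /data quay.io/biocontainers/planemo:0.75.31--pyhdfd78af_0 \
    planemo workflow_job_init Galaxy-Workflow-Fastq_cleaning_and_check.ga -o Galaxy-Workflow-Fastq_cleaning_and_check-job.yml
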
Code In: Bash
planemo workflow_job_init Galaxy-Workflow-Fastq_cleaning_and_check.ga -o Galaxy-Workflow-Fastq_cleaning_and_check-job.yml

cat Galaxy-Workflow-Fastq_cleaning_and_check-job.yml
Code Out: Galaxy-Workflow-Fastq_cleaning_and_check-job.yml
Fastq Inputs:
  class: Collection
  collection_type: list
  elements:
  - class: File
    identifier: todo_element_name
    path: todo_test_data_path.ext
Hands On: Adapt the job parameter file

Modify Galaxy-Workflow-Fastq_cleaning_and_check-job.yml to look like the following:

Fastq Inputs:
  class: Collection
  collection_type: list
  elements:
  - class: File
    identifier: 018-1_1
    path: 018-1_1.fastq.gz
  - class: File
    identifier: 018-1_2
    path: 018-1_2.fastq.gz
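
With only two files this is easily done by hand; for larger runs, a small shell loop can print the elements entries for you to paste into the job file (a sketch, assuming the FASTQ files sit in the current directory and follow the *.fastq.gz naming used above):

# print one collection element per FASTQ file, ready to paste under `elements:`
for f in *.fastq.gz; do
    printf '  - class: File\n    identifier: %s\n    path: %s\n' "${f%.fastq.gz}" "$f"
done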

Running the workflow on a remote external Galaxy instance

Now that we have a simple workflow, we can run it using planemo run. At this point you need to choose a Galaxy server on which you want the workflow to run. One of the big public servers would be a possible choice. You could also use a local Galaxy instance. Either way, once you’ve chosen a server, the next step is to get your API key.

  1. In your browser, open your Galaxy homepage
  2. Log in, or register a new account, if it’s the first time you’re logging in
  3. Go to User -> Preferences in the top menu bar, then click on Manage API key
  4. If there is no current API key available, click on Create a new key to generate it
  5. Copy your API key to somewhere convenient, you will need it throughout this tutorial

Hands On: Running our workflow

You can provide either the workflow file (.ga) or the ID of a workflow that already exists on the Galaxy instance. Let’s keep it simple and resource-friendly by using a workflow ID.

  1. Get the ID of your workflow just created:

    1. Go to the galaxy-workflows-activity Workflows panel in Galaxy and find the workflow you’ve just created.
    2. Click on galaxy-pencil Edit or history-share Share
    3. The URL in your browser will look something like https://usegalaxy.eu/workflow/editor?id=34d18f081b73cb15. Copy the part after ?id= - this is the workflow ID, in our case: 34d18f081b73cb15

  2. Run the planemo run subcommand.

    Code In: planemo run
    planemo run <WORKFLOW_ID> Galaxy-Workflow-Fastq_cleaning_and_check-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --outdir . --download_outputs
    
    Comment: Checking within Galaxy

    You may want to check the ongoing process or the results within the Galaxy instance interface.

    One potential disadvantage of the previous command is that it waits until the invoked workflow has fully completed. For our very small example, this doesn’t matter, but for a workflow which takes hours or days to finish, it might be undesirable. Fortunately, planemo run provides a --no_wait flag which exits as soon as the workflow has been successfully scheduled.

  3. Enjoy your result files

    Contents of your directory:

    [...]
    018-1_1.fastq.gz
    018-1_2.fastq.gz
    Galaxy-Workflow-Fastq_cleaning_and_check-job.yml
    018-1_1.fastq_cleaned__b09ee414-9968-4e46-9415-535a5ddf2e0c.fastqsanger.gz
    018-1_2.fastq_cleaned__b61fd0af-6353-4222-a031-04af0b1ee993.fastqsanger.gz
    multiqc_report__82059f85-70ca-4e66-804a-beae024e108c.html
    
    Figure 2: MultiQC result webpage

Going further

Using Planemo profiles

Planemo provides a useful profile feature which can help simplify long commands. The idea is that flags which would otherwise need to be repeated in every invocation can be stored together as a named profile and reused. Let’s see how this works below.

Hands On: Creating and using Planemo profiles
  1. Create a Planemo profile with the following command:
    Code In: planemo profile_create
    planemo profile_create planemo-tutorial --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY>
    
    Code Out: Terminal
    Profile [planemo-tutorial] created.
    

    You can view and delete existing profiles using the profile_list and profile_delete subcommands.

  2. Now we can run our workflow yet again using the profile we have created:
    Code In: planemo run
    planemo run <WORKFLOW ID> tutorial-init-job.yml --profile planemo-tutorial --history_name "Test Planemo WF with profile" --tags "planemo-tutorial"
    

    This invokes the workflow with all the parameters specified in the profile planemo-tutorial.

Using Galaxy workflow and dataset IDs

If you execute the same workflow twice and then inspect your histories and workflows through the Galaxy web interface, you will see that a new copy of the workflow was created on the server for each invocation, and that both Dataset 1 and Dataset 2 were uploaded twice. This is undesirable - we are creating a lot of clutter, and the repeated uploads create unnecessary work for the Galaxy server.

Every object in Galaxy, including workflows, datasets and dataset collections, has a hexadecimal ID associated with it, which looks something like 6b15dfc0393f172c. Once the datasets and workflows we need have been uploaded to Galaxy once, we can use these IDs in our subsequent workflow invocations.
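
You can look these IDs up in the web interface, as described in the hands-on below, or query them directly from the Galaxy API, for example with curl (a sketch using the same placeholder conventions as before; piping through jq is optional and assumes it is installed):

# list your workflows with their names and hexadecimal IDs
curl -s "<SERVER_URL>/api/workflows?key=<YOUR_API_KEY>" | jq '.[] | {name, id}'

# list the datasets of one history (its ID can be taken from the history's URL or from /api/histories)
curl -s "<SERVER_URL>/api/histories/<HISTORY_ID>/contents?key=<YOUR_API_KEY>" | jq '.[] | {name, id}'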

Hands On: Running our workflow using dataset and workflow IDs
  1. Navigate to one of the histories to get dataset IDs for the input datasets. For each one:
    1. Click on the galaxy-info View details icon on the dataset in the history.
    2. Under the heading Dataset Information, find the row History Content API ID and copy the hexadecimal ID next to it.
  2. Modify tutorial-init-job.yml to look like the following:

    Dataset 1:
      class: File
      # path: dataset1.txt
      galaxy_id: <ID OF DATASET 1>
    Dataset 2:
      class: File
      # path: dataset2.txt
      galaxy_id: <ID OF DATASET 2>
    Number of lines: 3
    
  3. Now we need to get the workflow ID:

    1. Go to the galaxy-workflows-activity Workflows panel in Galaxy and find the workflow you’ve just created.
    2. Click on galaxy-pencil Edit or history-share Share
    3. The URL in your browser will look something like https://usegalaxy.eu/workflow/editor?id=34d18f081b73cb15. Copy the part after ?id= - this is the workflow ID, in our case: 34d18f081b73cb15

  4. Run the planemo run subcommand using the new workflow ID.

    Code In: planemo run
    planemo run <WORKFLOW ID> tutorial-init-job.yml --galaxy_url <SERVER_URL> --galaxy_user_key <YOUR_API_KEY> --history_name "Test Planemo WF with Planemo" --tags "planemo-tutorial" --no_wait
    

Ensuring Workflows meet Best Practices

When you are editing a workflow, there are a number of additional steps you can take to ensure that it is a Best Practice workflow and will be more reusable.

  1. Open a workflow for editing
  2. In the workflow menu bar, you’ll find the galaxy-wf-options Workflow Options dropdown menu.
  3. Click on it and select galaxy-wf-best-practices Best Practices from the dropdown menu.

    screenshot showing the best practices menu item in the gear dropdown.

  4. This will take you to a new side panel, which allows you to investigate and correct any issues with your workflow.

    screenshot showing the best practices side panel. several issues are raised like a missing annotation with a link to add that, and non-optional inputs that are unconnected. Additionally several items already have green checks like the workflow defining creator information and a license.

The Galaxy community also has a guide on best practices for maintaining workflows. This guide includes the best practices from the Galaxy workflow panel, plus:

  • adding tests to the workflow
  • publishing the workflow on GitHub, a public GitLab server, or another public version-controlled repository
  • registering the workflow with a workflow registry such as WorkflowHub or Dockstore
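
Planemo can also check a workflow against many of these best practices from the command line with its workflow_lint subcommand (a minimal sketch; requires a recent Planemo release):

planemo workflow_lint tutorial.ga    # or any other .ga workflow file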

Automated runs of a workflow for SARS-CoV-2 lineage assignment

It’s now time to apply your newly acquired knowledge of workflow execution with Planemo to a relevant scientific problem.

Scientific background

The SARS-CoV-2 pandemic has been accompanied by unprecedented world-wide sequencing efforts. One of the accepted goals behind sequencing hundreds of thousands of individual viral isolates is to monitor the evolution and spread of viral lineages in as close to real time as possible. Viral lineages are characterized by defining patterns of mutations that make them different from each other and from the original virus that started the pandemic at the beginning of 2020. Examples of viral lineages are B.1.1.7, first observed in the UK in the fall of 2020 and now termed variant of concern (VOC) alpha according to the WHO’s classification system, and B.1.617.2, first seen in India at the beginning of 2021 and now recognized as VOC delta.

Pangolin is a widely used tool for assigning newly sequenced viral isolates to established viral lineages, and in this final section of this tutorial you are going to run a workflow that:

  1. takes a collection of variant datasets in the variant call format VCF,

    where you can think of a collection as representing a batch of freshly sequenced viral isolates with each of its VCF datasets listing the nucleotide differences between one sample and the sequence of an original SARS-CoV-2 reference isolate

  2. reconstructs the viral genome sequence of each sample by incorporating its variants into the reference isolate’s sequence

  3. uses Pangolin to classify the resulting collection of genome sequences in FASTA format and to create a report of lineage assignments for all samples.

Just like in a real-world situation, you will receive VCF files for several batches of samples, and you will face the challenge of uploading each batch into Galaxy as a collection and of triggering a run of the workflow for each of them.

Setting up the bot

Unlike in the previous toy example, you will not get complete step-by-step instructions here; instead, you are encouraged to transfer what you learned in the first part to this new, more complex task yourself.

Every step along the way comes with solutions, which you can expand at any time, but you’re encouraged to give each problem some thought first.

As a very first step, however, let’s look at how the material for this part is arranged.

Hands On: Step into and explore the pangolin folder
Code In
cd ../pangolin
ls
Code Out: Folder contents
data  solutions  vcf2lineage.ga

The file vcf2lineage.ga defines the workflow just described, while the data/ folder holds the batches of VCF files we would, ultimately, like to run the workflow on.

Now, as a start, let’s get the workflow running on the first batch of files in the data/batch1/ subfolder.

Hands On: An initial workflow run
  1. Create a template job file for the vcf2lineage.ga workflow.

    We need to run workflow_job_init on the vcf2lineage.ga workflow.

    Code In: workflow_job_init
    planemo workflow_job_init vcf2lineage.ga -o vcf2lineage-job.yml
    # Now let's view the contents
    cat vcf2lineage-job.yml
    
    Code Out: File contents
    Reference genome:
      class: File
      path: todo_test_data_path.ext
    Variant calls:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: todo_element_name
        path: todo_test_data_path.ext
    min-AF for consensus variant: todo_param_value
    

    The job file contains three inputs: the Reference genome, a collection of VCF files (Variant calls) and a float parameter for the minimum allele frequency (min-AF for consensus variant). Later, we will need to specify each element of the collection under elements - currently there is just a single placeholder element.

  2. Replace the placeholder values: Reference genome should point to data/NC_045512.2_reference_sequence.fasta, Variant calls should contain all the VCF files in data/batch1, and min-AF for consensus variant should be set to 0.7.

    This is the trickiest part and you should consider writing a script to solve it. The entry in the job file should look something like this at the end:

    Variant calls:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: ERR5879218
        path: data/batch1/ERR5879218.vcf
      - class: File
        identifier: ERR5879219
        path: data/batch1/ERR5879219.vcf
      - class: File
        ...
    

    There should be one entry under elements for each of the 312 files under data/batch1.

    Adding the Reference genome dataset and min-AF for consensus variant parameter is straightforward. Modify the vcf2lineage-job.yml file and save it:

    Reference genome:
      class: File
      path: data/NC_045512.2_reference_sequence.fasta
    Variant calls:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: todo_element_name
        path: todo_test_data_path.ext
    min-AF for consensus variant: 0.7
    

    To add entries for every element of the Variant calling collection we should write a script. There are many possible solutions; here is an example:

    import sys
    from glob import glob
    from pathlib import Path
    
    import yaml
    
    # usage: python create_job_file.py <job file> <batch directory>
    job_file_path = sys.argv[1]
    batch_directory = sys.argv[2]
    
    # read the existing job file
    with open(job_file_path) as f:
        job = yaml.safe_load(f)
    
    # create one collection element per VCF file in the batch directory,
    # using the file name without its extension as the element identifier
    vcf_paths = glob(f'{batch_directory}/*.vcf')
    elements = [{'class': 'File', 'identifier': Path(vcf_path).stem, 'path': vcf_path} for vcf_path in vcf_paths]
    job['Variant calls']['elements'] = elements
    
    # write the updated job file back in place
    with open(job_file_path, 'w') as f:
        yaml.dump(job, f)
    

    Save this as create_job_file.py and run it with python create_job_file.py vcf2lineage-job.yml data/batch1. This has the effect of updating the vcf2lineage-job.yml with all the VCF files in the data/batch1 directory.

  3. Now that we have a complete job file, let’s run the workflow.
    Code In: planemo run
    planemo run vcf2lineage.ga vcf2lineage-job.yml --profile planemo-tutorial --history_name "vcf2lineage test"
    

    You should see the new invocation in the Galaxy interface.

We have now performed a test invocation of the vcf2lineage workflow. It was already more challenging than the first example; for the first time, we needed to resort to writing a script to achieve a task, in this case the construction of the job file.

The next step is to automate this process so we can run the workflow on each of the 10 batch*/ directories in the data/ folder. We can imagine that these are newly produced data released at regular intervals, which need to be analysed.

Hands On: Automating vcf2lineage execution
  1. If we want to execute the workflow multiple times, we will once again encounter the issue that the datasets and workflow will be reuploaded each time. To avoid this, let’s obtain the dataset ID for the Reference genome (which stays the same for each invocation) and the workflow ID for the vcf2lineage.ga workflow.

  2. Now let’s create a template job file vcf2lineage-job-template.yml which we can modify at each invocation as necessary. We can start with the output of workflow_job_init and add the Reference genome dataset ID and set min-AF for consensus variant to 0.7 again.

    We can start from the original output of workflow_job_init for the vcf2lineage.ga workflow (as produced in the previous hands-on) and adapt it.

    Code In: Original output of workflow_job_init
    Reference genome:
      class: File
      path: todo_test_data_path.ext
    Variant calls:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: todo_element_name
        path: todo_test_data_path.ext
    min-AF for consensus variant: todo_param_value
    
    Code Out: Prepared `vcf2lineage-job-template.yml` file contents
    Reference genome:
      class: File
      galaxy_id: '1234567890abcdef' # replace with the ID you got from your own server
    Variant calls:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: todo_element_name
        path: todo_test_data_path.ext
    min-AF for consensus variant: 0.7
    

    We can copy and modify this new vcf2lineage-job-template.yml file iteratively to invoke the workflow on each of the data batches.

  3. Write a shell script to iterate over all the batches, create a job file and invoke with planemo run. After execution, move the processed batch to data/complete.

    Once again, there is no single exact solution. Here is one possibility:

    mkdir -p data/complete  # make sure the directory for processed batches exists
    for batch in data/batch*; do
        batch_name=`basename $batch`
        # create a batch-specific job file from the template and fill in its VCF elements
        cp vcf2lineage-job-template.yml vcf2lineage-${batch_name}-job.yml
        python create_job_file.py vcf2lineage-${batch_name}-job.yml $batch
        # replace with your own workflow ID below
        planemo run f4b02af7e642e75b vcf2lineage-${batch_name}-job.yml --profile planemo-tutorial
        sleep 300  # wait 5 minutes before submitting the next batch
        # mark the batch as processed by moving it out of the way
        mv $batch data/complete/
    done
    

    Save this as run_vcf2lineage.sh.

  4. Run your script. Do you notice any issues? What could be improved?

    Uploads can cause several issues:

    • The default planemo behavior is to upload datasets one at a time, which ensures the server cannot be overloaded, but is slow. To upload all datasets simultaneously, you could use the --simultaneous_uploads flag.
    • If one of the upload jobs fails, the workflow will by default not be invoked - Planemo will just upload the datasets and finish. To override this behavior and start a workflow even if one or more uploads fail, you can use the --no_check_uploads_ok flag.
    • For a more complex solution to the issue of failed uploads, see this script from the ena-cog-uk-wfs repo.
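
    Putting the first two suggestions together, a single-batch invocation could look like this (a sketch; substitute your own workflow ID, job file and profile name):

    planemo run <WORKFLOW_ID> vcf2lineage-batch1-job.yml \
        --profile planemo-tutorial \
        --simultaneous_uploads \
        --no_check_uploads_ok \
        --no_wait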

More advanced solutions

This was a fairly basic example of automated workflow execution. Perhaps your own use case requires a more customized solution.

For example, it might be the case that you want to run multiple different workflows, one after another. In this case you would need to implement some sort of check to verify that one invocation has finished before beginning the next one. Planemo will probably not be enough for a task like this; you will need to resort to the lower-level BioBlend library to interact directly with the Galaxy API.
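
To give a flavour of what such a check could look like, here is a minimal sketch using BioBlend to poll a workflow invocation until it has been fully scheduled (or has failed) before moving on. It assumes bioblend is installed and that you fill in your own server URL, API key and invocation ID; note that a fully scheduled invocation does not necessarily mean all of its jobs have finished - for that you would additionally inspect the jobs in the target history.

import time

from bioblend.galaxy import GalaxyInstance

# connect to the Galaxy server with your API key (placeholders to be replaced)
gi = GalaxyInstance(url='<SERVER_URL>', key='<YOUR_API_KEY>')

invocation_id = '<INVOCATION_ID>'  # e.g. reported by planemo run, or listed via gi.invocations.get_invocations()

# poll once a minute until the invocation is fully scheduled, has failed or was cancelled
while True:
    state = gi.invocations.show_invocation(invocation_id)['state']
    if state in ('scheduled', 'failed', 'cancelled'):
        break
    time.sleep(60)

print(f'Invocation {invocation_id} is now in state: {state}')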

The Galaxy SARS-CoV-2 genome surveillance bot provides an example of a more advanced customized workflow execution solution, combining Planemo commands, custom BioBlend scripts and bash scripts, which then get run automatically via continuous integration (CI) on a Jenkins server.

Conclusion

You should now have a better idea about how to run Galaxy workflows from the command line and how to apply the ideas you have learnt to your own project.