Phylodiversity analysis quick tutorial

Overview
Creative Commons License: CC-BY

Questions:
  • How to use the phylodiversity workflow?

  • How to construct phyloregions from species occurrence data, phylogenetic data and geographic data?

Objectives:
  • Learn how to use the phylodiversity workflow

  • Compute an endemism index

  • Create a phyloregion map

Time estimation: 2 hours
Published: Jun 6, 2025
Last modification: Jun 6, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

This tutorial guides you through the Phylodiversity Galaxy workflow and shows how to compute phylodiversity and create phyloregions from phylogeny, occupancy and spatial files.

It explains the inputs, the workflow steps and the outputs in detail, using a practical use case extracted from southern sea actinopterygian populations.

The primary goal of this workflow is to compute phylodiversity indices and identify phyloregions. The project’s objective is to offer accessible, reproducible and transparent solutions for analysing phylodiversity.

This workflow is composed of four tools:

  • PhylOccuMatcher
  • CRSConverter
  • PhyloIndex
  • EstimEndem

In this tutorial, we assume that your data are correctly formatted.

Agenda

In this tutorial, we will cover:

  1. Before starting
    1. Phylogenetic tree file
    2. Occupancy file
    3. Shapefile
    4. Get data
  2. Data formatting
  3. Phylodiversity Workflow
    1. Match your phylogeny and occupancy with PhylOccuMatcher
    2. Modify the projection with CRSconverter
    3. Compute phylodiversity index with PhyloIndex
    4. Estimate the endemism with EstimEndem
  4. Conclusion

Before starting

This part presents the types of data you need to run the phylodiversity workflow. These data will be downloaded in the next part of the tutorial.

Phylogenetic tree file

The first file needed for this workflow is the phylogenetic tree of your species of interest. In this example, it is a simplified phylogeny of the Actinopterygii. This file must be in Newick format.
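Before uploading, it can help to confirm that the tree file looks like Newick. The following is a minimal sketch, not part of the workflow: it only checks the final semicolon and balanced parentheses, and extracts tip labels from simple trees without quoted labels or comments. The function name `newick_tip_labels` and the example tree are made up for illustration.

```python
import re


def newick_tip_labels(newick: str) -> list[str]:
    """Return the tip labels of a simple Newick string.

    Assumes labels are unquoted and the tree contains no [comments].
    """
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    # Tip labels directly follow '(' or ','; labels after ')' are internal nodes.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", text)


# Hypothetical three-species tree, not the tutorial's phylogeny_test file.
example_tree = "((Species_a:1.0,Species_b:1.0):0.5,Species_c:1.5);"
print(newick_tip_labels(example_tree))
```

A dedicated phylogenetics library is the better choice for real parsing; this sketch is only meant to catch obviously malformed files early.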

Occupancy file

The second file is an occupancy file in which each line is a species record. The decimal separator must be “.” and the columns must be separated by tabs (“\t”). The file needs a column named “grids” containing the grid cell in which the species was observed, and the column holding the species names must be named “newscientificname”.

grids    newscientificname
-----    -----------------
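The column requirements above can be checked locally before upload. This is a minimal sketch using only the Python standard library; the function name `read_occupancy` and the sample rows are assumptions made for illustration and are not taken from grid_test.tabular.

```python
import csv
import io

REQUIRED_COLUMNS = {"grids", "newscientificname"}


def read_occupancy(handle):
    """Read a tab-separated occupancy table and check the required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    rows = list(reader)
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        if not row["grids"] or not row["newscientificname"]:
            raise ValueError(f"empty grid cell or species name on line {line_number}")
    return rows


# Made-up content, for illustration only.
example = "grids\tnewscientificname\nv1\tSpecies_a\nv2\tSpecies_b\n"
rows = read_occupancy(io.StringIO(example))
print(len(rows), "records read")
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("grid_test.tabular", newline="")`.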

Shapefile

The last file is a spatial file in shapefile format. In Galaxy, this type of file must be uploaded as a composite file of type shp. A shapefile consists of at least three files with the same name and three different extensions: .shp, .shx and .dbf. You can optionally include more files, such as the .prj file.
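As a quick local check before using the “Composite” upload, the sketch below lists which mandatory shapefile components are missing for a given base name. The helper name `missing_shapefile_parts` is hypothetical; the file names match those provided on Zenodo for this tutorial.

```python
from pathlib import Path

REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(filenames, stem):
    """Return the mandatory extensions not found for the given base name."""
    present = {
        Path(name).suffix.lower()
        for name in filenames
        if Path(name).stem == stem
    }
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in present]


files = ["shapefile.dbf", "shapefile.prj", "shapefile.shx", "shapefile.shp"]
print(missing_shapefile_parts(files, "shapefile"))
```

An empty list means the three mandatory files are present; the optional .prj file is ignored by the check.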

Get data

Hands On: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> ecology -> Phylodiversity analysis quick tutorial):

    For the tabular and newick datafiles

    https://zenodo.org/records/15601932/files/phylogeny_test
    https://zenodo.org/records/15601932/files/grid_test.tabular
    

    For the composite shp datafile (here you need to download each file locally, then upload the files from the “Composite” menu of the “Upload Files” tool, selecting the shp datatype)

    https://zenodo.org/records/15601932/files/shapefile.dbf
    https://zenodo.org/records/15601932/files/shapefile.prj
    https://zenodo.org/records/15601932/files/shapefile.shx
    https://zenodo.org/records/15601932/files/shapefile.shp
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Libraries (left panel)
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype of the phylogenetic file is newick (it is often detected automatically as json instead), that the occupancy file is tabular, and that the spatial file is a composite dataset of type shapefile

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • In Assign Datatype, select newick from the “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. It is also good practice to add a tag to each datafile, for example one corresponding to the taxon (here Actinopterygians) or to any other relevant information.

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Data formatting

The first step is to make sure your data are well formatted. If all your files are in the right format and contain the required columns as specified above, you can move forward.

An example of occupancy file:

Figure 1: Example occupancy datafile

Phylodiversity Workflow

Match your phylogeny and occupancy with PhylOccuMatcher

Hands On: run PhylOccuMatcher
  1. PhylOccuMatcher (Galaxy version 1.0+galaxy0) with the following parameters:
    • “Phylogeny file (Newick format)”: phylogeny_test (Input dataset)
    • “Occupancy data (Tabular format)”: grid_test.tabular (Input dataset)
    Comment: short description

    This tool is the simplest of the four: you normally do not need to change any parameter and only have to run it with your files as input.
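The tutorial does not describe the internals of PhylOccuMatcher. Conceptually, matching a phylogeny with occupancy data amounts to keeping the species found in both inputs. The sketch below illustrates that idea only and is an assumption, not the tool's implementation; the function name `match_species` and the species names are hypothetical.

```python
def match_species(tip_labels, occupancy_rows):
    """Keep only the species present in both the tree and the occupancy table.

    Returns the shared species, the filtered occupancy rows, and the species
    found only in the tree or only in the occupancy table.
    """
    tips = set(tip_labels)
    observed = {row["newscientificname"] for row in occupancy_rows}
    shared = tips & observed
    kept_rows = [row for row in occupancy_rows if row["newscientificname"] in shared]
    return sorted(shared), kept_rows, sorted(tips - observed), sorted(observed - tips)


# Made-up inputs, for illustration only.
tips = ["Species_a", "Species_b", "Species_c"]
occupancy = [
    {"grids": "v1", "newscientificname": "Species_a"},
    {"grids": "v2", "newscientificname": "Species_b"},
    {"grids": "v2", "newscientificname": "Species_d"},
]
shared, kept, only_in_tree, only_in_table = match_species(tips, occupancy)
print("shared:", shared)
print("only in tree:", only_in_tree)
print("only in table:", only_in_table)
```

Listing the unmatched names on both sides is a convenient way to spot spelling differences between the tree and the table before running the workflow.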

Modify the projection with CRSconverter

Hands On: run CRSConverter
  1. CRSconverter (Galaxy version 1.1+galaxy0) with the following parameters:
    • “shapefile”: composite_dataset (Input dataset)
Warning: Pay attention to output format

This tool provides several output formats, but only the shapefile format can be used in the workflow. The other output formats are graphical representations meant for visualization. If you want them, you can rerun this tool outside of the workflow with the same input and options.

Warning: Pay attention to the tool version

For the workflow to work, you need to use CRSConverter 1.1, not 1.0. Make sure this is the case, because with version 1.0 the workflow will crash during the last step.

Comment: short description

The main purpose of this tool is to modify the projection of your shapefile. To do so, select the parameters you need in the advanced options before running the tool.

Compute phylodiversity index with PhyloIndex

Hands On: run PhyloIndex
  1. PhyloIndex (Galaxy version 1.0+galaxy0) with the following parameters:
    • “Phylogeny file (Newick format)”: Phylogeny with occupancy data (output of PhylOccuMatcher tool)
    • “Occupancy data (Tabular format)”: Matched output data (output of PhylOccuMatcher tool)
    Comment: short description

    This tool computes phylodiversity indices. It includes some randomness, so for reproducibility you need to set a random seed. You also need to choose the model you want among three options:

      • “tipshuffle”: shuffles tip labels multiple times.
      • “rowwise”: shuffles sites (i.e., varying richness) while keeping species occurrence frequency constant.
      • “colwise”: shuffles species occurrence frequency while keeping site richness constant.

    The default is the tipshuffle method.
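To illustrate why the random seed matters, here is a minimal sketch of a single tip-label shuffle, the idea behind the “tipshuffle” option. It is not the PhyloIndex implementation, and it does not cover the “rowwise” and “colwise” options; the function name `tip_shuffle` and the labels are assumptions for illustration.

```python
import random


def tip_shuffle(tip_labels, seed):
    """Return a mapping from each original tip label to a randomly reassigned one.

    A dedicated generator seeded with `seed` makes the shuffle repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(tip_labels)
    rng.shuffle(shuffled)
    return dict(zip(tip_labels, shuffled))


# Made-up labels, for illustration only.
labels = ["Species_a", "Species_b", "Species_c", "Species_d"]
first_run = tip_shuffle(labels, seed=42)
second_run = tip_shuffle(labels, seed=42)
print(first_run == second_run)  # the same seed gives the same shuffle
```

Running the tool twice with the same seed should likewise give identical randomizations, which is what makes the results reproducible.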

Estimate the endemism with EstimEndem

Hands On: run EstimEndem
  1. EstimEndem (Galaxy version 0.1.0+galaxy0) with the following parameters:
    • “Phylogeny file (Newick format)”: Phylogeny with occupancy data (output of PhylOccuMatcher tool)
    • “Occupancy data (Tabular format)”: Matched output data (output of PhylOccuMatcher tool)
    • “input_shapefile”: shapefile (output of CRSconverter tool)
    Comment: short description

    The output of this tool is a shapefile in which the grid cells are clustered according to endemism. You have to choose the number of clusters and the clustering method you want.

    Comment: More tips and info

    If you have no idea how many clusters you want, the tool starts by estimating the optimal number of clusters between 0 and 30. You can therefore first run the tool with the default values and check the standard output for the recommended number. However, keep in mind that this estimate is purely statistical and does not always have a biological justification.
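The tutorial does not state which endemism formula EstimEndem applies. As a point of reference only, the sketch below computes weighted endemism, a common non-phylogenetic index in which each species contributes the inverse of the number of grid cells it occupies to every cell where it occurs. The function name `weighted_endemism` and the records are made up for illustration and may differ from what the tool computes.

```python
from collections import defaultdict


def weighted_endemism(records):
    """Compute weighted endemism per grid cell.

    `records` is an iterable of (grid_cell, species) pairs. Each species adds
    1 / (number of cells it occupies) to the score of every cell it occurs in.
    """
    cells_per_species = defaultdict(set)
    for cell, species in records:
        cells_per_species[species].add(cell)
    scores = defaultdict(float)
    for cells in cells_per_species.values():
        for cell in cells:
            scores[cell] += 1.0 / len(cells)
    return dict(scores)


# Made-up records: Species_a occupies two cells, Species_b only one.
records = [("v1", "Species_a"), ("v2", "Species_a"), ("v1", "Species_b")]
scores = weighted_endemism(records)
print(scores["v1"], scores["v2"])
```

Cells that host narrowly distributed species score higher, which is the kind of signal an endemism-based clustering relies on.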

Conclusion

Congratulations on successfully completing the Phylodiversity workflow. This is the end of this quick tutorial. Don’t hesitate to contact us if you have any questions or ideas for improving this workflow.