Cleaning GBIF data for the use in Ecology

Author(s): Yvan Le Bras, Simon Benateau

Overview
Questions:

How can I get ecological data from GBIF?

How do I check and clean the data from GBIF?

Which ecoinformatics techniques are important to know for this type of data?

Objectives:

Get occurrence data on a species

Visualize the data to understand them

Clean GBIF dataset for further analyses

Requirements:

Introduction to Galaxy Analyses

Time estimation: 30 minutes

Supporting Materials:

Workflows

Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.fr ✅ ⭐️

UseGalaxy.cz ✅

UseGalaxy.no ✅

Published: Oct 28, 2022

Last modification: Jun 3, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00129

Revision: 6

GBIF (Global Biodiversity Information Facility, www.gbif.org) is a leading biodiversity data aggregator worldwide, giving access to more than 1 billion records across all taxonomic groups. The data provided via these sources are highly valuable for research. However, because the records are obtained from various collection methods and sources, the data are heterogeneous.

In this tutorial we will propose a way to clean occurrence records retrieved from GBIF.

This tutorial is based on the rOpenSci tutorial by Zizka.

Agenda

In this tutorial, we will cover:

Retrieve data from GBIF

Get data

Where do the records come from?

Filtering data based on the data origin

Have a look at the number of counts per record

Filtering data on individual counts

Have a look at the age of records

Filtering data based on the age of records

Taxonomic investigation

Filtering

Sub-step with OGR2ogr

Visualize your data on a GIS oriented visualization

Conclusion

Retrieve data from GBIF

Get data

Hands On: Data upload

Create a new history for this tutorial

To create a new history simply click the new-history icon at the top of the history panel:

Import the files from GBIF: Get species occurrences data tool with the following parameters:

param-file “Scientific name of the species”: write the scientific name of a species you are interested in, for example Loligo vulgaris

“Data source to get data from”: Global Biodiversity Information Facility : GBIF

“Number of records to return”: 999999 (use a value at least this large so that all available records are returned)

Comment

The spocc Galaxy tool lets you search for species occurrences across one or several data sources (GBIF, eBird, iNaturalist, EcoEngine, VertNet, BISON). Changing the number of records to return lets you retrieve either all occurrences or a limited number of them. Specifying more than one data source changes how the output dataset is formatted.
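To make the idea of "number of records to return" more concrete, here is a minimal Python sketch that only builds the paged query URLs one could send to the public GBIF occurrence search API. This is not what the Galaxy tool runs (the tool relies on spocc); the endpoint, the `scientificName`, `limit` and `offset` parameters, and the page size of 300 are assumptions taken from the public GBIF API, and `build_page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint (assumption: not taken from the tutorial).
GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"


def build_page_urls(scientific_name, total_records, page_size=300):
    """Return one search URL per page needed to cover total_records."""
    urls = []
    for offset in range(0, total_records, page_size):
        params = {
            "scientificName": scientific_name,
            # The last page may be shorter than the others.
            "limit": min(page_size, total_records - offset),
            "offset": offset,
        }
        urls.append(f"{GBIF_SEARCH_URL}?{urlencode(params)}")
    return urls


# Asking for 650 records requires three pages: 300 + 300 + 50.
urls = build_page_urls("Loligo vulgaris", total_records=650)
for url in urls:
    print(url)
```

No request is sent here; the sketch only shows why a large "number of records" translates into several pages on the server side.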

Check the datatype; it should be tabular

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click the Datatypes tab on the top

In Assign Datatype, select tabular from the “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Add tags galaxy-tags to the dataset

make them propagating tags (tags starting with #)

make a tag corresponding to the species (#LoligoVulgaris for example here)

and another tag mentioning the data source (#GBIF for example here).

Tagging datasets like this is good practice in Galaxy. It helps you (1) find content of particular interest (for example with the filtering option of the history search form) and (2) quickly see which dataset is associated with which content, notably thanks to the propagated tags.

Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

To tag a dataset:

Click on the dataset to expand it

Click on Add Tags galaxy-tags

Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).

Press Enter

Check that the tag appears below the dataset name

Tags beginning with # are special!

They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;

dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);

datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;

datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

More information is in a dedicated #nametag tutorial.
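The propagation rule described above can be illustrated with a toy Python model. This is only a sketch of the behaviour, not Galaxy's implementation: datasets are plain dictionaries, `derive` is a hypothetical helper, and the `draft` tag is invented to show a tag that does not start with #.

```python
def derive(name, inputs):
    """Create a dataset that inherits the name tags (#...) of its inputs."""
    inherited = set()
    for dataset in inputs:
        inherited |= {tag for tag in dataset["tags"] if tag.startswith("#")}
    return {"name": name, "tags": inherited}


# The uploaded dataset carries two name tags and one ordinary tag.
occurrences = {
    "name": "GBIF occurrences",
    "tags": {"#LoligoVulgaris", "#GBIF", "draft"},
}

# Each tool run produces a child dataset; a new name tag is added along the way.
filtered = derive("Filter on basisOfRecord", [occurrences])
filtered["tags"].add("#basisOfRecord")
by_year = derive("Filter on year", [filtered])

print(sorted(by_year["tags"]))
```

In this model the last dataset still carries the species and data source tags, plus the tag added at the intermediate step, while the ordinary `draft` tag stays on the original dataset only.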

Where do the records come from?

Here we investigate the content of the dataset, looking in particular at the “basisOfRecord” attribute, to learn more about the heterogeneity related to how the data were collected.

Hands On: "basisOfRecord" filtering

Count tool with the following parameters:

param-file “from dataset”: output (output of Get species occurrences data tool)

“Count occurrences of values in column(s)”: c[17]

Comment

This tool is one of the important “classical” Galaxy tools; it helps you summarize the information content of your data. Here we apply it to the 17th column (corresponding to the basisOfRecord attribute), but don’t hesitate to investigate other attributes!
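For readers who want to see what such a count amounts to, here is a minimal Python sketch. The table is toy data invented for illustration and has only two columns, so basisOfRecord sits in column 2 rather than 17; `count_values` is a hypothetical helper, and unlike the Galaxy tool it always skips one header line.

```python
import csv
import io
from collections import Counter

# Toy tab-separated data (invented), with a header line.
TOY_TABLE = (
    "name\tbasisOfRecord\n"
    "Loligo vulgaris\tHUMAN_OBSERVATION\n"
    "Loligo vulgaris\tPRESERVED_SPECIMEN\n"
    "Loligo vulgaris\tHUMAN_OBSERVATION\n"
    "Loligo vulgaris\tMACHINE_OBSERVATION\n"
    "Loligo vulgaris\tPRESERVED_SPECIMEN\n"
    "Loligo vulgaris\tHUMAN_OBSERVATION\n"
)


def count_values(text, column):
    """Count the values found in a 1-based column, skipping the header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line
    return Counter(row[column - 1] for row in reader)


counts = count_values(TOY_TABLE, column=2)
for value, n in counts.most_common():
    print(n, value)
```

The number of distinct keys in `counts` answers the first question below for the toy table; on real data the same idea applies to column 17.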

Question

How many different types of data collection origin are there?

What is your assumption regarding this heterogeneity?

5

Each basisOfRecord type corresponds to a different collection method, and therefore to a different data quality

Filtering data based on the data origin

Hands On: Filter data on basisOfRecord GBIF attribute

Filter tool with the following parameters:

param-file “Filter”: output (output of Get species occurrences data tool)

“With following condition”: c17=='HUMAN_OBSERVATION' or c17=='OBSERVATION' or c17=='PRESERVED_SPECIMEN'

“Number of header lines to skip”: 1

Question

How many records are kept and what is the percentage of filtered data?

Why are we keeping only these 3 types of data collection origin?

470 records are kept; 8.79% of the records were dropped

These data collection methods are the most relevant
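The filter condition and the "percentage of filtered data" can be sketched in Python as follows. The rows are toy data invented for illustration (two columns only), and `filter_rows` and `percent_dropped` are hypothetical helper names; the three kept values are the ones used in the condition above.

```python
# Data collection origins kept by the filter condition of this step.
KEPT_ORIGINS = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}


def filter_rows(rows, column, keep):
    """Keep the rows whose 1-based column holds one of the accepted values."""
    return [row for row in rows if row[column - 1] in keep]


def percent_dropped(n_before, n_after):
    """Percentage of rows removed by a filtering step, rounded to 2 digits."""
    return round(100 * (n_before - n_after) / n_before, 2)


# Toy rows (invented): record identifier, basisOfRecord.
rows = [
    ["r1", "HUMAN_OBSERVATION"],
    ["r2", "MACHINE_OBSERVATION"],
    ["r3", "PRESERVED_SPECIMEN"],
    ["r4", "FOSSIL_SPECIMEN"],
    ["r5", "OBSERVATION"],
]

kept = filter_rows(rows, column=2, keep=KEPT_ORIGINS)
print(len(kept), "records kept,", percent_dropped(len(rows), len(kept)), "% dropped")
```

The percentage is always computed against the number of records that entered the step, which is how the figures in the answers of this tutorial should be read.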

Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #basisOfRecord

Have a look at the number of counts per record

Here we look at the individual count attached to each record, to check whether some records may contain errors.

Hands On: Summary statistics of count

Summary Statistics tool with the following parameters:

param-file “Summary statistics on”: out_file1 (output of Filter tool)

“Column or expression”: c72

Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #individualCount

Question

What are the minimum and maximum individual counts?

From 1 to 100
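A minimal Python sketch of this kind of summary is shown below. It assumes that cells which cannot be read as numbers (empty or text values) are simply left out of the statistics; how the Galaxy tool treats such cells is not described here, so take this as an illustration only. The rows are invented and `numeric_column` is a hypothetical helper.

```python
def numeric_column(rows, column):
    """Return the values of a 1-based column that can be read as numbers."""
    values = []
    for row in rows:
        try:
            values.append(float(row[column - 1]))
        except (ValueError, IndexError):
            # Empty, non-numeric or missing cells are ignored.
            continue
    return values


# Toy rows (invented): record identifier, individualCount.
rows = [["a", "1"], ["b", ""], ["c", "100"], ["d", "NA"], ["e", "12"]]

counts = numeric_column(rows, column=2)
print("min:", min(counts), "max:", max(counts))
print("cells without a usable value:", len(rows) - len(counts))
```

Looking at how many cells have no usable value is worth doing before filtering, because it affects how many records survive the next step.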

Filtering data on individual counts

Hands On: Filter data on individualCount GBIF attribute

Filter tool with the following parameters:

param-file “Filter”: out_file1 (output of Filter tool)

“With following condition”: c72>0 and c72<99

“Number of header lines to skip”: 1

Question

How many records are kept and what is the percentage of filtered data?

How can you explain this result?

Which propagating tag could you add here?

50 records are kept; 89.29% of the records were dropped

A large share of the records were dropped because many records have no value in the individualCount field

As in the previous “count” step, you are working on the individualCount column, so you can add an #individualCount tag to the output dataset, for example
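The explanation above can be checked on a toy example. The sketch below applies the same bounds as the condition `c72>0 and c72<99` and separates records dropped because the value is missing from records dropped because the value is out of range. The values are invented and `classify` is a hypothetical helper.

```python
from collections import Counter


def classify(value, low=0, high=99):
    """Return 'kept', 'missing' or 'out_of_range' for one raw cell value."""
    try:
        count = float(value)
    except ValueError:
        # An empty cell cannot be compared to the bounds, so it is dropped.
        return "missing"
    return "kept" if low < count < high else "out_of_range"


# Toy individualCount cells (invented), as read from a tabular file.
values = ["1", "", "100", "3", "", "0", "98"]

outcome = Counter(classify(value) for value in values)
print(dict(outcome))
```

Note that with strict bounds a count of 100 is excluded as well, so the maximum seen in the summary statistics does not pass this filter.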

Have a look at the age of records

Hands On: Look at the age of records through the `year` GBIF attribute, to see whether there are old records that you may not want to consider.

Summary Statistics tool with the following parameters:

param-file “Summary statistics on”: out_file1 (output of Filter tool)

“Column or expression”: c41

Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #ageOfRecord

Question

What are the years of the oldest and most recent records?

Why might it be of interest to treat old and recent records differently?

From 1903 to 2018

We can assume that old records were not made in the same way as recent ones, so keeping them can increase the heterogeneity of the dataset.

Filtering data based on the age of records

Hands On: Filter data on ageOfRecord GBIF attribute

Filter tool with the following parameters:

param-file “Filter”: out_file1 (output of Filter tool)

“With following condition”: c41>1945

“Number of header lines to skip”: 1

Question

How many records are kept and what is the percentage of filtered data?

Why are we keeping only records made after 1945?

44 records are kept; 11.76% of the records were dropped

This arbitrary date keeps only fairly recent records, but you can specify another year.

Add a propagating tag corresponding to the filtering criterion to the output dataset, for example #ageOfRecord

Taxonomic investigation

Hands On: Investigate the taxonomic coverage, at the family level

Count tool with the following parameters:

param-file “from dataset”: out_file1 (output of Filter tool)

“Count occurrences of values in column(s)”: c[31]

Comment

This column lets us look at the families associated with the records. Normally, when looking at a single species, we obtain only one family.

Filtering

Hands On: Filter data on family attribute

Filter tool with the following parameters:

param-file “Filter”: out_file1 (output of Filter tool)

“With following condition”: c31=='Loliginidae'

“Number of header lines to skip”: 1

Comment

We here select only records with the family of interest, Loliginidae

Question

Is the filtering useful here?

Why can it be of interest to keep this step?

No, because 100% of the records are kept

Because checking the taxonomy is an important step in this kind of GBIF data treatment. If your goal is to create your own workflow to reuse on other species, it is worth keeping this step
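To illustrate why keeping every step pays off when the workflow is reused on another species, here is a hedged Python sketch that chains the four filters of this tutorial and reports how many records remain after each one. The records are toy dictionaries invented for illustration, `clean_occurrences` is a hypothetical function, and a missing value is represented by `None`.

```python
KEPT_ORIGINS = {"HUMAN_OBSERVATION", "OBSERVATION", "PRESERVED_SPECIMEN"}


def clean_occurrences(records, family, min_year=1945):
    """Apply the filters in order; return the kept records and a step report."""
    steps = [
        ("basisOfRecord", lambda r: r["basisOfRecord"] in KEPT_ORIGINS),
        ("individualCount", lambda r: r["individualCount"] is not None
            and 0 < r["individualCount"] < 99),
        ("year", lambda r: r["year"] is not None and r["year"] > min_year),
        ("family", lambda r: r["family"] == family),
    ]
    report = []
    for name, predicate in steps:
        records = [r for r in records if predicate(r)]
        report.append((name, len(records)))
    return records, report


def record(origin, count, year, family):
    """Build one toy occurrence record."""
    return {"basisOfRecord": origin, "individualCount": count,
            "year": year, "family": family}


# Toy records (invented); each dropped record fails a different step.
records = [
    record("HUMAN_OBSERVATION", 2, 2010, "Loliginidae"),
    record("MACHINE_OBSERVATION", 1, 2015, "Loliginidae"),
    record("PRESERVED_SPECIMEN", None, 1990, "Loliginidae"),
    record("PRESERVED_SPECIMEN", 5, 1903, "Loliginidae"),
    record("OBSERVATION", 1, 2001, "Ommastrephidae"),
    record("HUMAN_OBSERVATION", 10, 2018, "Loliginidae"),
]

kept, report = clean_occurrences(records, family="Loliginidae")
for step, remaining in report:
    print(step, remaining)
```

Changing only the `family` argument (and the input records) is enough to rerun the same chain for another taxon, which is the point of keeping a step that removes nothing for the current species.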

Sub-step with OGR2ogr

Hands On: Convert occurrence dataset to GIS one for visualization

OGR2ogr tool with the following parameters:

param-file “Gdal supported input file”: out_file1 (output of Filter tool)

“Conversion format”: GEOJSON

“Specify advanced parameters”: Yes, see full parameter list.

In “Add an input dataset open option”:

param-repeat “Insert Add an input dataset open option”

“Input dataset open option”: X_POSSIBLE_NAMES=longitude

param-repeat “Insert Add an input dataset open option”

“Input dataset open option”: Y_POSSIBLE_NAMES=latitude

Question

Do you have access to the standard output and error of the underlying tool?

What kind of information can you retrieve from the standard output and/or error?

Yes. A preview of stdout is visible when clicking on the output dataset in the history, and the full report is accessible through the information button, under stdout or stderr (here you can see warnings in stderr)

The stderr shows several warnings related to the automatic mapping of variable names from GBIF to OGR, plus information about the truncation of a particularly long GeoJSON value
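To give an idea of what the conversion produces, here is a minimal Python sketch that turns tabular rows with `longitude` and `latitude` columns into a GeoJSON FeatureCollection of points. It is not what OGR2ogr does internally; the rows are invented, `rows_to_geojson` is a hypothetical helper, and rows without usable coordinates are simply skipped in this sketch.

```python
import json


def rows_to_geojson(rows, x_name="longitude", y_name="latitude"):
    """Build a GeoJSON FeatureCollection of points from a list of dict rows."""
    features = []
    for row in rows:
        try:
            lon = float(row[x_name])
            lat = float(row[y_name])
        except (KeyError, TypeError, ValueError):
            # No usable coordinates: the row cannot become a point.
            continue
        features.append({
            "type": "Feature",
            # GeoJSON stores positions as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Toy rows (invented), as they could be read from a tabular file.
rows = [
    {"name": "Loligo vulgaris", "longitude": "-4.5", "latitude": "48.4"},
    {"name": "Loligo vulgaris", "longitude": "", "latitude": "43.1"},
    {"name": "Loligo vulgaris", "longitude": "3.1", "latitude": "42.5"},
]

collection = rows_to_geojson(rows)
print(json.dumps(collection, indent=2))
```

The two open options used in the hands-on play the role of `x_name` and `y_name` here: they tell the converter which columns hold the coordinates.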

Visualize your data on a GIS oriented visualization

From your GeoJSON Galaxy history dataset, you can launch GIS visualization.

Hands On: Launch OpenLayers to visualize a map with your filtered records

Click on the Visualize tab on the upper menu and select Create Visualization

Click on the OpenLayers icon

Select the GeoJSON file from your history

Click on Create Visualization

Select OpenLayers

Question

You don’t see OpenLayers? Do you know why?

If you don’t see OpenLayers but other visualization types such as Cytoscape, your datatype is json, not geojson. You have to change the datatype manually before visualizing the dataset

Conclusion

In this tutorial we learned how to get occurrence records from GBIF and went through several steps to filter these data so that they are ready to be analyzed.

You've Finished the Tutorial

Key points

Take the time to look at your data first, manipulate it before analyzing it

References

Zizka, A. Cleaning GBIF data for the use in biogeography. https://ropensci.github.io/CoordinateCleaner/articles/Cleaning_GBIF_data_with_CoordinateCleaner.html

Citing this Tutorial

Yvan Le Bras, Simon Benateau, Cleaning GBIF data for the use in Ecology (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/ecology/tutorials/gbif_cleaning/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{ecology-gbif_cleaning,
author = "Yvan Le Bras and Simon Benateau",
	title = "Cleaning GBIF data for the use in Ecology (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/ecology/tutorials/gbif_cleaning/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Funding

These individuals or organisations provided funding support for the development of this resource

PNDB

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/ecology/tutorials/gbif_cleaning/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: gdal_ogr2ogr
  owner: ecology
  revisions: e12db3b4d3a6
  tool_panel_section_label: GIS Data Handling
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: spocc_occ
  owner: ecology
  revisions: f9d76a46799a
  tool_panel_section_label: Get Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
