Cleaning GBIF data using OpenRefine

Overview
Creative Commons License: CC-BY Questions:
  • How can I use OpenRefine to clean data?

  • How do I check and clean biodiversity data using OpenRefine?

Objectives:
  • Use OpenRefine faceting functionalities to apply mass editing and manage duplicates

  • Use OpenRefine clustering and filtering functionalities to edit, transform data notably using regular expression

  • Use OpenRefine to apply API services on your data

Requirements:
Time estimation: 45 minutes
Supporting Materials:
Published: Jan 21, 2025
Last modification: Jan 21, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
version Revision: 1

In this tutorial we will use OpenRefine tool to clean occurrence records retrieved from GBIF.

This tutorial is based on the GBIF Laure Russell 2021 “Biodiversity data mobilization course” accessible on GBIF website on Use case 1 “Herbarium Specimens” using file from Exercise 3C, at the bottom of the page, reachable through this direct URL.

Agenda

In this tutorial, we will cover:

  1. Retrieve data from GBIF “Biodiversity data mobilization course”
  2. Deploy an OpenRefine instance and push your data in
  3. Faceting
  4. Filtering
  5. Clustering
  6. Exporting
  7. API use
  8. Conclusion

Retrieve data from GBIF “Biodiversity data mobilization course”

Hands-on: Data import and project creation
  1. Create a new history for this tutorial and give it a name (example: “OpenRefine tutorial”) for you to find it again later if needed.

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save
    4. To cancel renaming, click the galaxy-undo “Cancel” button

    If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

  2. Import data file from GBIF Laure Russell 2021 “Biodiversity data mobilization course” accessible on GBIF website on Use case 1 “Herbarium Specimens” using file from Exercice 3C, at the bottom of the page

    “Herbarium Specimens” data file as a csv file named “UC1-3c-open-refine.csv”

    https://docs.gbif.org/course-data-mobilization/exercise-data/UC1-3c-open-refine.csv
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • param-file “Select lines from”: UC1-3c-open-refine.csv
    • In “Check”:
      • param-repeat “Insert Check”
        • “Find Regex”: ;
        • “Replacement”: \t
  4. Change the datatype from “txt” to “tabular”

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select tabular from “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Check your output. All columns must be separated by a tabulation now. See example below.

    Output of Regex Find And Replace example. Open image in new tab

    Figure 1: Output of Regex Find And Replace example
  6. Rename the datasets if needed, for example with “UC1-3c-open-refine-tabular” to keep orginal name and specify the tabular format.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

Deploy an OpenRefine instance and push your data in

Hands-on: Deploy an OpenRefine instance and import data from your Galaxy history
  1. OpenRefine tool with the following parameters:
    • “Input file in tabular format”: UC1-3c-open-refine-tabular
  2. Open the OpenRefine instance
    • go to “Open Project” space
    • select the project named “Galaxy file”
  3. You will now have your project with first 10 lines of your file displayed OpenRefine project start.

Faceting

Faceting is a feature that will allow us to get a big picture overview of the data, and to filter down to just the subset of rows that we want to change or view in bulk. It facilitates the use and analysis of data and can be done with cells containing any kind of text, numbers and dates.

Hands-on: Hands-on: Faceting and mass editing
  1. Go to column “kingdom”, and then click on the column menu and follow the route to Text facet.
    • On the left a window with the name of the column will appear, that is the facet. OpenRefine facet kingdom.
  2. Click on count to sort by count, then click on name to sort alphabetically
  3. Fix the spelling mistakes (Plante -> Plantae). Place the cursor over the text in the window and click on edit, then fix the error in the text box, and to save click on apply.
    • All the values will be fixed automatically.
Question
  1. How many Plantae occurences you now have?
  1. 736
Hands-on: Hands-on: Faceting and white space 1
  1. Go to “Country col.” column and click on column menu and perform a Text Facet
    • On a quick view, the country appears to be spelled correctly, but the facet shows three different values due to the extra spaces at the end of the text.
  2. Fix the error from the column menu on “Country col.” column, following the route Edit Cells > Common transforms > Trim leading and trailing whitespace.
    • You will see a notification message “Text transform on 38 cells in column Country col.: value.trim()” and the possibility to come back on previous state clicking on the undo hyperlink.
  3. Now check the facet window; only one value will remain.
Hands-on: hands_on Hands-on: Faceting and white space 2.
  1. Go to column “Full name” and click on then go to Text facet. Then click on count.
    • Guzmania lingulata is the first item in the list with 25 specimens, but it is also present in the 4th position with 20 specimens.
  2. Fix the error from the “Full name” column menu, Edit Cells > Common transforms > Collapse consecutive whitespaces.
Question
  1. How many Guzmania lingulata records only appear in the list finally?
  1. 45
Hands-on: hands_on Hands-on: Faceting and duplicates.
  1. Go to column catalog in “Cat. Numb”, and follow the route Facet > Customized facets > Duplicates facet.
    • The facet will show 4 duplicates
  2. Click on true, and you’ll see the values in the main window
  3. After a check with the specimens labels, fix the values clicking edit directly on the cell with the correct catalogue numbers
    • UWP:122470 Vargas P
    • UWP:122471 Vargas I
    • UWP:157351 Betancur H
    • UWP:157339 Betancur J

Filtering

Hands-on: hands_on Hands-on: Basic filter.
  1. Go again to “Full name” column menu and perform a Text facet to visualize the values
  2. Then go again to the column menu and click on Text filter
  3. Perform the following filters and fix them as described below:
    • search for “sp1” entries -> Then remove it and obtain “Cyperus”, clicking on Edit directly in the cell
    • search for “SP2” entries, check case sensitive -> To remove it and obtain “Cyperus”, you can Edit directly in the cell
    • search for “spp”.
      • Go to “Full name” column menu, then click Edit cells > Transform
      • In the text box paste the formula value.replace(" spp.", "") and click ok
Question
  1. Do you think you can use also Edit cells > Transform formula to apply the changes you made manually for the 2 first terms searched (“sp1” and “SP2”)?
  2. Why you can easily edit directly these 2 first entries and not the third one?
  1. Yes, using respectively value.replace(" sp1", "") and value.replace(" SP2", "") formulas
  2. Because the 2 first searchs only concern one entry for each when the third one return several.
Hands-on: hands_on Hands-on: Advance filter 1.
  1. Go to column “genus” and perform a Text filter.
  2. Check regular expression and case sensitive , then paste the expression “^[a-z]”
    • This regular expression filters the strings in which the first letter is lowercase.
  3. Perform a correction since the genus should be capitalized.

    Comment

    If you want to know more about regular expressions, you can consult the Openrefine regular expressions documentation and specifically the documentation for GREL function

    Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.

    Finding

    Below are just a few examples of basic expressions:

    Regular expression Matches
    abc an occurrence of abc within your data
    (abc|def) abc or def
    [abc] a single character which is either a, b, or c
    [^abc] a character that is NOT a, b, nor c
    [a-z] any lowercase letter
    [a-zA-Z] any letter (upper or lower case)
    [0-9] numbers 0-9
    \d any digit (same as [0-9])
    \D any non-digit character
    \w any alphanumeric character
    \W any non-alphanumeric character
    \s any whitespace
    \S any non-whitespace character
    . any character
    \.  
    {x,y} between x and y repetitions
    ^ the beginning of the line
    $ the end of the line

    Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.

    Examples

    Regular expression matches
    \d{4} 4 digits (e.g. a year)
    chr\d{1,2} chr followed by 1 or 2 digits
    .*abc$ anything with abc at the end of the line
    ^$ empty line
    ^>.* Line starting with > (e.g. Fasta header)
    ^[^>].* Line not starting with > (e.g. Fasta sequence)

    Replacing

    Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups (...), which we can refer to using \1, \2 etc for the first and second captured values. If you want to refer to the whole match, use &.

    Regular expression Input Captures
    chr(\d{1,2}) chr14 \1 = 14
    (\d{2}) July (\d{4}) 24 July 1984 \1 = 24, \2 = 1984

    An expression like s/find/replacement/g indicates a replacement expression, this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g) which means it doesn’t stop after the first match.

    Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.

    You can also use replacement modifier such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.

    Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.

    There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.

    Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.

    Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.

    Tip: Cyrilex is a visual regular expression tester.

Question
  1. Try to find a regular expression to apply the same kind of Text filter operation than on the previous exercise. Which regular expression can work?
  1. On my side, I found this formla value.replace(value.substring(0,1),toUppercase(value.substring(0,1)))
Hands-on: hands_on Hands-on: Advance filter 2.
  1. Go to column “Full name” and perform a Text filter.
  2. Check regular expression and case sensitive, then paste the expression ^[A-Z].*\s[A-Z]
    • This regular expression filters the strings that start with a capital letter followed by 0 or more characters, then a space, then a capital letter.
  3. Perform a correction since the second word of the name should be lowercase.

Clustering

Hands-on: hands_on Hands-on: Basic clustering.
  1. Go to County and perform a Text facet.
    • Keep in mind that the correct counties are: “Flores”, “La Libertad”, “Melchor de Mencos”, “San Andres” and “San Jose”.
  2. On the top right of the facet window click on Cluster, a new window will appear.
  3. Click on the Cluster button from this new window.
  4. Now you can see information about the clusters:
    • “Cluster size”: the number of different versions that the clustering algorithm believes to be the same.
    • “Row count”: the number of records with any of the cluster values.
    • “Values in cluster”: the actual values that the algorithm believes to be the same. There is also the number of records with each particular value, and the possibility to browse the contents of the cluster in a different tab.
    • “Merge?”: check if values are to be merged into a single standard value.
    • “New cell value”: the value to be applied to every record in the cluster. By default, it is the value with most records. You can also click on any value to apply that to the New cell value.
    Comment

    If you want to know more about clustering have a look at the manual

  5. Click on Select All and then on Merge Selected & close
    • You will see a notification message “Mass edit 119 cells in column County”.
  6. To fix the remaining counties go again to Cluster in the facet window of Count.
  7. In the Cluster and edit window, go to Keying Function, then select ngram-fingerprint
    • set “1” as the value in n-Gram Size.
  8. Press the Cluster button, you normally see a cluster about “San Andres” of size “4”.
  9. Click on Select All and then on Merge Selected & close, you will see a notification message “Mass edit 360 cells in column County”.
    • Your counties are now fixed! Congratulation!

Exporting

Hands-on: hands_on Hands-on: Exporting cleaned file into your Galaxy history.
  1. On the upper right corner click on Export and select Galaxy exporter.
    • A notification message as “Dataset has been exported to Galaxy, please close this tab” is displayed.
    • You normally have your resulting data file exported on your Galaxy history as “openrefine-Galaxt file.tsv” dataset.
    Comment

    You can also download directly file on your computer choosing Export and Custom tabular exporter, allowing you to select notably columns and rows you want to download.

API use

Reconciliation matches the information in one of your columns to an outside database. This is particularly helpful when it comes to name validation, as it proves the name you have exists somewhere else. This is a really useful service, but can be time consuming. In this case we will go through the process with only three records using the API from GBIF.

Hands-on: hands_on Hands-on: Higher taxonomy.
  1. Go to “Collector” column, then make a Text facet. Select the collector “Elsa P”.
  2. Under “Full name”, click on column menu and then Edit column > Add column by fetching URLs…
    • call the new column “Api_name”
  3. Change the Throttle Delay to 250 and paste the expression http://api.gbif.org/v1/species/match?verbose=true&name="+escape(value,'url')
  4. Click ok and wait, this might take some time depending on internet connection and the number of taxa.
  5. Go to “Api_name”, click on column menu and then Edit column > Add column based on this column....
    • Call the new column “higherClassification” and paste the expression:
    value.parseJson().get("kingdom")+
    ", "+value.parseJson().get("phylum")+
    ", "+value.parseJson().get("class")+
    ", "+value.parseJson().get("order")+
    ", "+value.parseJson().get("family")
    

    You will see the Kingdom, Phylum, Class, Order and family of each taxon.

  6. Under “higherClassification” follow the route Edit column > Split into several columns…, leave the initial settings.
  7. Now you know how to obtain the taxonomic categories of a given taxon if this is available in the GBIF API. Column names can be edited in Edit column > Rename this column.
  8. For the purpose of the original GBIF workshop, the columns created in this exercise (Higher taxonomy) must be deleted.
    • Under All, which is the first column, go to Edit columns > Re-order / remove columns….
    • Remove columns “Api_name”, “higherClassification 1”, “higherClassification 2”, “higherClassification 3”, “higherClassification 4” and “higherClassification 5”.
  9. No need to export this file as it is normally come back to previous version you already exported.

Conclusion

Here you learned how to use OpenRefine tool from Galaxy platform to clean Biodiversity data. This tutorial notably allowed you to apply some basic but powerfull functionnalities of OpenRefine to clean your data.