Pangeo ecosystem 101 for everyone - Introduction to Xarray Galaxy Tools

Author(s): Anne Fouilloux
Reviewers: Helena Rasche, Cristóbal Gallardo, Anne Fouilloux, Yvan Le Bras, Saskia Hiltemann
Overview
Questions:
  • What Xarray Galaxy Tools can I use in Galaxy and what for?

  • What is an Xarray?

  • How do I use Xarray in Galaxy?

  • How to get metadata information?

  • How to make a selection?

  • How to visualize?

  • How to filter?

  • How to make reduction operations (mean, max, min)?

  • How to resample my data?

Objectives:
  • Understand what Pangeo and Xarray are

  • Learn to get metadata information using Xarray Galaxy Tools

  • Learn to select data

  • Learn to visualize geographical data on a map

  • Learn to filter, make reduction operations (mean, max, min)

  • Learn to resample my data

Time estimation: 1 hour
Published: Feb 18, 2022
Last modification: Jun 14, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00044
Revision: 5

Pangeo is a project that effectively began in 2016 with a workshop at Columbia University. The mission developed at that workshop still holds today:

Our mission is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.

In this tutorial, you will learn how to manipulate netCDF data files using Xarray Galaxy Tools. NetCDF stands for Network Common Data Form and is one of the most popular file formats in climate science. It is used for storing multidimensional scientific data variables such as temperature or humidity, and metadata can be added to facilitate sharing of netCDF data. NetCDF is also widely used outside the climate science community, and each community has its own set of conventions, especially for metadata. The Climate and Forecast metadata convention, also called the CF-convention, is used by the climate community and is designed to promote the processing and sharing of netCDF files.

Comment: Xarray and Earth Science

Xarray works with labelled multi-dimensional arrays and can be used for a very wide range of data and data formats. In this training material, we focus on the usage of Xarray for Earth Science data following the CF-Convention. However, some Galaxy Tools also work for non-Earth Science datasets and, if needed, the current Xarray Galaxy Tools could be extended to accommodate new usage.

In this tutorial, we will be using data from the Copernicus Atmosphere Monitoring Service (CAMS).

CAMS produces daily air quality forecasts over Europe at a resolution of 0.1 degrees (approximately 10 km). The forecast is produced from an ensemble of nine air quality forecasting models across Europe: the nine models are combined and the spread between them is used to provide an estimate of the forecast uncertainty. The analysis combines model data with observations provided by the European Environment Agency (EEA).
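The "0.1 degrees is roughly 10 km" figure can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a spherical Earth with a mean radius of 6371 km, and the helper name grid_spacing_km is made up for this example. It shows that the north-south spacing is about 11 km everywhere, while the east-west spacing shrinks with latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def grid_spacing_km(resolution_deg, latitude_deg):
    """Return (north-south, east-west) grid spacing in km on a regular lat/lon grid."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = resolution_deg * km_per_degree
    # Meridians converge towards the poles, so east-west distances scale with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for lat in (35.0, 50.0, 65.0):
    ns, ew = grid_spacing_km(0.1, lat)
    print(f"latitude {lat:.0f}N: {ns:.1f} km north-south, {ew:.1f} km east-west")
```

So "approximately 10 km" is a convenient order of magnitude rather than an exact cell size.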

Several variables are generated and we will be using the PM2.5 (Particulate Matter < 2.5 μm) 4-day forecast from December 22, 2021. PM2.5 are fine particles that remain suspended for a long time, and exposure to high concentrations of PM2.5 (yearly mean values greater than 25 µg/m3) can have multiple short-term and long-term health impacts such as eye irritation, asthma and chronic bronchitis. Information on air quality standards in Europe can be found here. The dataset we will be using in this tutorial is very small and there is no need to parallelize our data analysis. Parallel data analysis with Pangeo is not covered in this tutorial.

Agenda

In this tutorial, we will cover:

  1. Create a history
    1. Upload CAMS PM2.5 data
  2. Understanding our dataset
    1. Get metadata
  3. Plotting our dataset on a geographical map
  4. Select / Subset from coordinates
  5. Masking with Where statement
  6. From Xarray to Tabular Data
  7. Conclusion

Create a history

Hands-on: Create history
  1. Make sure you start from an empty analysis history.

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Rename your history to be meaningful and easy to find. For instance, you can choose Pangeo 101 for everyone - Xarray as the name of your new history.

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save
    4. To cancel renaming, click the galaxy-undo “Cancel” button

    If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

Upload CAMS PM2.5 data

Hands-on: Data upload
  1. Import the files from Zenodo or from the shared data library (GTN - Material -> climate -> Pangeo ecosystem 101 for everyone - Introduction to Xarray Galaxy Tools):

    https://zenodo.org/record/5805953/files/CAMS-PM2_5-20211222.netcdf
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  2. If needed, rename the dataset to CAMS-PM2_5-20211222.netcdf
  3. Check that the datatype is netcdf

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select netcdf from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  4. Add a tag corresponding to ads (for Atmosphere Data Store)

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

Understanding our dataset

Our CAMS PM2.5 forecast dataset is in netCDF format. You could find the same dataset in different formats such as GRIdded Binary or General Regularly-distributed Information in Binary form (GRIB) or geoTIFF. The same Xarray Tools can be used with these other data formats. More information about this particular data set can be found on the CAMS European air quality forecast webpage. As mentioned earlier, we use netCDF data format because it is the most popular among climate scientists.

To understand what is contained in our dataset, we will first use the Xarray metadata Galaxy Tool, which gives us all the metadata information about the dataset.

Get metadata

Global metadata information

Hands-on: netCDF dataset with Xarray metadata Galaxy Tool
  1. NetCDF xarray Metadata Info ( Galaxy version 0.15.1) with the following parameters:

    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
  2. View galaxy-eye the two generated outputs:

    • Metadata infos is a tabular file listing the variables, their dimension names and the number of elements per dimension. This file is used by other Xarray Tools.
    • The second file, info file, provides a summary of the Xarray Dataset contained in your netCDF file.

In the info file output, we can identify four sections:

  1. Dimensions: name of dimensions and corresponding number of elements;
  2. Coordinates: contains coordinate arrays (longitude, latitude, level and time) with their values.
  3. Data variables: contains all the variables available in the dataset. Here, we only have one variable. For each variable, we get information on its shape and values.
  4. Global Attributes: at this level, we get the global attributes of the dataset. Each attribute has a name and a value.
Question: CAMS PM2.5 Dataset

What is the name of the variable for Particulate Matter < 2.5 μm and what are its physical units?

  1. Information about variable names and units can be found in the info file that was generated by the Xarray metadata Galaxy Tool.
    • Variable name: pm2p5_conc (standard name: mass_concentration_of_pm2p5_ambient_aerosol_in_air)
    • Units: µg/m3
Output
xarray.Dataset {
dimensions:
	latitude = 400 ;
	level = 1 ;
	longitude = 700 ;
	time = 97 ;

variables:
	float32 longitude(longitude) ;
		longitude:long_name = longitude ;
		longitude:units = degrees_east ;
	float32 latitude(latitude) ;
		latitude:long_name = latitude ;
		latitude:units = degrees_north ;
	float32 level(level) ;
		level:long_name = level ;
		level:units = m ;
	timedelta64[ns] time(time) ;
		time:long_name = FORECAST time from 20211222 ;
	float32 pm2p5_conc(time, level, latitude, longitude) ;
		pm2p5_conc:species = PM2.5 Aerosol ;
		pm2p5_conc:units = µg/m3 ;
		pm2p5_conc:value = hourly values ;
		pm2p5_conc:standard_name = mass_concentration_of_pm2p5_ambient_aerosol_in_air ;

// global attributes:
	:title = PM25 Air Pollutant FORECAST at the Surface ;
	:institution = Data produced by Meteo France ;
	:source = Data from ENSEMBLE model ;
	:history = Model ENSEMBLE FORECAST ;
	:FORECAST = Europe, 20211222+[0H_96H] ;
	:summary = ENSEMBLE model hourly FORECAST of PM25 concentration at the Surface from 20211222+[0H_96H] on Europe ;
	:project = MACC-RAQ (http://macc-raq.gmes-atmosphere.eu) ;
}
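For readers who also work in Python, the four sections of the info file map onto attributes of an xarray Dataset. The sketch below builds a tiny synthetic dataset that only borrows the names from the output above (the values and grid are made up). It is not the code of the Galaxy tool. With the real file you would typically start from xr.open_dataset("CAMS-PM2_5-20211222.netcdf") instead, assuming a netCDF backend is installed.

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the CAMS file: same names, far fewer grid points.
hours_as_ns = (np.arange(4) * 3_600_000_000_000).astype("timedelta64[ns]")

ds = xr.Dataset(
    data_vars={
        "pm2p5_conc": (
            ("time", "level", "latitude", "longitude"),
            np.zeros((4, 1, 3, 5), dtype="float32"),
            {
                "units": "µg/m3",
                "standard_name": "mass_concentration_of_pm2p5_ambient_aerosol_in_air",
            },
        )
    },
    coords={
        "time": hours_as_ns,
        "level": np.array([0.0], dtype="float32"),
        "latitude": np.array([43.05, 42.95, 42.85], dtype="float32"),
        "longitude": np.array([11.05, 11.15, 11.25, 11.35, 11.45], dtype="float32"),
    },
    attrs={"title": "Synthetic PM2.5 example"},  # made-up global attribute
)

print(dict(ds.sizes))      # 1. Dimensions: name -> number of elements
print(list(ds.coords))     # 2. Coordinates
print(list(ds.data_vars))  # 3. Data variables
print(ds.attrs)            # 4. Global attributes
print(ds["pm2p5_conc"].attrs["units"])  # per-variable attributes such as units
```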

Coordinates information

Hands-on: Get Coordinate information with Xarray Coordinate
  1. NetCDF xarray Coordinate Info ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
  2. View galaxy-eye the 5 generated outputs:
    • latitude: a tabular file containing all the latitude values of our Xarray dataset;
    • longitude: a tabular file containing all the longitude values;
    • level: this file contains information on all the different levels (here, we have surface data so level=0 meter);
    • time: this tabular file contains all the forecast times. In our case, these are relative to December 22, 2021;
    • version: this is a text file returning the Xarray package version. It is useful when publishing your Galaxy workflow.
    Comment

    This tool returns as many tabular files as the number of coordinate variables present in your input file. The values are decoded from the netCDF input file and no further processing is done. So units for instance for latitudes, longitudes, level and time may vary from one file to another depending on how it was coded in the original input file.

Question: Understanding PM2.5 forecast coordinates
  1. What is the unit of the time coordinate?
  2. What is the frequency of PM2.5 forecasts?
  3. What is the range of values for latitudes and longitudes?
  1. The info file tells us that time is coded as timedelta64[ns], i.e. as time differences (here in nanoseconds). Here the reference time is December 22, 2021. If we look at the tabular file named time (generated by NetCDF xarray Coordinate Info), we see that these times are automatically converted to a human-readable format when printed:
Output
0	0 days 00:00:00
1	0 days 01:00:00
2	0 days 02:00:00
3	0 days 03:00:00
4	0 days 04:00:00

This tells us that we have hourly forecast data. The last forecast time is 4 days 00:00:00, which means that the last forecast is valid 4 days after December 22, 2021 at 00:00 UTC, i.e. on December 26, 2021 at 00:00 UTC.
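The relation between the stored time differences and calendar dates can be sketched in a few lines of standard-library Python. This is an illustration only: the reference date is read from the "FORECAST time from 20211222" attribute, UTC is assumed, and valid_time is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Reference time taken from the "FORECAST time from 20211222" attribute (UTC assumed).
REFERENCE = datetime(2021, 12, 22, tzinfo=timezone.utc)
NS_PER_HOUR = 3_600 * 10**9


def valid_time(lead_ns):
    """Convert a lead time in nanoseconds into the date the forecast is valid for."""
    # datetime.timedelta has microsecond resolution, hence the division by 1000.
    return REFERENCE + timedelta(microseconds=lead_ns // 1_000)


# 97 hourly steps: lead times 0 h, 1 h, ..., 96 h.
lead_times_ns = [hour * NS_PER_HOUR for hour in range(97)]

print(valid_time(lead_times_ns[0]).isoformat())   # first forecast step
print(valid_time(lead_times_ns[-1]).isoformat())  # last forecast step, "4 days 00:00:00"
print(valid_time(60 * NS_PER_HOUR).isoformat())   # "2 days 12:00:00"
```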

Plotting our dataset on a geographical map

Hands-on: Map plot

We will use the Xarray mapplot Galaxy Tool to plot PM2.5 on December 22, 2021.

  1. NetCDF xarray map plotting ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222.netcdf
    • “Choose the variable to plot”: pm2p5_conc
    • “Name of latitude coordinate”: latitude
    • “Name of longitude coordinate”: longitude
    • “Datetime selection”: Yes
      • param-file “Tabular of time values”: time
      • “Choose the times to plot”: 0 days 00:00:00
    • “Shift longitudes [0,360] –> [-180,180]”: Yes
    • “Range of values for plotting e.g. minimum value and maximum value (minval,maxval) (optional)”: 0,35
    • “Add country borders with alpha value [0-1] (optional)”: 0.2
    • “Add coastline with alpha value [0-1] (optional)”: 0.5
    • “Specify which colormap to use for plotting (optional)”: roma_r
    • “Specify the projection (proj4) on which we draw e.g. {“proj”:”PlateCarree”} with double quote (optional)”: {'proj': 'Mercator', 'central_longitude': 12.0}

CAMS PM2.5 December 22nd, 2021 at 00:00 UTC.

Comment: Why shifting longitudes?

Longitudes are coded from 0 to 360 degrees. As our data is not global and only covers Europe, we need to shift longitudes so that NetCDF xarray map plotting can plot our dataset properly.
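The shift itself is a one-line formula. The sketch below shows the usual modulo-based conversion. It is an illustration of the idea and not necessarily how the Galaxy tool implements it, and shift_longitude is a hypothetical name.

```python
def shift_longitude(lon_deg):
    """Map a longitude from the [0, 360) convention to the [-180, 180) convention."""
    return (lon_deg + 180.0) % 360.0 - 180.0


# Longitudes east of Greenwich are unchanged; those beyond 180 become negative (west).
for lon in (0.0, 15.0, 180.0, 335.0, 359.9):
    print(lon, "->", round(shift_longitude(lon), 1))
```

For a region such as Europe that straddles the Greenwich meridian, the western part is stored near 360 degrees and the eastern part near 0, so the domain only becomes contiguous after the shift.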

Question: Visualize and Compare

Make a plot to visualize the forecast for December 24th, 2021 at 12:00 UTC. Do you see any obvious differences with the plot from December 22, 2021 at 00:00 UTC?

Data starts on December 22nd, 2021 at 00:00 UTC so we need to add 2 days and 12 hours to select the correct time index. We reuse the same NetCDF xarray map plotting with a different selection for time:

NetCDF xarray map plotting ( Galaxy version 0.18.2+galaxy0) with the following parameters:

  • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
  • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222.netcdf
  • “Choose the variable to plot”: pm2p5_conc
  • “Name of latitude coordinate”: latitude
  • “Name of longitude coordinate”: longitude
  • “Datetime selection”: Yes
    • param-file “Tabular of time values”: time
    • “Choose the times to plot”: 2 days 12:00:00
  • “Shift longitudes [0,360] –> [-180,180]”: Yes
  • “Range of values for plotting e.g. minimum value and maximum value (minval,maxval) (optional)”: 0,35
  • “Add country borders with alpha value [0-1] (optional)”: 0.2
  • “Add coastline with alpha value [0-1] (optional)”: 0.5
  • “Specify which colormap to use for plotting (optional)”: roma_r
  • “Specify the projection (proj4) on which we draw e.g. {“proj”:”PlateCarree”} with double quote (optional)”: {'proj': 'Mercator', 'central_longitude': 12.0}

CAMS PM2.5 December 24th, 2021 at 12:00 UTC.

Select / Subset from coordinates

Hands-on: NetCDF xarray operations manipulate xarray from netCDF and save back to netCDF
  1. NetCDF xarray operations ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222.netcdf
    • “Choose the variable to extract”: pm2p5_conc
    • In “additional filter”:
      • param-repeat “Insert additional filter”
        • “Dimensions”: time
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 0 days 00:00:00
          • “Choose the end value for slice”: 1 days 00:00:00
  2. Rename the output dataset to CAMS-PM2_5-20211222_fc0-23h.netcdf
  3. Add a tag corresponding to 0-23h (do not forget to add # in front of the tag)
  4. NetCDF xarray Coordinate Info ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc0-23h.netcdf
  5. Check the generated outputs, in particular time. The tabular file time now contains only 24 lines, with times from 0 days 00:00:00 to 0 days 23:00:00.

    Comment: slice threshold2 not included in selection

    When selecting a range with slice, the upper limit (here 1 days 00:00:00) is not included in the selection.
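The selection described in this hands-on behaves like a half-open interval: the start value is kept and the end value is dropped. The standard-library sketch below mirrors that behaviour on the 97 hourly lead times. It only illustrates the rule stated in the comment above and does not reproduce the tool's implementation. half_open_slice is a hypothetical helper.

```python
from datetime import timedelta

# Hourly lead times of the 4-day forecast: 0 h, 1 h, ..., 96 h (97 values).
lead_times = [timedelta(hours=h) for h in range(97)]


def half_open_slice(values, start, end):
    """Keep every value v with start <= v < end (the end value is excluded)."""
    return [v for v in values if start <= v < end]


first_day = half_open_slice(lead_times, timedelta(days=0), timedelta(days=1))

print(len(first_day))                     # number of selected time steps
print(first_day[0], "to", first_day[-1])  # first and last selected lead time
```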

Question: PM2.5 over Italy region

Using a selection and making plots of PM2.5 over Italy (latitudes from 40°N to 43°N and longitudes from 11°E to 15°E), can you tell us if the forecasted PM2.5 will increase or decrease between 10:00 UTC and 17:00 UTC during the first 24 hours of the forecast? Over which town in Italy do you see high values?

  1. NetCDF xarray operations ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222.netcdf
    • “Choose the variable to extract”: pm2p5_conc
    • In “additional filter”:
      • param-repeat “Insert additional filter”
        • “Dimensions”: time
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 0 days 10:00:00
          • “Choose the end value for slice”: 0 days 18:00:00
      • param-repeat “Insert additional filter”
        • “Dimensions”: latitude
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 43.05
          • “Choose the end value for slice”: 40.05
      • param-repeat “Insert additional filter”
        • “Dimensions”: longitude
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 11.05
          • “Choose the end value for slice”: 15.05
  2. Rename the output dataset to CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
  3. Add a tag corresponding to 0-23h-Italy
  4. NetCDF xarray Metadata Info ( Galaxy version 0.15.1) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
  5. NetCDF xarray Coordinate Info ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
  6. NetCDF xarray map plotting ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • “Choose the variable to plot”: pm2p5_conc
    • “Name of latitude coordinate”: latitude
    • “Name of longitude coordinate”: longitude
    • “Datetime selection”: Yes
      • param-file “Tabular of time values”: time
      • “Choose the times to plot”: Tick Select all
    • “Shift longitudes [0,360] –> [-180,180]”: No
    • “Range of values for plotting e.g. minimum value and maximum value (minval,maxval) (optional)”: 0,35
    • “Add country borders with alpha value [0-1] (optional)”: 0.2
    • “Add coastline with alpha value [0-1] (optional)”: 0.5
    • “Specify which colormap to use for plotting (optional)”: roma_r
    • “Specify the projection (proj4) on which we draw e.g. {“proj”:”PlateCarree”} with double quote (optional)”: {'proj': 'Mercator', 'central_longitude': 12.0}
  7. Image Montage ( Galaxy version 1.3.31+galaxy1) with the following parameters:
    • param-files “Images”: Map plots
    • param-text ”# of images wide”: 4

CAMS PM2.5 Italy 10:00 - 17:00 December 22, 2021.

From the plot there is no obvious trend over this entire region of Italy. However, we clearly see that PM2.5 is always higher over Naples and tends to spread in the South-East direction by the end of the day (on that particular date).

Comment: `latitude=slice(43.05, 40.05)` and not `latitude=slice(40.05, 43.05)`

Why did we slice latitudes with latitude=slice(43.05, 40.05) and not latitude=slice(40.05, 43.05)?

  • When using slice, you need to specify values in the same order as in the coordinates. Latitudes are stored in decreasing order for CAMS, so the larger value comes first.
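The effect of coordinate order can be sketched with plain Python. The latitude list below is a made-up descending grid with 0.1 degree spacing (not the real CAMS coordinate values), and slice_in_coordinate_order is a hypothetical helper that follows the stored order and excludes the end value, as described in this tutorial.

```python
def slice_in_coordinate_order(coords, start, end):
    """Select values from start up to (not including) end, following the stored order."""
    descending = coords[0] > coords[-1]
    if descending:
        return [c for c in coords if end < c <= start]
    return [c for c in coords if start <= c < end]


# Hypothetical descending latitude axis: 45.05, 44.95, ..., 39.15 (north to south).
latitudes = [round(45.05 - 0.1 * i, 2) for i in range(60)]

north_to_south = slice_in_coordinate_order(latitudes, 43.05, 40.05)
south_to_north = slice_in_coordinate_order(latitudes, 40.05, 43.05)

print(len(north_to_south), north_to_south[0], north_to_south[-1])
print(south_to_north)  # empty: the bounds run against the coordinate order
```

Giving the bounds in the wrong order does not raise an error in this sketch. It silently returns an empty selection, which is why checking the coordinate order first is worthwhile.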

Masking with Where statement

  • Sometimes we may want to make more complex selections with criteria on the values of a given variable and not only on its coordinates. For this we use where.
  • For instance, we may want to only keep PM2.5 if values are greater than a chosen threshold.
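The idea behind where can be sketched without any library: the grid keeps its shape, and values that do not satisfy the condition are replaced by NaN (this is also what xarray's DataArray.where does by default). The PM2.5 values below are made up for illustration and where_greater is a hypothetical helper.

```python
import math


def where_greater(grid, threshold):
    """Keep values above the threshold; replace the others with NaN (shape is preserved)."""
    return [[v if v > threshold else math.nan for v in row] for row in grid]


# Hypothetical PM2.5 values (µg/m3) on a 3 x 4 latitude/longitude grid.
pm25 = [
    [12.0, 18.5, 31.2, 29.9],
    [9.4, 33.0, 35.7, 22.1],
    [7.8, 14.2, 30.0, 16.3],
]

masked = where_greater(pm25, 30.0)
kept = sum(not math.isnan(v) for row in masked for v in row)

print(masked)
print(kept, "grid points exceed the threshold")
```

Because masked points become NaN rather than being removed, the result can still be plotted on the same map, with the masked cells simply left blank.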
Hands-on: Plot where PM2.5 is greater than 30 μg.m-3
  1. NetCDF xarray map plotting ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • “Choose the variable to plot”: pm2p5_conc
    • “Name of latitude coordinate”: latitude
    • “Name of longitude coordinate”: longitude
    • “Datetime selection”: Yes
      • param-file “Tabular of time values”: time
      • “Choose the times to plot”: 0 days 10:00:00
    • “Shift longitudes [0,360] –> [-180,180]”: No
    • “Range of values for plotting e.g. minimum value and maximum value (minval,maxval) (optional)”: 0,35
    • “Do not plot values below this threshold (optional)”: 30
    • “Add country borders with alpha value [0-1] (optional)”: 0.2
    • “Add coastline with alpha value [0-1] (optional)”: 0.5
    • “Specify which colormap to use for plotting (optional)”: roma_r
    • “Specify the projection (proj4) on which we draw e.g. {“proj”:”PlateCarree”} with double quote (optional)”: {'proj': 'Mercator', 'central_longitude': 12.0}

CAMS PM2.5 Italy 10:00 December 22, 2021.

Now we clearly see that values of PM2.5 > 30 μg.m-3 are only found over Naples on December 22, 10:00 UTC.

Question: PM2.5 over Italy over 30 μg.m-3

Using the same geographical region over Italy, can you tell us if the forecasted PM2.5 will exceed 30 μg.m-3 between 10:00 UTC and 17:00 UTC on December 22, 2021?

  1. NetCDF xarray map plotting ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222_fc10-17h_Italy.netcdf
    • “Choose the variable to plot”: pm2p5_conc
    • “Name of latitude coordinate”: latitude
    • “Name of longitude coordinate”: longitude
    • “Datetime selection”: Yes
      • param-file “Tabular of time values”: time
      • “Choose the times to plot”: Tick Select all
    • “Shift longitudes [0,360] –> [-180,180]”: No
    • “Range of values for plotting e.g. minimum value and maximum value (minval,maxval) (optional)”: 0,35
    • “Do not plot values below this threshold (optional)”: 30
    • “Add country borders with alpha value [0-1] (optional)”: 0.2
    • “Add coastline with alpha value [0-1] (optional)”: 0.5
    • “Specify which colormap to use for plotting (optional)”: roma_r
    • “Specify the projection (proj4) on which we draw e.g. {“proj”:”PlateCarree”} with double quote (optional)”: {'proj': 'Mercator', 'central_longitude': 12.0}
  2. Image Montage ( Galaxy version 1.3.31+galaxy1) with the following parameters:
    • param-files “Images”: Browse the dataset and manually select all images (png files)
    • param-text ”# of images wide”: 4

CAMS PM2.5 Italy 10:00 - 17:00 December 22, 2021 with PM2.5 > 30 μg.m-3.

Using thresholds, we can clearly identify where, if anywhere, there are “high” values of PM2.5. On that particular day there are a few pixels where PM2.5 values exceed 30 μg.m-3.

From Xarray to Tabular Data

Hands-on: Xarray selection

We will select a single location: Naples (40.8518° N, 14.2681° E) and select the grid point that is closest to Naples.

  1. NetCDF xarray operations ( Galaxy version 0.18.2+galaxy0) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222.netcdf
    • param-file “Tabular of variables”: Metadata infos from CAMS-PM2_5-20211222.netcdf
    • “Choose the variable to extract”: pm2p5_conc
    • In “additional filter”:
      • param-repeat “Insert additional filter”
        • “Dimensions”: latitude
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 40.95
          • “Choose the end value for slice”: 40.85
      • param-repeat “Insert additional filter”
        • “Dimensions”: longitude
        • “Comparator”: slice(threshold1,threshold2)
          • “Choose the start value for slice”: 14.25
          • “Choose the end value for slice”: 14.35
  2. Rename the output dataset to CAMS-PM2_5-20211222_Naples.netcdf
  3. Add a tag corresponding to Naples
  4. NetCDF xarray Selection ( Galaxy version 0.15.1) with the following parameters:
    • param-file “Netcdf file”: CAMS-PM2_5-20211222_Naples.netcdf
    • “Choose the variable to extract”: pm2p5_conc
    • “Source of coordinates”: Manually enter coordinates
      • “Geographical area”: Whole available region
    • In “Select Time series”:
      • “Datetime selection”: No
  5. Rename the output dataset to CAMS-PM2_5-20211222_Naples.tabular
  6. View galaxy-eye the generated file. It is a tabular file with the time series of PM2.5 concentrations over Naples. The total number of lines is 97 but we only show the first 5 lines.

    Output
    	time	level	latitude	longitude	pm2p5_conc
    	0	0 days 00:00:00.000000000	0.0	40.95000076293945	14.25	24.00212
    	1	0 days 01:00:00.000000000	0.0	40.95000076293945	14.25	23.5767
    	2	0 days 02:00:00.000000000	0.0	40.95000076293945	14.25	21.383186
    	3	0 days 03:00:00.000000000	0.0	40.95000076293945	14.25	20.04839
    	4	0 days 04:00:00.000000000	0.0	40.95000076293945	14.25	18.347801
    
Question: PM2.5 at Naples over the 4 forecasted days

From a qualitative point of view, can you say if PM2.5 may increase or decrease over the 4 forecasted days?

We can make a simple plot using Scatterplot with ggplot2 or climate stripes:

  1. Scatterplot with ggplot2 ( Galaxy version 2.2.1+galaxy2) with the following parameters:
    • param-file “Input in tabular format”: CAMS-PM2_5-20211222_Naples.tabular
    • “Column to plot on x-axis”: 1
    • “Column to plot on y-axis”: 6
    • “Label for x axis”: Forecast time (hour) from December 22, 2021
    • “Label for y axis”: Particulate Matter < 2.5 μm (μg.m-3)
    • In “Advanced Options”:
    • “Type of plot”: Lines only
    • “Data point options”: User defined point options
      • “Transparency of points (On a scale of 0-1; 0=transparent, 1=default)”: 0.7
    • “Plotting multiple groups”: No thanks - just plot the data as one group
    • “Axis title options”: Default
    • “Axis text options”: Default
    • “Plot title options”: Default
    • “Axis scaling”: Automatic axis scaling

    CAMS PM2.5 Naples.
  2. Column Regex Find And Replace ( Galaxy version 1.0.1) with the following parameters:
    • param-file “Select cells from”: CAMS-PM2_5-20211222_Naples.tabular
    • “using column”: c2
    • In “Check”:
    • param-repeat “Insert Check”
      • “Find Regex”: 0 days
      • “Replacement”: 20211222
    • param-repeat “Insert Check”
      • “Find Regex”: 1 days
      • “Replacement”: 20211223
    • param-repeat “Insert Check”
      • “Find Regex”: 2 days
      • “Replacement”: 20211224
    • param-repeat “Insert Check”
      • “Find Regex”: 3 days
      • “Replacement”: 20211225
    • param-repeat “Insert Check”
      • “Find Regex”: 4 days
      • “Replacement”: 20211226
  3. Rename your dataset to CAMS-PM2_5-20211222_Naples_with_dates.tabular
  4. climate stripes ( Galaxy version 1.0.1) with the following parameters:
    • “column name to use for plotting”: pm2p5_conc
    • “plot title”: PM2.5 4 days forecast from December 22 2021 over Naples
    • In “Advanced Options”:
    • “column name to use for x-axis”: time
    • “format for input date/time column”: %Y%m%d %H:%M:%S.%f
    • “format for plotting dates on the x-axis”: %d %b %H hours
    • ”“: winter

CAMS PM2.5 Naples Stripes.

From December 24, 2021 at 00:00 UTC onwards, PM2.5 concentrations are much lower than at the beginning. This is visible both on the 1D plot and on the stripes.
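The find-and-replace step turns relative lead times into strings that match the date format given to climate stripes. The sketch below reproduces that text transformation with the Python standard library. It is an illustration under stated assumptions, not the code run by either Galaxy tool, and the helper names are made up. Note that Python's %f accepts at most six fractional digits, so the sketch drops the last three of the nine nanosecond digits before parsing.

```python
import re
from datetime import datetime

# Same five find/replace pairs as in the Column Regex Find And Replace step.
DAY_TO_DATE = {
    "0 days": "20211222",
    "1 days": "20211223",
    "2 days": "20211224",
    "3 days": "20211225",
    "4 days": "20211226",
}


def replace_day_offsets(cell):
    """Apply each find/replace pair in turn to one cell of the time column."""
    for pattern, replacement in DAY_TO_DATE.items():
        cell = re.sub(pattern, replacement, cell)
    return cell


def parse_forecast_time(cell):
    """Parse a replaced cell using the date format from the climate stripes step."""
    # Drop the last 3 of the 9 fractional digits (zero for hourly data) so that %f can parse it.
    return datetime.strptime(cell[:-3], "%Y%m%d %H:%M:%S.%f")


cell = replace_day_offsets("2 days 12:00:00.000000000")
print(cell)
print(parse_forecast_time(cell))
```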

Conclusion

Well done! In this tutorial, Xarray Galaxy Tools have been introduced and we learned to use these tools on a real dataset from the Copernicus Atmosphere Monitoring Service. We encourage you to try them with your own datasets.