Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.
Press P again to switch presenter notes off.
Press C to create a new window where the same presentation will be displayed.
This window is linked to the main window: changing slides in one will cause the slide to change in the other.
Useful when presenting.
Before diving into this slide deck, we recommend that you have a look at:
What is Pangeo?
What is the Pangeo Software Ecosystem?
What is ARCO (Analysis Ready, Cloud Optimized) Data?
What is Pangeo forge?
What is STAC?
Why and how to use STAC?
How do Pangeo forge and STAC relate to each other?
How can I use and/or contribute to Pangeo?
Where to go to learn more about Pangeo?
This presentation is a summary of:
This community provides documentation, develops and maintains Open Source software, and deploys computing infrastructure to make scientific research and programming easier.
Pangeo is funded through many different projects in the USA, Europe and Australia, but the main funders are NSF, EarthCube, NASA and the Gordon and Betty Moore Foundation.
There are several building crises facing the geoscience community:
Pangeo aims to address these challenges through a unified, collaborative effort.
The mission of Pangeo is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.
Source: Pangeo Tutorial - Ocean Sciences 2020 by Ryan Abernathey, February 17, 2020.
Xarray expands on NumPy arrays and pandas. Xarray has two core data structures:
DataArray is our implementation of a labeled, N-dimensional array. It is a generalization of a pandas.Series.
Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
Source: Xarray documentation
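The two structures described above can be illustrated with a minimal sketch. The variable names, coordinates and values below are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates for a tiny 2 x 3 grid (illustrative values only).
coords = {"lat": [45.0, 50.0], "lon": [0.0, 5.0, 10.0]}

# A DataArray is an N-dimensional array with named dimensions and coordinates.
temperature = xr.DataArray(
    np.array([[12.0, 13.5, 14.1], [11.2, 12.8, 13.9]]),
    dims=("lat", "lon"),
    coords=coords,
    name="temperature",
)
salinity = xr.DataArray(
    np.full((2, 3), 35.0),
    dims=("lat", "lon"),
    coords=coords,
    name="salinity",
)

# A Dataset is a dict-like container of DataArrays sharing dimensions.
dataset = xr.Dataset({"temperature": temperature, "salinity": salinity})

# Select by coordinate label rather than by integer position.
value_at_point = float(dataset["temperature"].sel(lat=50.0, lon=5.0))

# Reduce along a named dimension instead of an axis number.
mean_per_lat = dataset["temperature"].mean(dim="lon")
```

Selecting and reducing by name (`lat`, `lon`) rather than by axis index is the main practical difference from plain NumPy arrays.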
A powerful, format-agnostic, community-driven Python package for analysing and visualising Earth science data.
Source: Scitools Iris documentation
Enabling performance at scale for the tools you love
Dask accelerates the existing Python ecosystem (Numpy, Pandas, Scikit-learn)
Source: Dask documentation
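As a rough sketch of how Dask scales a NumPy-style computation, the example below builds a chunked array and evaluates a reduction lazily. The array size and chunk size are arbitrary choices made for illustration.

```python
import dask.array as da

# A 2000 x 2000 array of ones, split into 16 chunks of 500 x 500.
x = da.ones((2000, 2000), chunks=(500, 500))

# This only builds a task graph; no chunk is computed yet.
lazy_total = (x + x.T).sum()

# compute() executes the graph, processing chunks (possibly in parallel).
total = lazy_total.compute()
```

The same pattern applies when the chunks are too large to fit in memory together, because each chunk is processed independently before the partial results are combined.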
Free software, open standards, and web services for interactive computing across all programming languages
Source: Jupyter documentation
You may have heard about the Jupyter ecosystem and wonder why it is presented here as part of the Pangeo ecosystem.
Jupyter plays an important role for the Pangeo community.
Jupyter is free, follows open standards, and has web services for interactive computing across all programming languages.
What is "Analysis Ready"?
What is "Cloud Optimized"?
Pangeo Forge is an open source platform for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional repositories and deposit this data in cloud object storage in an analysis-ready, cloud optimized (ARCO) format.
Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages.
It is under active development and the Pangeo community hopes it will play a role in democratizing the publication of datasets in ARCO format.
This may look complicated in this figure but, as with Conda Forge, most of the process is automated.
The goal of Pangeo Forge is to "convert" existing datasets from their native format into ARCO format.
They can then be used by anyone from anywhere.
Let's imagine you have a bunch of data from NOAA in a traditional data repository.
Instead of manually converting them to ARCO format, you create a recipe (in practice, you often reuse an existing one) that automatically transforms the original datasets into ARCO format and publishes them to S3-compatible object storage such as Amazon S3.
The next step is then to tell the community where and how to access your transformed dataset.
In this example, we search for data over the main region (France) between 1 January 2019 and 4 June 2019 using STAC catalogs.
The result shows that data is available for 108 dates and for Landsat-8, Sentinel-2A, etc.
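A search like the one above can be expressed as a STAC API item-search request body. The sketch below only builds that body and does not contact any catalog. The helper `build_stac_search` is a hypothetical name, and the bounding box is an approximate assumption for France, not the one used on the slide.

```python
import json


def build_stac_search(bbox, start, end, limit=100):
    """Build a STAC API item-search body (hypothetical helper).

    bbox is (west, south, east, north) in degrees; start and end are
    ISO dates (YYYY-MM-DD) that are expanded to a closed UTC interval.
    """
    return {
        "bbox": list(bbox),
        "datetime": f"{start}T00:00:00Z/{end}T23:59:59Z",
        "limit": limit,
    }


# Approximate bounding box for France (an assumption for illustration).
FRANCE_BBOX = (-5.2, 41.3, 9.6, 51.1)

body = build_stac_search(FRANCE_BBOX, "2019-01-01", "2019-06-04")

# JSON payload that would be POSTed to a STAC API /search endpoint.
payload = json.dumps(body)
```

Because every STAC API accepts the same `bbox` and `datetime` parameters, the same request body can be sent to different catalogs without modification.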
Many projects are now built around STAC.
All packages that rely on or work well with STAC are listed at stacindex.org/ecosystem.
If you are a Python programmer, you will probably make use of intake-stac: this is currently the most popular Python package for discovering, exploring, and loading spatio-temporal datasets.
Join and contribute to STAC: https://github.com/radiantearth/stac-spec
The Pangeo project is completely open to involvement from anyone with interest.
There are many ways to get involved:
For more information, consult the Frequently Asked Questions.
Everyone is welcome to the Pangeo Weekly Community Meeting.
For more information, consult the Frequently Asked Questions on the Pangeo website.
Everyone is welcome to the Pangeo Weekly Community Meetings: they are organized in different time zones to increase accessibility.
Pangeo is an inclusive community promoting open, reproducible and scalable science.
The Pangeo software ecosystem involves open source tools such as Xarray, Iris, Dask, Jupyter, and many other packages.
On the cloud, the Analysis Ready, Cloud Optimized (ARCO) data format is preferable.
Pangeo Forge eases the extraction, transformation and loading of Earth Science datasets.
SpatioTemporal Asset Catalogs (STAC) help to provide a unified interface for searching and extracting spatio-temporal datasets.
STAC and Pangeo Forge aim at complementing each other.
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.