<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# RO-Crate in Python

by [Simone Leo](https://training.galaxyproject.org/hall-of-fame/simleo/), [Bruno P. Kinoshita](https://training.galaxyproject.org/hall-of-fame/kinow/)

Apache-2.0 licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Questions**

- What data is contained within an RO-Crate?
- How can I create an RO-Crate myself?

**Objectives**

- Create a custom, annotated RO-Crate
- Use ORCIDs and other linked data to annotate datasets contained within the crate

**Time Estimation: 30M**
</div>


<p>This tutorial will show you how to manipulate <a href="https://w3id.org/ro/crate/">RO-Crates</a> in Python using the <a href="https://github.com/ResearchObject/ro-crate-py">ro-crate-py</a> package. It is based on the <a href="https://github.com/ResearchObject/ro-crate-py/blob/e1218fbca595f4c33059cfe15849ee2ae9e6896b/README.md">ro-crate-py documentation</a>.</p>
<p>Let’s start by installing the library via <a href="https://docs.python.org/3/installing/">pip</a>. Note that the name of the package is <code style="color: inherit">rocrate</code>.</p>


In [None]:
pip install rocrate

<h1 id="creating-an-ro-crate">Creating an RO-Crate</h1>
<p>In its simplest form, an RO-Crate is a directory tree with an <code style="color: inherit">ro-crate-metadata.json</code> file at the top level. This file contains metadata about the other files and directories, represented by <a href="https://www.researchobject.org/ro-crate/1.1/data-entities.html">data entities</a>. These metadata consist both of properties of the data entities themselves and of other, non-digital entities called <a href="https://www.researchobject.org/ro-crate/1.1/contextual-entities.html">contextual entities</a>. A contextual entity can represent, for instance, a person, an organization or an event.</p>
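To make this structure concrete, here is a minimal sketch that builds a metadata document by hand using only the standard library. The entity values (`data.csv`, the `#alice` identifier) are made up for illustration. The overall layout (an `@context`, a flat `@graph`, a metadata file descriptor pointing at the root `./` dataset) reflects our reading of the RO-Crate 1.1 specification. In practice, ro-crate-py generates this file for you.

```python
import json

# Hypothetical minimal crate: one data entity and one contextual entity.
metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # Metadata file descriptor: describes the metadata file itself
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root data entity: represents the crate as a whole
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": "data.csv"}],
        },
        {
            # Data entity: a file in the crate, linked to its author
            "@id": "data.csv",
            "@type": "File",
            "author": {"@id": "#alice"},
        },
        {
            # Contextual entity: a person, not a file
            "@id": "#alice",
            "@type": "Person",
            "name": "Alice Doe",
        },
    ],
}

# The root dataset lists its data entities by reference.
data_ids = [ref["@id"] for ref in metadata["@graph"][1]["hasPart"]]
print(json.dumps(metadata, indent=2))
```

Note that entities refer to each other only through `{"@id": ...}` references; the graph itself is a flat list.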
<p>Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create placeholder files for the documents:</p>


In [None]:
import os

data_dir = "exp"
os.makedirs(data_dir, exist_ok=True)  # safe to re-run the cell

for filename in ["paper.pdf", "results.csv", "diagram.svg"]:
    with open(os.path.join(data_dir, filename), "w") as file:
        pass

<p>Let’s make an RO-Crate to represent this information:</p>


In [None]:
from rocrate.rocrate import ROCrate

crate = ROCrate()
paper = crate.add_file("exp/paper.pdf", properties={
    "name": "manuscript",
    "encodingFormat": "application/pdf"
})
table = crate.add_file("exp/results.csv", properties={
    "name": "experimental data",
    "encodingFormat": "text/csv"
})
diagram = crate.add_file("exp/diagram.svg", dest_path="images/figure.svg", properties={
    "name": "bar chart",
    "encodingFormat": "image/svg+xml"
})

<p>We’ve started by adding the data entities. Now we add contextual entities representing Alice and Bob:</p>


In [None]:
from rocrate.model.person import Person

alice_id = "https://orcid.org/0000-0000-0000-0000"
bob_id = "https://orcid.org/0000-0000-0000-0001"
alice = crate.add(Person(crate, alice_id, properties={
    "name": "Alice Doe",
    "affiliation": "University of Flatland"
}))
bob = crate.add(Person(crate, bob_id, properties={
    "name": "Bob Doe",
    "affiliation": "University of Flatland"
}))

<p>At this point, we have a representation of the various entities. Now we need to express the relationships between them. This is done by adding properties that reference other entities:</p>


In [None]:
paper["author"] = [alice, bob]
table["author"] = alice
diagram["author"] = bob

<p>You can also add whole directories together with their contents. In an RO-Crate, a directory is represented by the <code style="color: inherit">Dataset</code> entity:</p>


In [None]:
logs_dir = os.path.join(data_dir, "logs")
os.makedirs(logs_dir, exist_ok=True)  # safe to re-run the cell

for filename in ["log1.txt", "log2.txt"]:
    with open(os.path.join(logs_dir, filename), "w") as file:
        pass

logs = crate.add_dataset("exp/logs")

<p>Finally, we serialize the crate to disk:</p>


In [None]:
crate.write("exp_crate")

<p>This should generate an <code style="color: inherit">exp_crate</code> directory containing copies of all the files we added and an <code style="color: inherit">ro-crate-metadata.json</code> file containing a <a href="https://json-ld.org">JSON-LD</a> representation of the metadata. Note that we have chosen a different destination path for the diagram, while the paper and the spreadsheet have been placed at the top level with their names unchanged (the default).</p>
<p>Some applications and services support RO-Crates stored as archives. To save the crate in zip format, you can use <code style="color: inherit">write_zip</code>:</p>


In [None]:
crate.write_zip("exp_crate.zip")
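If you want to check what ended up in the archive, the standard library is enough. The sketch below assumes the metadata file sits at the top level of the zip under the name `ro-crate-metadata.json`; the helper name `summarize_crate_zip` is ours, not part of ro-crate-py.

```python
import json
import os
import zipfile


def summarize_crate_zip(zip_path):
    """Return (sorted member names, number of entities in the metadata graph)."""
    with zipfile.ZipFile(zip_path) as zf:
        member_names = sorted(zf.namelist())
        # Assumption: the metadata file is stored at the top level of the archive.
        with zf.open("ro-crate-metadata.json") as f:
            crate_metadata = json.load(f)
    return member_names, len(crate_metadata["@graph"])


# Only runs if the archive written in the previous cell exists.
if os.path.exists("exp_crate.zip"):
    member_names, n_entities = summarize_crate_zip("exp_crate.zip")
    print(member_names)
    print(n_entities, "entities described in the metadata")
```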

<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-how-code-style-quot-color-inherit-quot-rocrate-code-handles-the-contents-of-code-style-quot-color-inherit-quot-exp-logs-code"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: How <code style="color: inherit">rocrate</code> handles the contents of <code style="color: inherit">exp/logs</code></div>
<p>Exploring the <code style="color: inherit">exp_crate</code> directory, we see that all files and directories contained in <code style="color: inherit">exp/logs</code> have been added recursively to the crate. However, in the <code style="color: inherit">ro-crate-metadata.json</code> file, only the top-level Dataset with <code style="color: inherit">@id</code> <code style="color: inherit">"exp/logs"</code> is listed. This is because we used <code style="color: inherit">crate.add_dataset("exp/logs")</code> rather than adding every file individually. There is no requirement to represent every file and folder within the crate in the <code style="color: inherit">ro-crate-metadata.json</code> file; in fact, for a crate with many files it would be impractical to do so.</p>
<p>If you do want to add files and directories recursively to the metadata, use <code style="color: inherit">crate.add_tree</code> instead of <code style="color: inherit">crate.add_dataset</code> (but note that it only works on local directory trees).</p>
</blockquote>
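To see which files are present on disk but not described in the metadata, you can compare the directory contents with the `@id` values in the graph. This is a rough sketch using only the standard library. The helper name `files_not_in_metadata` is hypothetical, and it assumes that local data entities use paths relative to the crate root as their `@id`.

```python
import json
import os


def files_not_in_metadata(crate_dir):
    """List files under crate_dir that have no entity in the metadata graph."""
    with open(os.path.join(crate_dir, "ro-crate-metadata.json")) as f:
        graph = json.load(f)["@graph"]
    listed = {entity["@id"] for entity in graph}
    missing = []
    for root, _dirs, files in os.walk(crate_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), crate_dir)
            rel = rel.replace(os.sep, "/")  # @id values use forward slashes
            if rel not in listed:
                missing.append(rel)
    return sorted(missing)


# Only runs if the crate written earlier in this tutorial exists.
if os.path.isdir("exp_crate"):
    print(files_not_in_metadata("exp_crate"))
```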
<h2 id="appending-elements-to-property-values">Appending elements to property values</h2>
<p>What ro-crate-py entities actually store is their JSON representation:</p>


In [None]:
paper.properties()

<div class="language-json highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="p">{</span><span class="w">
</span><span class="nl">"@id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"paper.pdf"</span><span class="p">,</span><span class="w">
</span><span class="nl">"@type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"File"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"manuscript"</span><span class="p">,</span><span class="w">
</span><span class="nl">"encodingFormat"</span><span class="p">:</span><span class="w"> </span><span class="s2">"application/pdf"</span><span class="p">,</span><span class="w">
</span><span class="nl">"author"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="nl">"@id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://orcid.org/0000-0000-0000-0000"</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"@id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://orcid.org/0000-0000-0000-0001"</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>When <code style="color: inherit">paper["author"]</code> is accessed, a new list containing the <code style="color: inherit">alice</code> and <code style="color: inherit">bob</code> entities is generated on the fly. For this reason, calling <code style="color: inherit">append</code> on <code style="color: inherit">paper["author"]</code> won’t actually modify the <code style="color: inherit">paper</code> entity in any way. To add an author, use the <code style="color: inherit">append_to</code> method instead:</p>


In [None]:
donald = crate.add(Person(crate, "https://en.wikipedia.org/wiki/Donald_Duck", properties={
  "name": "Donald Duck"
}))
paper.append_to("author", donald)

<p>Note that <code style="color: inherit">append_to</code> also works if the property to be updated is missing or has only one value:</p>


In [None]:
for n in "Mickey_Mouse", "Scrooge_McDuck":
    p = crate.add(Person(crate, f"https://en.wikipedia.org/wiki/{n}"))
    donald.append_to("follows", p)
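The behavior described in this section can be illustrated with a toy model. The `ToyEntity` class below is a hypothetical stand-in written for this explanation, not the ro-crate-py implementation. It stores only JSON references and rebuilds the list of entities on every access, which is why `append` on the returned list has no lasting effect.

```python
class ToyEntity:
    """Hypothetical stand-in for an entity; not the ro-crate-py implementation."""

    def __init__(self, identifier, registry):
        self.id = identifier
        self._jsonld = {"@id": identifier}
        self._registry = registry  # maps @id -> ToyEntity
        registry[identifier] = self

    def __setitem__(self, key, value):
        # Store references, not the entity objects themselves.
        if isinstance(value, list):
            self._jsonld[key] = [{"@id": v.id} for v in value]
        else:
            self._jsonld[key] = {"@id": value.id}

    def __getitem__(self, key):
        stored = self._jsonld[key]
        if isinstance(stored, list):
            # A brand-new list is built on every access.
            return [self._registry[ref["@id"]] for ref in stored]
        return self._registry[stored["@id"]]

    def append_to(self, key, value):
        stored = self._jsonld.get(key, [])  # a missing property starts empty
        if not isinstance(stored, list):
            stored = [stored]  # promote a single value to a list
        self._jsonld[key] = stored + [{"@id": value.id}]


toy_registry = {}
toy_paper = ToyEntity("paper.pdf", toy_registry)
toy_alice = ToyEntity("#alice", toy_registry)
toy_bob = ToyEntity("#bob", toy_registry)
toy_donald = ToyEntity("#donald", toy_registry)

toy_paper["author"] = [toy_alice, toy_bob]
toy_paper["author"].append(toy_donald)     # modifies a temporary list only
count_after_append = len(toy_paper["author"])

toy_paper.append_to("author", toy_donald)  # updates the stored JSON
count_after_append_to = len(toy_paper["author"])
print(count_after_append, count_after_append_to)  # prints: 2 3
```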

<h2 id="adding-remote-entities">Adding remote entities</h2>
<p>Data entities can also be remote:</p>


In [None]:
input_data = crate.add_file("http://example.org/exp_data.zip")

<p>By default the file won’t be downloaded, and will be referenced by its URI in <code style="color: inherit">ro-crate-metadata.json</code>:</p>
<div class="language-json highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="p">{</span><span class="w">
</span><span class="nl">"@id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://example.org/exp_data.zip"</span><span class="p">,</span><span class="w">
</span><span class="nl">"@type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"File"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>If you add <code style="color: inherit">fetch_remote=True</code> to the <code style="color: inherit">add_file</code> call, however, the library (when <code style="color: inherit">crate.write</code> is called) will try to download the file and include it in the output crate.</p>
<p>Another option that influences the behavior when dealing with remote entities is <code style="color: inherit">validate_url</code>, which is also <code style="color: inherit">False</code> by default: if it’s set to <code style="color: inherit">True</code>, the library will try to open the URL when the crate is serialized, in order to add or update metadata such as the content’s length and format.</p>
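When reading a metadata file yourself, you may want to tell remote data entities apart from local ones. A simple heuristic, sketched here with the standard library, is to look at the URL scheme of the `@id`. The helper `is_remote` and the set of schemes are our own assumptions, not something ro-crate-py exposes under this name.

```python
from urllib.parse import urlparse


def is_remote(entity_id):
    """Heuristic: treat ids with a network URL scheme as remote."""
    return urlparse(entity_id).scheme in {"http", "https", "ftp"}


example_ids = ["paper.pdf", "images/figure.svg", "http://example.org/exp_data.zip"]
remote_ids = [i for i in example_ids if is_remote(i)]
print(remote_ids)  # prints: ['http://example.org/exp_data.zip']
```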
<h2 id="adding-entities-with-an-arbitrary-type">Adding entities with an arbitrary type</h2>
<p>An entity can be of any type listed in the <a href="https://www.researchobject.org/ro-crate/1.1/context.jsonld">RO-Crate context</a>. However, only a few of them have a counterpart (e.g., <code style="color: inherit">File</code>) in the library’s class hierarchy, either because they are very common or because they are associated with specific functionality that can be conveniently embedded in the class implementation. In other cases, you can explicitly pass the type via the <code style="color: inherit">properties</code> argument:</p>


In [None]:
from rocrate.model.contextentity import ContextEntity

hackathon = crate.add(ContextEntity(crate, "#bh2021", properties={
    "@type": "Hackathon",
    "name": "Biohackathon 2021",
    "location": "Barcelona, Spain",
    "startDate": "2021-11-08",
    "endDate": "2021-11-12"
}))

<p>Note that entities can have multiple types, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">    "@type": ["File", "SoftwareSourceCode"]
</code></pre></div></div>
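Because `@type` can be either a single string or a list, code that inspects raw metadata should handle both forms. The helpers below are a small sketch with hypothetical names, and `script.py` is a made-up example entity.

```python
def entity_types(entity):
    """Return the entity's @type as a list, whether stored as a string or a list."""
    value = entity.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def has_type(entity, type_name):
    """Check whether type_name is among the entity's types."""
    return type_name in entity_types(entity)


plain_file = {"@id": "results.csv", "@type": "File"}
source_file = {"@id": "script.py", "@type": ["File", "SoftwareSourceCode"]}
print(has_type(plain_file, "File"), has_type(source_file, "SoftwareSourceCode"))
# prints: True True
```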
<h1 id="consuming-an-ro-crate">Consuming an RO-Crate</h1>
<p>An existing RO-Crate package can be loaded from a directory or zip file:</p>


In [None]:
crate = ROCrate('exp_crate')  # or ROCrate('exp_crate.zip')
for e in crate.get_entities():
    print(e.id, e.type)

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">./ Dataset
ro-crate-metadata.json CreativeWork
paper.pdf File
results.csv File
images/figure.svg File
https://orcid.org/0000-0000-0000-0000 Person
https://orcid.org/0000-0000-0000-0001 Person
...
</code></pre></div></div>
<p>The first two entities shown in the output are the <a href="https://www.researchobject.org/ro-crate/1.1/root-data-entity.html">root data entity</a> and the <a href="https://www.researchobject.org/ro-crate/1.1/metadata.html">metadata file descriptor</a>, respectively. The former represents the whole crate, while the latter represents the metadata file. These are special entities managed by the <code style="color: inherit">ROCrate</code> object, and are always present. The other entities are the ones we added in the <a href="#creating-an-ro-crate">section on RO-Crate creation</a>. As shown above, <code style="color: inherit">get_entities</code> lets you iterate over all entities in the crate. You can also access only data entities with <code style="color: inherit">crate.data_entities</code> and only contextual entities with <code style="color: inherit">crate.contextual_entities</code>. For instance:</p>


In [None]:
for e in crate.data_entities:
    author = e.get("author")
    if not author:
        continue
    elif isinstance(author, list):
        print(e.id, [p.get("name") for p in author])
    else:
        print(e.id, repr(author.get("name")))

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">paper.pdf ['Alice Doe', 'Bob Doe']
results.csv 'Alice Doe'
images/figure.svg 'Bob Doe'
</code></pre></div></div>
<p>You can fetch an entity by its <code style="color: inherit">@id</code> as follows:</p>


In [None]:
article = crate.dereference("paper.pdf")  # or crate.get("paper.pdf")
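Conceptually, dereferencing is a lookup of an `@id` in the metadata graph. The sketch below shows the idea on plain JSON with the standard library. The function names are hypothetical and the graph is a hand-written excerpt mirroring the entities created earlier, not output produced by the library.

```python
def build_index(graph):
    """Map each @id to its JSON-LD entity for direct lookup."""
    return {entity["@id"]: entity for entity in graph}


def author_names(entity, index):
    """Resolve the entity's author reference(s) and return their names."""
    refs = entity.get("author", [])
    if isinstance(refs, dict):
        refs = [refs]  # a single reference is stored as a bare object
    return [index[ref["@id"]].get("name") for ref in refs]


# Hand-written excerpt mirroring the entities created earlier in this tutorial.
sample_graph = [
    {"@id": "paper.pdf", "@type": "File", "author": [
        {"@id": "https://orcid.org/0000-0000-0000-0000"},
        {"@id": "https://orcid.org/0000-0000-0000-0001"},
    ]},
    {"@id": "results.csv", "@type": "File",
     "author": {"@id": "https://orcid.org/0000-0000-0000-0000"}},
    {"@id": "https://orcid.org/0000-0000-0000-0000", "@type": "Person", "name": "Alice Doe"},
    {"@id": "https://orcid.org/0000-0000-0000-0001", "@type": "Person", "name": "Bob Doe"},
]
sample_index = build_index(sample_graph)
print(author_names(sample_index["paper.pdf"], sample_index))    # ['Alice Doe', 'Bob Doe']
print(author_names(sample_index["results.csv"], sample_index))  # ['Alice Doe']
```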

<h1 id="command-line-interface">Command Line Interface</h1>
<blockquote class="comment" style="border: 2px solid #ffecc1; margin: 1em 0.2em">
<div class="box-title comment-title" id="comment-jupyter-notebook-users-switch-to-a-terminal"><i class="far fa-comment-dots" aria-hidden="true" ></i> Comment: Jupyter Notebook users: switch to a terminal</div>
<p>The code cells in this section use Unix shell commands, which can’t be run within a notebook. Open a Unix/Linux terminal to follow along.</p>
</blockquote>
<p><code style="color: inherit">ro-crate-py</code> includes a hierarchical command line interface: the <code style="color: inherit">rocrate</code> tool. <code style="color: inherit">rocrate</code> is the top-level command, while specific functionalities are provided via sub-commands. Currently, the tool lets you initialize a directory tree as an RO-Crate (<code class="language-plaintext highlighter-rouge">rocrate init</code>) and modify the metadata of an existing RO-Crate (<code class="language-plaintext highlighter-rouge">rocrate add</code>). The help output also lists a <code class="language-plaintext highlighter-rouge">write-zip</code> sub-command.</p>
<div class="language-console highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="gp">&#36;</span><span class="w"> </span>rocrate <span class="nt">--help</span>
<span class="go">Usage: rocrate [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add
  init
  write-zip
</span></code></pre></div></div>
<h2 id="crate-initialization">Crate initialization</h2>
<p>The <code style="color: inherit">rocrate init</code> command explores a directory tree and generates an RO-Crate metadata file (<code class="language-plaintext highlighter-rouge">ro-crate-metadata.json</code>) listing all files and directories as <code style="color: inherit">File</code> and <code style="color: inherit">Dataset</code> entities, respectively.</p>
<div class="language-console highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="gp">&#36;</span><span class="w"> </span>rocrate init <span class="nt">--help</span>
<span class="go">Usage: rocrate init [OPTIONS]

Options:
  --gen-preview
  -e, --exclude CSV
  -c, --crate-dir PATH
  --help                Show this message and exit.
</span></code></pre></div></div>
<p>The command acts on the current directory, unless the <code style="color: inherit">-c</code> option is specified. The metadata file is added (overwritten if present) to the directory at the top level, turning it into an RO-Crate.</p>
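The kind of listing that such a directory walk produces can be sketched in Python as follows. This is only an illustration of the idea, not the tool's implementation: the function name `describe_tree` is hypothetical, and details such as exclusions or skipping the metadata file itself are ignored. Directory identifiers end with a slash, as the RO-Crate specification recommends for `Dataset` entities.

```python
import os


def describe_tree(top):
    """Return JSON-LD-style stubs for every directory and file below top."""
    entities = []
    for root, dirs, files in os.walk(top):
        dirs.sort()  # make the traversal order deterministic
        rel_root = os.path.relpath(root, top)
        prefix = "" if rel_root == "." else rel_root.replace(os.sep, "/") + "/"
        for name in dirs:
            entities.append({"@id": f"{prefix}{name}/", "@type": "Dataset"})
        for name in sorted(files):
            entities.append({"@id": f"{prefix}{name}", "@type": "File"})
    return entities


# List the current directory, as the command does by default.
for entity in describe_tree("."):
    print(entity["@id"], entity["@type"])
```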
<h2 id="adding-items-to-the-crate">Adding items to the crate</h2>
<p>The <code style="color: inherit">rocrate add</code> command lets you add files, datasets (directories), workflows, and other entity types (currently <a href="https://crs4.github.io/life_monitor/workflow_testing_ro_crate">testing-related metadata</a>) to an RO-Crate:</p>
<div class="language-console highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="gp">&#36;</span><span class="w"> </span>rocrate add <span class="nt">--help</span>
<span class="go">Usage: rocrate add [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  dataset
  file
  test-definition
  test-instance
  test-suite
  workflow
</span></code></pre></div></div>
<p>Note that data entities (e.g., workflows) must already be present in the directory tree: the effect of the command is to register them in the metadata file.</p>
<h2 id="example">Example</h2>
<p>To run the following commands, we need a copy of the ro-crate-py repository:</p>


In [None]:
git clone https://github.com/ResearchObject/ro-crate-py

<p>Navigate to the following directory in the repository we just cloned:</p>


In [None]:
cd ro-crate-py/test/test-data/ro-crate-galaxy-sortchangecase

<p>This directory is already an RO-Crate. Delete the metadata file to get a plain directory tree:</p>


In [None]:
rm ro-crate-metadata.json

<p>Now the directory tree contains several files and directories, including a Galaxy workflow and a Planemo test file, but it’s not an RO-Crate anymore, since there is no metadata file. Initialize the crate:</p>


In [None]:
rocrate init

<p>This creates an <code style="color: inherit">ro-crate-metadata.json</code> file that lists files and directories rooted at the current directory. Note that the Galaxy workflow is listed as a plain <code style="color: inherit">File</code>:</p>
<div class="language-json highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit"><span class="p">{</span><span class="w">
</span><span class="nl">"@id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sort-and-change-case.ga"</span><span class="p">,</span><span class="w">
</span><span class="nl">"@type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"File"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>To register the workflow as a <code style="color: inherit">ComputationalWorkflow</code>, run the following:</p>


In [None]:
rocrate add workflow -l galaxy sort-and-change-case.ga

<p>Now the workflow has a type of <code style="color: inherit">["File", "SoftwareSourceCode", "ComputationalWorkflow"]</code> and points to a <code style="color: inherit">ComputerLanguage</code> entity that represents the Galaxy workflow language. Also, the workflow is listed as the crate’s <code style="color: inherit">mainEntity</code> (this is required by the <a href="https://w3id.org/workflowhub/workflow-ro-crate/1.0">Workflow RO-Crate profile</a>, a subtype of RO-Crate which provides extra specifications for workflow metadata).</p>
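To confirm the change from Python, you can read the metadata file and follow the `mainEntity` reference from the root data entity. The sketch below uses only the standard library. The helper name `find_main_entity` is hypothetical, and it assumes the root data entity is the one referenced by the metadata file descriptor's `about` property.

```python
import json
import os


def find_main_entity(graph):
    """Return the entity referenced by the root dataset's mainEntity, or None."""
    by_id = {entity["@id"]: entity for entity in graph}
    descriptor = by_id.get("ro-crate-metadata.json", {})
    # Assumption: the descriptor's "about" points at the root data entity.
    root_id = descriptor.get("about", {}).get("@id", "./")
    main_ref = by_id.get(root_id, {}).get("mainEntity")
    return by_id.get(main_ref["@id"]) if main_ref else None


# Only runs when executed from inside a crate directory.
if os.path.exists("ro-crate-metadata.json"):
    with open("ro-crate-metadata.json") as f:
        main_workflow = find_main_entity(json.load(f)["@graph"])
    if main_workflow is not None:
        print(main_workflow["@id"], main_workflow["@type"])
```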
<p>To add files or directories after crate initialization:</p>


In [None]:
cp ../sample_file.txt .
rocrate add file sample_file.txt -P name=sample -P description="Sample file"
cp -r ../test_add_dir .
rocrate add dataset test_add_dir

<p>The above example also shows how to set arbitrary properties for the entity with the <code style="color: inherit">-P</code> option. This is supported by most <code style="color: inherit">rocrate add</code> subcommands.</p>


# Key Points

- RO-Crates can be created by hand, with essentially arbitrary data, using the rocrate Python module.
- The rocrate command line tool provides commands that make it easier to generate crates automatically from existing folder structures.

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/fair/tutorials/ro-crate-in-python/tutorial.html#feedback) and check there for further resources!
