Phenotypes to Semantic Web publication

Sharing, and especially reusing, complex data like phenotypes is a good use case for a publication following Semantic Web principles. This page describes how to generate a phenotype RDF dataset from a MIAPPE-compliant Breeding API (BrAPI) endpoint. The growing adoption of this API will facilitate the reuse of such a transformation process.

Recommendations

The RDF publication relies, among other things, on the correct identification of the different resources through Permanent Unique Identifiers (PUIs), either URIs or DOIs. Adding a correct set of URIs to the key resources is therefore mandatory. However, looser PUIs are acceptable for some resources, e.g. dataset unique IDs (DUIDs), whose uniqueness is only ensured within the dataset namespace. The resources that need unique identifiers are listed below (a sketch checking them follows the list):

  • Dataset. Mandatory. Ideally a DOI registered at DataCite.
  • Studies. Mandatory. A simple URI, not necessarily dereferenceable.
  • Observation variables. Mandatory. URIs or IDs taken from ontologies.
  • Plant material/Germplasm. Mandatory. A simple URI, not necessarily dereferenceable, or ideally a DOI following the MCPD recommendations.
  • Treatment. Not mandatory; a dataset DUID, which can be generated on the fly.
  • Observation unit, such as plots or individuals in the experiment. Not mandatory; a dataset DUID, which can be generated on the fly.
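
A minimal sketch of such a check, in Python, is given below. The field names ("datasetPUI", "studyPUI", "germplasmPUI", ...) are assumptions for illustration, not a fixed schema, and the function is not part of the ETL.

# Hypothetical check of the mandatory identifiers listed above.
# All field names are assumptions for illustration only.
def check_mandatory_identifiers(dataset):
    """Return a list of problems to fix before RDF publication."""
    problems = []
    if not dataset.get("datasetPUI"):
        problems.append("Dataset has no PUI (ideally a DataCite DOI)")
    for study in dataset.get("studies", []):
        if not study.get("studyPUI"):
            problems.append("Study %s has no URI" % study.get("studyDbId"))
    for germplasm in dataset.get("germplasm", []):
        if not germplasm.get("germplasmPUI"):
            problems.append("Germplasm %s has no PUI" % germplasm.get("germplasmDbId"))
    return problems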

Data transformation

The full documentation of the transformation tool is available at github.com/gnpis.

JSON Extraction process

This ETL first extracts BrAPI data following a hierarchical pattern described in the config.json file, under the "brapi_calls" section. In this configuration, each BrAPI entity can have "children" entities that are extracted using identifiers contained in the parent entity.

The result of this extraction is a set of JSON files containing the whole dataset.

For example, when extracting a study, the location of that study can be extracted using the location details call and the location identifier found in the study JSON data. This hierarchical extraction can be used to extract only a specific dataset from a BrAPI endpoint and helps check the links between BrAPI JSON objects using their identifiers.
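
As an illustration, the sketch below shows what a "brapi_calls" section nesting locations under studies could look like. The key names are assumptions; the authoritative structure is described in the repository documentation.

# Illustrative "brapi_calls" hierarchy: studies are extracted first, then
# the location of each study is fetched using the identifier found in the
# parent study object. Key names are assumptions for illustration.
import json

config = {
    "brapi_calls": {
        "studies": {
            "children": {
                "locations": {"id-field": "locationDbId"}
            }
        }
    }
}
print(json.dumps(config, indent=2))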

RDF Transformation process

This step transforms the extracted JSON into RDF.

The transformation process is composed of two distinct steps. First, JSON-LD annotations and generated URIs are added to the BrAPI JSON. Second, the resulting JSON-LD files are converted into RDF Turtle, which can easily be loaded into Virtuoso.
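
A minimal sketch of the first step (JSON-LD annotation) is given below; the context URL, type name and URI pattern are placeholders, not the actual values used by the ETL.

# Hypothetical JSON-LD annotation of a BrAPI study object.
def annotate_study(study_json, context_url, base_url):
    annotated = dict(study_json)
    annotated["@context"] = context_url  # placeholder context URL
    annotated["@type"] = "Study"         # placeholder type name
    # Reuse the PUI when the endpoint provides one, otherwise build a URI
    # from the endpoint URL (see "PUI and DUID handling" below).
    annotated["@id"] = study_json.get("studyPUI") or \
        "{}/studies/{}".format(base_url, study_json["studyDbId"])
    return annotated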

PUI and DUID handling

The transformation of BrAPI data to RDF can be difficult because URIs are rarely used to identify BrAPI data. To solve this problem, two solutions are available in this ETL: either the BrAPI endpoint provides a PUI field for its data, or the PUI is generated from the BrAPI URL, the entity name and the entity identifier.

For the PUI field, we follow the convention used in the “germplasm” calls.

For locations, you can provide a “locationPUI” field; for studies, a “studyPUI” field; and so on.

If no PUI field is provided, the URI is generated using the following pattern: {brapi_url}/{brapi_entity}/{entity_id}.

For example, if the location “foo42” is extracted from the “URGI” institution, the generated URI will look like: https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1/locations/foo42.
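
The rule above can be summarised with the following sketch; the helper name and its parameters are illustrative and not part of the ETL.

# Hypothetical helper implementing the URI generation pattern
# {brapi_url}/{brapi_entity}/{entity_id} described above.
def entity_uri(brapi_url, entity_name, entity_id, pui=None):
    """Return the endpoint-provided PUI when present, otherwise a generated URI."""
    if pui:
        return pui
    return "{}/{}/{}".format(brapi_url.rstrip("/"), entity_name, entity_id)

# Example matching the "foo42" location above:
# entity_uri("https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1",
#            "locations", "foo42")
# -> "https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1/locations/foo42"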

RDF conversion

This conversion is achieved by applying the JSON-LD context, which is maintained and fully described here. In order to easily integrate the BrAPI RDF data, the last transformation step converts the JSON-LD files into the RDF Turtle format. This is a straightforward conversion from one RDF serialization to another.
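
As a hedged sketch, this conversion could be done with rdflib (recent versions ship a built-in "json-ld" parser; older ones need the rdflib-jsonld plugin); the file names below are placeholders.

# Sketch of the JSON-LD to Turtle conversion; file names are placeholders.
from rdflib import Graph

graph = Graph()
graph.parse("study.jsonld", format="json-ld")
graph.serialize(destination="study.ttl", format="turtle")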

The OWL model defining all the RDF classes and properties is also converted to RDF Turtle so that it can be integrated alongside the actual data.

Once generated, the RDF data can be loaded into a SPARQL endpoint with the help of the BrAPI-extract-index-prototype toolkit, which allows executing the whole transformation process:

# Extract brapi data
python2 main.py extract brapi

# Transform brapi data to RDF (with JSON-LD as an intermediary format)
python2 main.py transform rdf

# Index data in Virtuoso
python2 main.py load virtuoso

The latest version of the full documentation is maintained on GitHub.