Phenotypes to Semantic Web publication

Sharing, and especially reusing, complex data like phenotypes is a good use case for a publication following Semantic Web principles. This page describes how to generate a phenotype RDF dataset from a MIAPPE-compliant Breeding API (BrAPI) endpoint. The growing adoption of this API will facilitate the reuse of such a transformation process.

Recommendations

The RDF publication relies, among other things, on the correct identification of the different resources through Permanent Unique Identifiers (PUIs), either URIs or DOIs. Adding a correct set of URIs to the key resources is therefore mandatory. However, looser PUIs are acceptable for some resources, e.g. dataset unique IDs (DUIDs), whose uniqueness is only ensured within the dataset namespace. The resources that need unique identifiers are listed below (a sketch checking them follows the list):

  • Dataset. Mandatory. Ideally a DOI registered at DataCite.
  • Studies. Mandatory. A simple URI, not necessarily dereferenceable.
  • Observation variables. Mandatory. URIs or IDs taken from ontologies.
  • Plant material/Germplasm. Mandatory. A simple URI, not necessarily dereferenceable, or ideally a DOI following the MCPD recommendations.
  • Treatment. Not mandatory; a dataset DUID, which can be generated on the fly.
  • Observation unit, such as plots or individuals in the experiment. Not mandatory; a dataset DUID, which can be generated on the fly.
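
A minimal sketch of such a check, in Python, is given below. The field names ("datasetPUI", "studyPUI", "germplasmPUI", ...) are assumptions for illustration, not a fixed schema, and the function is not part of the ETL.

# Hypothetical check of the mandatory identifiers listed above.
# All field names are assumptions for illustration only.
def check_mandatory_identifiers(dataset):
    """Return a list of problems to fix before RDF publication."""
    problems = []
    if not dataset.get("datasetPUI"):
        problems.append("Dataset has no PUI (ideally a DataCite DOI)")
    for study in dataset.get("studies", []):
        if not study.get("studyPUI"):
            problems.append("Study %s has no URI" % study.get("studyDbId"))
    for germplasm in dataset.get("germplasm", []):
        if not germplasm.get("germplasmPUI"):
            problems.append("Germplasm %s has no PUI" % germplasm.get("germplasmDbId"))
    return problems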

Data transformation

The full documentation of the transformation tool is available at github.com/gnpis.

JSON Extraction process

This ETL first extracts BrAPI data following a hierarchical pattern described in the config.json file, under the "brapi_calls" section. In this configuration, each BrAPI entity can have "children" entities that are extracted using identifiers contained in the parent entity.

The result of this extraction is a set of JSON files containing the whole dataset.

For example, when extracting a study, the location of that study can be extracted using the location details call and the location identifier found in the study JSON data. This hierarchical extraction can be used to extract only a specific dataset from a BrAPI endpoint and helps check the links between BrAPI JSON objects using their identifiers.
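
As an illustration, the sketch below shows what a "brapi_calls" section nesting locations under studies could look like. The key names are assumptions; the authoritative structure is described in the repository documentation.

# Illustrative "brapi_calls" hierarchy: studies are extracted first, then
# the location of each study is fetched using the identifier found in the
# parent study object. Key names are assumptions for illustration.
import json

config = {
    "brapi_calls": {
        "studies": {
            "children": {
                "locations": {"id-field": "locationDbId"}
            }
        }
    }
}
print(json.dumps(config, indent=2))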

RDF Transformation process

This step transforms the extracted JSON into RDF.

The transformation process is composed of two distinct steps. First, JSON-LD annotations and generated URIs are added to the BrAPI JSON. Second, the resulting JSON-LD files are converted into RDF Turtle, which can easily be loaded into Virtuoso.
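
A minimal sketch of the first step (JSON-LD annotation) is given below; the context URL, type name and URI pattern are placeholders, not the actual values used by the ETL.

# Hypothetical JSON-LD annotation of a BrAPI study object.
def annotate_study(study_json, context_url, base_url):
    annotated = dict(study_json)
    annotated["@context"] = context_url  # placeholder context URL
    annotated["@type"] = "Study"         # placeholder type name
    # Reuse the PUI when the endpoint provides one, otherwise build a URI
    # from the endpoint URL (see "PUI and DUID handling" below).
    annotated["@id"] = study_json.get("studyPUI") or \
        "{}/studies/{}".format(base_url, study_json["studyDbId"])
    return annotated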

PUI and DUID handling

The transformation of BrAPI data to RDF can be difficult because URIs are rarely used to identify BrAPI data. To solve this problem, two solutions are available in this ETL: either the BrAPI endpoint provides a PUI field for its data, or the PUI is generated from the BrAPI URL, the entity name and the entity identifier.

For the PUI field, we follow the convention used in the “germplasm” calls.

For locations, you can provide a “locationPUI” field; for studies, a “studyPUI” field; and so on.

If no PUI field is provided, the URI is generated using the following pattern: {brapi_url}/{brapi_entity}/{entity_id}.

For example, if the location “foo42” is extracted from the “URGI” institution, the generated URI will look like: https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1/locations/foo42.
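
The rule above can be summarised with the following sketch; the helper name and its parameters are illustrative and not part of the ETL.

# Hypothetical helper implementing the URI generation pattern
# {brapi_url}/{brapi_entity}/{entity_id} described above.
def entity_uri(brapi_url, entity_name, entity_id, pui=None):
    """Return the endpoint-provided PUI when present, otherwise a generated URI."""
    if pui:
        return pui
    return "{}/{}/{}".format(brapi_url.rstrip("/"), entity_name, entity_id)

# Example matching the "foo42" location above:
# entity_uri("https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1",
#            "locations", "foo42")
# -> "https://urgi.versailles.inra.fr/GnpISCore-srv/brapi/v1/locations/foo42"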

RDF conversion

This conversion is achieved by applying the JSON-LD context, which is maintained and fully described here. In order to easily integrate the BrAPI RDF data, the last transformation step converts the JSON-LD files into the RDF Turtle format. This is a straightforward conversion from one RDF serialization to another.
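
As a hedged sketch, this conversion could be done with rdflib (recent versions ship a built-in "json-ld" parser; older ones need the rdflib-jsonld plugin); the file names below are placeholders.

# Sketch of the JSON-LD to Turtle conversion; file names are placeholders.
from rdflib import Graph

graph = Graph()
graph.parse("study.jsonld", format="json-ld")
graph.serialize(destination="study.ttl", format="turtle")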

The OWL model defining all the RDF classes and properties is also converted to RDF Turtle so that it can be integrated alongside the actual data.

Once generated, the RDF data can be loaded into a SPARQL endpoint with the help of the BrAPI-extract-index-prototype toolkit, which allows executing the whole transformation process:

# Extract brapi data
python2 main.py extract brapi

# Transform brapi data to RDF (with JSON-LD as an intermediary format)
python2 main.py transform rdf

# Index data in Virtuoso
python2 main.py load virtuoso

The latest version of the full documentation is maintained on GitHub.