Apr 12, 2021

OmicsDI Data Format Specification

The Omics Discovery Index (OmicsDI) expects data from providers in a common XML format^*.

The architecture of OmicsDI starts with an XML file that contains the information from all datasets in a given database. XML files are retrieved from providers nightly, and every new dataset in the provided XML file is added to OmicsDI automatically.

Each file in OmicsDI is indexed using the EBI Search System, and the resulting information is made available via web services. The EBI Search System also indexes other major databases such as UniProt, Ensembl and PubMed, allowing data providers to cross-link biological entities in their datasets with those resources.

^*For any queries about the OmicsDI XML format or data submissions to OmicsDI, please contact:
omicsdi-support@ebi.ac.uk

OmicsDI XML: High Level Structure

The OmicsDI XML is used to represent metadata of any database (including all of its datasets) via the following common generic structure:


<database>
  <name>Database Name</name>
  <description>Database Description</description>
  <release>Release tag or number</release>
  <release_date>Release date</release_date>
  <entry_count>Number of entries</entry_count>
  <entries>
     <entry id="Dataset_ID_1">
       <name>Name of the Dataset</name>
       <description>Description of the dataset</description>
       <cross_references>
         <ref dbkey="CHEBI:16551" dbname="ChEBI"/>
         <ref dbkey="MTBLC16551" dbname="MetaboLights"/>
         <ref dbkey="CHEBI:16810" dbname="ChEBI"/>
       </cross_references>
       <dates>
         <date type="submission" value="2013-11-19"/>
         <date type="publication" value="2013-11-26"/>
       </dates>
       <additional_fields>
         <field name="repository">Repository</field>
         <field name="omics_type">Omics Type</field>
       </additional_fields>       
     </entry>
  </entries>
</database>

The cross-references section links the dataset to external databases. The dbkey attribute contains the identifier in the linked database, which is itself identified by the dbname attribute.
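As an illustration of the generic structure above, the following is a minimal Python sketch that assembles such a document with the standard library's xml.etree.ElementTree. The function name build_database and the dictionary keys it reads (id, name, description, refs, dates, fields) are assumptions made for this example only; they are not part of the OmicsDI specification or of any OmicsDI tooling.

```python
import xml.etree.ElementTree as ET


def build_database(name, description, release, release_date, entries):
    """Build a <database> element following the generic structure.

    `entries` is a list of dicts; the keys used here are this sketch's
    own convention, not something defined by OmicsDI.
    """
    db = ET.Element("database")
    ET.SubElement(db, "name").text = name
    ET.SubElement(db, "description").text = description
    ET.SubElement(db, "release").text = release
    ET.SubElement(db, "release_date").text = release_date
    # entry_count is derived from the data so it cannot drift out of sync.
    ET.SubElement(db, "entry_count").text = str(len(entries))

    entries_el = ET.SubElement(db, "entries")
    for e in entries:
        entry = ET.SubElement(entries_el, "entry", {"id": e["id"]})
        ET.SubElement(entry, "name").text = e["name"]
        ET.SubElement(entry, "description").text = e["description"]

        xrefs = ET.SubElement(entry, "cross_references")
        for dbname, dbkey in e.get("refs", []):
            ET.SubElement(xrefs, "ref", {"dbkey": dbkey, "dbname": dbname})

        dates = ET.SubElement(entry, "dates")
        for date_type, value in e.get("dates", []):
            ET.SubElement(dates, "date", {"type": date_type, "value": value})

        fields = ET.SubElement(entry, "additional_fields")
        for field_name, value in e.get("fields", []):
            ET.SubElement(fields, "field", {"name": field_name}).text = value
    return db


example = build_database(
    "Database Name", "Database Description", "Release tag", "2013-11-26",
    [{
        "id": "Dataset_ID_1",
        "name": "Name of the Dataset",
        "description": "Description of the dataset",
        "refs": [("ChEBI", "CHEBI:16551"), ("MetaboLights", "MTBLC16551")],
        "dates": [("submission", "2013-11-19"), ("publication", "2013-11-26")],
        "fields": [("repository", "Repository"), ("omics_type", "Omics Type")],
    }],
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

ElementTree serialises empty elements such as ref and date in self-closing form, so the output is well-formed XML.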

OmicsDI defines a hierarchical metadata schema for each dataset. The scoring system in the OmicsDI search engine boosts datasets that provide more metadata, thus rewarding researchers who have annotated their results more thoroughly. This document describes the structure of OmicsDI schema version 1.0 and its metadata fields; the tables in the following sections describe each field with a comment and an example. Every field in the schema belongs to one of three categories:

Mandatory (M) : These fields must be provided for the OmicsDI schema to be valid, and are part of the minimum information required to represent a dataset in OmicsDI;
Recommended (R) : These fields should be provided to be searchable and displayed adequately in OmicsDI web interface and web services;
Additional (A) : These fields should be provided to add value to the dataset: the more metadata a dataset contains, the more sense the OmicsDI infrastructure can make of the data. For example, if the proteins, genes or metabolites are provided for each dataset, OmicsDI is able to find other datasets where those biological entities have been found or studied.

OmicsDI XML: Database Section

All the information required for inclusion in OmicsDI is contained within the database section of the XML file (see generic structure above):

Field	Comment	Example	Type
name	Name of the database or provider	<name>PRIDE</name>	M
description	A short description of the provider. This description is shown in OmicsDI web interface and can be used in OmicsDI search.	<description>The proteomics identification database is an EBI resource for Proteomics</description>	R
release	The tag for the database release to which the data belongs.	<release>Release-May-2016</release>	A
release_date	The date of the database release to which the data belongs. This field may be used to store the date the data was generated, if applicable.	<release_date>2015-05-13</release_date>	R
entry_count	The number of entries in the XML file. This field is used for validation purposes.	<entry_count>2</entry_count>	R

Providers may add further information to the database section, but it will not be captured during the indexing process, e.g. <license>Apache 2.0</license>
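Since entry_count is used for validation, a provider may want to check it before publishing a file. The sketch below is one hypothetical way to do that in Python; the function name count_entries and the sample document are assumptions for illustration, and this is not the check OmicsDI itself runs.

```python
import xml.etree.ElementTree as ET


def count_entries(xml_text):
    """Return (declared, actual) entry counts for one OmicsDI XML document.

    `declared` is None when <entry_count> is absent, since the field is
    recommended rather than mandatory.
    """
    root = ET.fromstring(xml_text)
    declared_text = root.findtext("entry_count")
    declared = int(declared_text) if declared_text is not None else None
    actual = len(root.findall("./entries/entry"))
    return declared, actual


# Sample document; the extra <license> element is simply not looked at here.
SAMPLE = """<database>
  <name>PRIDE</name>
  <entry_count>2</entry_count>
  <license>Apache 2.0</license>
  <entries>
    <entry id="PXD000001"/>
    <entry id="PXD000002"/>
  </entries>
</database>"""

declared, actual = count_entries(SAMPLE)
print("declared:", declared, "actual:", actual)
```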

Small databases can reasonably provide their data to OmicsDI as a single full-repository XML file. However, most 'omics resources contain a large number of datasets, making it impractical to exchange their data in a single file. Such resources may provide their data via multiple XML files in the same format as described above, each containing a distinct subset of dataset entries. Note that entry_count in each XML file should correspond to the number of entries in that file only, not the overall number of entries provided from that database.
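Splitting a large repository into several files can be sketched as follows. This is a minimal Python example under stated assumptions: the function name split_database and the max_entries parameter are hypothetical, and the choice to copy every other child of the database element into each part is this sketch's own.

```python
import copy
import xml.etree.ElementTree as ET


def split_database(root, max_entries):
    """Yield <database> elements holding at most `max_entries` entries each."""
    entries = root.findall("./entries/entry")
    for start in range(0, len(entries), max_entries):
        chunk = entries[start:start + max_entries]
        part = ET.Element("database")
        # Repeat the database-level information in every part.
        for child in root:
            if child.tag not in ("entries", "entry_count"):
                part.append(copy.deepcopy(child))
        # entry_count reflects this file only, not the whole repository.
        ET.SubElement(part, "entry_count").text = str(len(chunk))
        part_entries = ET.SubElement(part, "entries")
        for entry in chunk:
            part_entries.append(copy.deepcopy(entry))
        yield part


SAMPLE = ET.fromstring(
    "<database><name>Example DB</name><entry_count>5</entry_count><entries>"
    + "".join(f'<entry id="DS{i}"/>' for i in range(1, 6))
    + "</entries></database>"
)
parts = list(split_database(SAMPLE, max_entries=2))
for part in parts:
    print(ET.tostring(part, encoding="unicode"))
```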

OmicsDI XML: Entries Section

The entries section contains all the datasets provided in a given XML file. The <entries> tag is used to list all the entries. Each dataset is enclosed in an <entry> tag.

Each entry consists of three different sections: General information, Cross-references and Additional Fields.

A dataset in OmicsDI must have an identifier, a name, a description and at least one date. The table below lists these fields, including the four date types (publication, creation, submission and updated):

Field	Comment	Example	Type
id	Original and UNIQUE identifier across the repository, database or provider	<entry id="PXD000001"></entry>	M
name	Name, title of the dataset, can be considered as the title of the publication	<name>TMT spikes</name>	M
description	A short description or abstract of the dataset. It can be considered similar to a "publication abstract"	<description>Expected reporter ion ratios: Erwinia peptides</description>	M
date	Date of publication of the dataset	<date type="publication" value="2014-09-22"/>	M^*
date	Date of initial creation of dataset submission in the database	<date type="creation" value="2014-09-22"/>	M^*
date	Date of successful submission to the database	<date type="submission" value="2014-09-22"/>	M^*
date	Date of the latest update to the dataset	<date type="updated" value="2014-09-22"/>	M^*

^*Note that at least one of the date fields above must be present.
For example:


    <entries>
    <entry id="ST000004">
      <name>Lipidomics studies on NIDDK / NIST human plasma samples</name>
      <description>The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in collaboration with the National Institute of Standards (NIST) recently produced a human plasma standard reference material (SRM 1950) for metabolite analysis.
  </description>
      <dates>
        <date type="creation" value="2013-02-01"/>
        <date type="publication" value="2014-09-22"/>
        <date type="submission" value="2013-11-19"/>
        <date type="updated" value="2014-05-21"/>
      </dates>    
      ...
    </entry>
    ...
    </entries>
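The mandatory general information described above (id, name, description and at least one date) can be checked with a short Python sketch such as the one below. The function name missing_general_info is hypothetical, and the check accepts any date element that carries both a type and a value; that leniency is an assumption of this example, not a statement about how OmicsDI validates dates.

```python
import xml.etree.ElementTree as ET


def missing_general_info(entry):
    """Return the names of mandatory general fields missing from an <entry>."""
    missing = []
    if not entry.get("id"):
        missing.append("id")
    for tag in ("name", "description"):
        # Treat absent and whitespace-only elements the same way.
        if not (entry.findtext(tag) or "").strip():
            missing.append(tag)
    dates = [
        d for d in entry.findall("./dates/date")
        if d.get("type") and d.get("value")
    ]
    if not dates:
        missing.append("date")
    return missing


GOOD = ET.fromstring(
    '<entry id="ST000004">'
    "<name>Lipidomics studies on NIDDK / NIST human plasma samples</name>"
    "<description>Human plasma standard reference material.</description>"
    '<dates><date type="publication" value="2014-09-22"/></dates>'
    "</entry>"
)
BAD = ET.fromstring('<entry id="X1"><name>No description or date</name></entry>')

print(missing_general_info(GOOD))
print(missing_general_info(BAD))
```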

OmicsDI XML Entry: Additional Fields

All biological and experimental metadata for a given dataset is represented using a key-value pair structure under the tag <additional_fields>. The generic key-value pair structure allows data providers to represent any information that is necessary to represent their dataset adequately and make it discoverable via OmicsDI search.

While OmicsDI places no restriction on the content of key-value pairs, some fields have been agreed and standardised by the 'omics community and should be provided in a consistent way. It is important to note that each key-value pair holds exactly one value. To represent multiple values for the same key, repeat the field once per value, for example:


    <additional_fields>
      <field name="species">Homo sapiens</field>
      <field name="species">Mus musculus</field>
      ...
    </additional_fields>
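On the reading side, repeated fields are naturally collected into a list per key. The following Python sketch shows one way to do this; the function name collect_fields is an assumption for illustration and the sample entry is a cut-down version of the example above.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def collect_fields(entry):
    """Map each additional field name to the list of its values, in order."""
    values = defaultdict(list)
    for field in entry.findall("./additional_fields/field"):
        values[field.get("name")].append((field.text or "").strip())
    return dict(values)


ENTRY = ET.fromstring(
    '<entry id="example">'
    "<additional_fields>"
    '<field name="species">Homo sapiens</field>'
    '<field name="species">Mus musculus</field>'
    '<field name="repository">PRIDE</field>'
    "</additional_fields>"
    "</entry>"
)

fields = collect_fields(ENTRY)
print(fields)
```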

Standardised fields for OmicsDI

Field	Comment	Example	Type	Cardinality
omics_type	A category for the type of omics dataset (from a controlled vocabulary used by OmicsDI)	<field name="omics_type">Proteomics</field>	M	[1..n]
data_protocol	Description of the software, pipeline and tools used to process the data	<field name="data_protocol"></field>	R	[0..n]
sample_protocol	Description of the biological sample handling protocol	<field name="sample_protocol"></field>	R	[0..n]
repository	The name of the repository or provider^*	<field name="repository">PRIDE</field>	M	[1]
species	Species studied in the experiment that generated the data (Free Text)	<field name="species">Homo sapiens</field>	A	[0..n]
disease	Disease studied in the experiment that generated the data (Free Text)	<field name="disease">Lung carcinoma</field>	A	[0..n]
tissue	Tissue studied in the experiment that generated the data (Free Text)	<field name="tissue">Lung</field>	A	[0..n]
cell_type	Cell type studied in the experiment that generated the data (Free Text)	<field name="cell_type">brain cortex glial cell</field>	A	[0..n]
full_dataset_link	The original link of the dataset in the provider’s web service, it should be a universal URL that can be used to find the original data	<field name="full_dataset_link"> http://www.ebi.ac.uk/pride/archive/projects/PRD000123 </field>	M	[1]
submitter	Name of the person who submitted the data into the original repository	<field name="submitter">Yasset Perez-Riverol</field>	A	[0..n]
submitter_mail	Submitter’s contact email	<field name="submitter_mail">yperez@ebi.ac.uk</field>	A	[0..n]
submitter_affiliation	Submitter’s affiliation, institution, department, etc.	<field name="submitter_affiliation">European Bioinformatics Institute</field>	A	[0..n]
instrument_platform	Instrument used to analyze the experiment’s samples	<field name="instrument_platform">LTQ Orbitrap</field>	R	[0..n]
technology_type	Technique of instrumental analysis used in the experiment	<field name="technology_type">Tandem MS/MS</field>	A	[0..n]
modification	Post-translational modifications; used mainly in Proteomics experiments	<field name="modification">Oxidation</field>	A	[0..n]
submitter_keywords	Keywords describing the dataset further, in this case added by the submitter of the data	<field name="submitter_keywords">ProteoGenomics</field>	A	[0..n]
quantification_method	Free text describing the quantitative method used in the data analysis	<field name="quantification_method">SILAC</field>	A	[0..n]
submission_type	In ProteomeXchange this field is used to classify the type of submission	<field name="submission_type">COMPLETE</field>	A	[0..n]
software	Software used for data analysis	<field name="software">Trans-Proteomics Pipeline</field>	A	[0..n]
publication	Free text describing the publications, citation, title^**	<field name="publication">Effect of Obesity on the Preovulatory Follicle.</field>	A	[0..n]
dataset_file	The URL of an individual data file in the dataset.	<field name="dataset_file">ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2010/07/PRD000123/PRIDE_Exp_Complete_Ac_9777.xml.gz</field>	A	[0..n]

Please note the following comments about the above fields:

^*repository: The name of the repository should be provided at the dataset level because some providers can act as aggregators distributing datasets from different databases. In such cases the original repository should be provided using the field <field name="repository"/>.

^**publication: Some resources do not provide information about the PubMed article in which the dataset was published. In such cases OmicsDI provides a mechanism to add the publication as free text by means of the field <field name="publication"/>.
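The cardinalities of the mandatory standardised fields in the table above (omics_type [1..n], repository [1] and full_dataset_link [1]) lend themselves to a simple automated check. The Python sketch below is one hypothetical way to express it; the names MANDATORY and cardinality_errors are assumptions for this example and do not correspond to any OmicsDI tool.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# (min, max) occurrences per mandatory field; None means unbounded.
MANDATORY = {
    "omics_type": (1, None),
    "repository": (1, 1),
    "full_dataset_link": (1, 1),
}


def cardinality_errors(entry):
    """Return (field name, occurrences) for each mandatory field out of range."""
    counts = Counter(
        f.get("name") for f in entry.findall("./additional_fields/field")
    )
    errors = []
    for name, (low, high) in MANDATORY.items():
        n = counts[name]
        if n < low or (high is not None and n > high):
            errors.append((name, n))
    return errors


GOOD = ET.fromstring(
    "<entry><additional_fields>"
    '<field name="omics_type">Proteomics</field>'
    '<field name="repository">PRIDE</field>'
    '<field name="full_dataset_link">'
    "http://www.ebi.ac.uk/pride/archive/projects/PRD000123</field>"
    "</additional_fields></entry>"
)
BAD = ET.fromstring(
    "<entry><additional_fields>"
    '<field name="repository">PRIDE</field>'
    '<field name="repository">BioModels</field>'
    '<field name="full_dataset_link">'
    "http://www.ebi.ac.uk/pride/archive/projects/PRD000123</field>"
    "</additional_fields></entry>"
)

print(cardinality_errors(GOOD))
print(cardinality_errors(BAD))
```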

The <additional_fields> section provides a mechanism for adding free text to OmicsDI XML and is important for search capabilities. However, it is not well suited to structured data, where references and identifiers should be provided. For that information OmicsDI XML provides the <cross_references> section (see below).

OmicsDI XML Entry: Cross-reference fields

Standardised fields for OmicsDI

Field	Comment	Example	Type	Cardinality
pubmed	The PubMed identifier of the publication; it is also used to find similar datasets based on their publications.	<ref dbkey="22531117" dbname="pubmed"/>	R	[0..n]

OmicsDI provides a mechanism to store and handle data that is well-referenced and structured, and that uses common identifiers, such as biological entities, ontology terms, taxonomies and publications. The cross-reference fields are used for those properties that link to other indexed databases. For example, if the PubMed identifier is known, a cross-reference to the PubMed database should be added.

Examples of cross-references to other resources are shown below:


    <entries>
     <entry id="ST000004">
       ...
       <cross_references>
         <ref dbkey="19770167" dbname="pubmed"/>
         <ref dbkey="P01130" dbname="uniprot"/>
         <ref dbkey="ENSP00000316578" dbname="ensembl"/>
         <ref dbkey="CHEBI:30031" dbname="chebi"/>
         <ref dbkey="GO:0044752" dbname="go"/>
       </cross_references>    
     </entry>
     ...
    </entries>
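To show how such cross-references might be consumed, the Python sketch below groups the dbkey values of an entry by dbname. The function name cross_references is hypothetical, and lower-casing dbname is an assumption of this sketch (the examples in this document use both "ChEBI" and "chebi"); whether OmicsDI treats dbname case-insensitively is not stated here.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def cross_references(entry):
    """Group the dbkey values of an <entry> by lower-cased dbname."""
    refs = defaultdict(list)
    for ref in entry.findall("./cross_references/ref"):
        refs[ref.get("dbname", "").lower()].append(ref.get("dbkey"))
    return dict(refs)


ENTRY = ET.fromstring(
    '<entry id="ST000004">'
    "<cross_references>"
    '<ref dbkey="19770167" dbname="pubmed"/>'
    '<ref dbkey="P01130" dbname="uniprot"/>'
    '<ref dbkey="CHEBI:30031" dbname="chebi"/>'
    '<ref dbkey="CHEBI:16551" dbname="ChEBI"/>'
    "</cross_references>"
    "</entry>"
)

refs = cross_references(ENTRY)
print(refs)
```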

OmicsDI XML Entry: A complete example


    <entries>
     <entry id="BIOMD0000000900">
       <name>Bianca2013 - Persistence analysis in a Kolmogorov-type model for cancer-immune system competition</name>
       <description>Persistence analysis in a Kolmogorov-type model for cancer-immune system competition
        AIP Conference Proceedings 1558, 1797 (2013); https://doi.org/10.1063/1.4825874
        C. Bianca
        Dipartimento di Scienze Matematiche, Politecnico di Torino, Torino, Italy
        F. Pappalardo
        Dipartimento di Scienza del Farmaco, Università degli Studi di Catania, Catania, Italy
        M. Pennisi and M. A. Ragusa
        Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Catania, Italy        
  </description>
       <dates>
         <date type="submission" value="2019-12-18"/>
         <date type="publication" value="2019-12-18"/>
         <date type="last_modification" value="2019-12-18"/>
       </dates>    
       <additional_fields>
         <field name="submitter">Mohammad Umer Sharif Shohan</field>
         <field name="submitter_mail">msharifshohan@ebi.ac.uk</field>
         <field name="repository">BioModels</field>
         <field name="full_dataset_link">https://www.ebi.ac.uk/biomodels/BIOMD0000000900</field>
         <field name="publication">This paper is concerned with analytical investigations on the competition between cancer cells and immune system cells. Specifically the role of the B-cells and T-cells in the evolution of cancer cells is taken into account. The mathematical model is a Kolmogorov-type system of three evolution equations where the growth rate of the cells is described by logistic law and the response of B-cells and T-cells is modeled according to Holling type-II function. The stability analysis of equilibrium points is performed and the persistence of the model is proved.. 1, 1558.
          Dipartimento di Scienze Matematiche, Politecnico di Torino, Torino, Italy
Dipartimento di Scienza del Farmaco, Università degli Studi di Catania, Catania, Italy
Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Catania, Italy
</field>
          <field name="modellingApproach">Ordinary differential equation model</field>
          <field name="omics_type">Models</field>
          <field name="modelFormat">SBML</field>
          <field name="submissionId">MODEL1912180002</field>
          <field name="publicationId">BIOMD0000000900</field>
          <field name="publication_year">2013</field>
          <field name="levelVersion">L2V4</field>
          <field name="curationStatus">Manually curated</field>
          <field name="non_derived_xrefs">BIOMD0000000900 biomodels.db MODEL1912180002 biomodels.db</field>
          <field name="publication_authors">C. Bianca, F. Pappalardo, M. Pennisi and M. A. Ragusa</field>
          <field name="first_author">C. Bianca</field>
          <field name="tokenised_name">Bianca2013   Persistence analysis in a Kolmogorov type model for cancer immune system competition</field>
        </additional_fields>
        <cross_references>
          <ref dbkey="9606" dbname="taxonomy"/>
          <ref dbkey="C25513" dbname="ncit"/>
          <ref dbkey="GO:0006955" dbname="go"/>
          <ref dbkey="10.1063/1.4825874" dbname="doi"/>
          <ref dbkey="10.4039/entm9745fv" dbname="doi.org"/>
          <ref dbkey="0000311" dbname="efo"/>
          <ref dbkey="BTO:0000776" dbname="bto"/>
          <ref dbkey="BTO:0000782" dbname="bto"/>
          <ref dbkey="C64914" dbname="ncit"/>
          <ref dbkey="C64382" dbname="ncit"/>
        </cross_references>
     </entry>
     ...
    </entries>