The Omics Discovery Index (OmicsDI) expects data from providers in a common XML format*.

The architecture of OmicsDI starts with an XML file that contains the information from all datasets in a given database. XML files are retrieved from providers nightly, and every new dataset in the provided XML file is added to OmicsDI automatically.

Each file in OmicsDI is indexed using the EBI Search system, and the resulting information is made available via web services. The EBI Search system also contains indexes of other major databases such as UniProt, Ensembl and PubMed, allowing data providers to cross-link biological entities in their datasets with those resources.

*For any queries about the OmicsDI XML format or data submissions to OmicsDI, please contact:
omicsdi-support@ebi.ac.uk

OmicsDI XML: High Level Structure

The OmicsDI XML is used to represent metadata of any database (including all of its datasets) via the following common generic structure:


<database>
  <name>Database Name</name>
  <description>Database Description</description>
  <release>Release tag or number</release>
  <release_date>Release date</release_date>
  <entry_count>Number of entries</entry_count>
  <entries>
     <entry id="Dataset_ID_1">
       <name>Name of the Dataset</name>
       <description>Description of the dataset</description>
       <cross_references>
         <ref dbkey="CHEBI:16551" dbname="ChEBI"/>
         <ref dbkey="MTBLC16551" dbname="MetaboLights"/>
         <ref dbkey="CHEBI:16810" dbname="ChEBI"/>
       </cross_references>
       <dates>
         <date type="submission" value="2013-11-19"/>
         <date type="publication" value="2013-11-26"/>
       </dates>
       <additional_fields>
         <field name="repository">Repository</field>
         <field name="omics_type">Omics Type</field>
       </additional_fields>       
     </entry>
  </entries>
</database>

The cross_references section links the dataset to external databases. In each ref, dbkey holds the identifier of the linked record in the external database, and dbname identifies that database.
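As a minimal sketch of how a consumer might read this structure, the following Python snippet parses a small OmicsDI-style document with the standard library and collects the cross-references of each entry. The sample document and the helper name `cross_references` are illustrative assumptions, not part of OmicsDI. The empty `ref` elements are written self-closed so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Illustrative sample following the generic structure above (not real OmicsDI output).
SAMPLE = """\
<database>
  <name>Database Name</name>
  <entry_count>1</entry_count>
  <entries>
    <entry id="Dataset_ID_1">
      <name>Name of the Dataset</name>
      <cross_references>
        <ref dbkey="CHEBI:16551" dbname="ChEBI"/>
        <ref dbkey="MTBLC16551" dbname="MetaboLights"/>
        <ref dbkey="CHEBI:16810" dbname="ChEBI"/>
      </cross_references>
    </entry>
  </entries>
</database>
"""


def cross_references(xml_text):
    """Map each entry id to a list of (dbname, dbkey) pairs (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("./entries/entry"):
        refs = entry.findall("./cross_references/ref")
        result[entry.get("id")] = [(r.get("dbname"), r.get("dbkey")) for r in refs]
    return result


if __name__ == "__main__":
    for entry_id, refs in cross_references(SAMPLE).items():
        print(entry_id, refs)
```

Grouping the pairs by entry id keeps the link between a dataset and the external records it points to, which is the information the cross-linking relies on.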

OmicsDI defines a hierarchical metadata schema for each dataset, divided into three main categories: Mandatory, Recommended and Additional. The scoring system in the OmicsDI search engine boosts datasets that provide more metadata, thus rewarding researchers who have annotated their results more thoroughly. This document describes the structure of OmicsDI schema version 1.0; the tables in the following sections list the metadata fields with a description, an example and a category for each. The three categories are defined as follows:

  • Mandatory (M): These fields must be provided for the OmicsDI schema to be valid, and are part of the minimum information required to represent a dataset in OmicsDI;
  • Recommended (R): These fields should be provided so that the dataset is searchable and displayed adequately in the OmicsDI web interface and web services;
  • Additional (A): These fields should be provided to add value to the dataset: the more metadata a dataset contains, the more the OmicsDI infrastructure can do with the data. For example, if the proteins, genes or metabolites are provided for each dataset, OmicsDI is able to find other datasets where those biological entities have been found or studied.

OmicsDI XML: Database Section

All the information required for inclusion in OmicsDI is contained within the database section of the XML file (see generic structure above):

| Field | Comment | Example | Type |
|---|---|---|---|
| name | Name of the database or provider | <name>PRIDE</name> | M |
| description | A short description of the provider. This description is shown in OmicsDI web interface and can be used in OmicsDI search. | <description>The proteomics identification database is an EBI resource for Proteomics</description> | R |
| release | The tag for the database release to which the data belongs. | <release>Release-May-2016</release> | A |
| release_date | The date of the database release to which the data belongs. This field may be used to store the date the data was generated, if applicable. | <release_date>2015-05-13</release_date> | R |
| entry_count | The number of entries in the XML file. This field is used for validation purposes. | <entry_count>2</entry_count> | R |

Providers may add further information to the database section (e.g. <license>Apache 2.0</license>), but it will not be captured during the indexing process.
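The categories in the table above can be checked mechanically before a file is handed over. The sketch below is one possible pre-submission check, not OmicsDI's own validator: the function name `missing_database_fields` and the shape of its return value are assumptions made for illustration. It reports which database-level fields are absent or empty, grouped by category.

```python
import xml.etree.ElementTree as ET

# Categories copied from the table above: M = Mandatory, R = Recommended, A = Additional.
DATABASE_FIELDS = {
    "name": "M",
    "description": "R",
    "release": "A",
    "release_date": "R",
    "entry_count": "R",
}


def missing_database_fields(xml_text):
    """Return {"M": [...], "R": [...], "A": [...]} listing absent or empty fields.

    Hypothetical helper: only direct children of <database> are inspected,
    so the <name> of an <entry> is never mistaken for the database name.
    """
    root = ET.fromstring(xml_text)
    missing = {"M": [], "R": [], "A": []}
    for field, category in DATABASE_FIELDS.items():
        element = root.find(field)
        if element is None or not (element.text or "").strip():
            missing[category].append(field)
    return missing


if __name__ == "__main__":
    doc = "<database><name>PRIDE</name><entries/></database>"
    print(missing_database_fields(doc))
```

A provider would typically treat anything listed under "M" as an error and the other two lists as warnings.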

Small databases can reasonably provide their data to OmicsDI as a single full-repository XML file. However, most omics resources contain a large number of datasets, making it impractical to exchange their data in a single file. Such resources may provide their data via multiple XML files in the format described above, each containing a distinct subset of dataset entries. Note that entry_count in each XML file should correspond to the number of entries in that file only, not to the overall number of entries provided by that database.
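The per-file entry_count rule can be illustrated with a short sketch. The code below splits one in-memory document into several smaller ones and sets entry_count in each to the size of that chunk only. The function names `split_database` and `entry_count_matches`, and the choice of chunk size, are assumptions for illustration; the sketch also assumes the input already contains an `<entries>` element.

```python
import copy
import xml.etree.ElementTree as ET


def split_database(xml_text, chunk_size):
    """Split one <database> document into several, each with its own entry_count."""
    root = ET.fromstring(xml_text)
    entries = root.findall("./entries/entry")
    chunks = []
    for start in range(0, len(entries), chunk_size):
        part = copy.deepcopy(root)          # keeps name, description, release, ...
        container = part.find("entries")
        for old in list(container):         # drop the copied entries
            container.remove(old)
        subset = [copy.deepcopy(e) for e in entries[start:start + chunk_size]]
        container.extend(subset)
        count = part.find("entry_count")
        if count is None:
            count = ET.SubElement(part, "entry_count")
        count.text = str(len(subset))       # entries in THIS file only
        chunks.append(ET.tostring(part, encoding="unicode"))
    return chunks


def entry_count_matches(xml_text):
    """True if the declared entry_count equals the number of <entry> elements."""
    root = ET.fromstring(xml_text)
    declared = (root.findtext("entry_count") or "").strip()
    actual = len(root.findall("./entries/entry"))
    return declared.isdigit() and int(declared) == actual


if __name__ == "__main__":
    doc = ("<database><name>X</name><entry_count>3</entry_count><entries>"
           "<entry id='a'/><entry id='b'/><entry id='c'/></entries></database>")
    for part in split_database(doc, 2):
        print(entry_count_matches(part), part)
```

Copying the database-level fields into every chunk keeps each file valid on its own, which matches the requirement that all files share the same format.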

OmicsDI XML: Entries Section

The entries section contains all the datasets provided in a given XML file. The <entries> tag is used to list all the entries. Each dataset is enclosed in an <entry> tag.

Each entry consists of three sections: general information, cross-references and additional fields.

A dataset in OmicsDI must have four different attributes: an identifier, a name, a description and a date of publication. In addition, other (optional) dates may be provided: submission, updated and creation, as listed in the table below:

| Field | Comment | Example | Type |
|---|---|---|---|
| id | Original and UNIQUE identifier across the repository, database or provider | <entry id="PXD000001"></entry> | M |
| name | Name, title of the dataset, can be considered as the title of the publication | <name>TMT spikes</name> | M |
| description | A short description or abstract of the dataset. It can be considered similar to a "publication abstract" | <description>Expected reporter ion ratios: Erwinia peptides</description> | M |
| date | Date of publication of the dataset | <date type="publication" value="2014-09-22"/> | M* |
| date | Date of initial creation of dataset submission in the database | <date type="creation" value="2014-09-22"/> | M* |
| date | Date of successful submission to the database | <date type="submission" value="2014-09-22"/> | M* |
| date | Date of the latest update to the dataset | <date type="updated" value="2014-09-22"/> | M* |

*Note that at least one of the date fields above must be present.
For example:


    <entries>
    <entry id="ST000004">
      <name>Lipidomics studies on NIDDK / NIST human plasma samples</name>
      <description>The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in collaboration with the National Institute of Standards (NIST) recently produced a human plasma standard reference material (SRM 1950) for metabolite analysis.</description>
      <dates>
        <date type="creation" value="2013-02-01"/>
        <date type="publication" value="2014-09-22"/>
        <date type="submission" value="2013-11-19"/>
        <date type="updated" value="2014-05-21"/>
      </dates>
      ...
    </entry>
    ...
    </entries>
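The mandatory entry fields and the "at least one date" rule can be checked with a few lines of Python. This is a sketch under stated assumptions rather than the validation OmicsDI itself performs: the helper names `entry_problems` and `validate` are hypothetical, and the set of accepted date types is taken from the table above.

```python
import xml.etree.ElementTree as ET

# Date types listed in the table above.
DATE_TYPES = {"publication", "creation", "submission", "updated"}


def entry_problems(entry):
    """Return a list of human-readable problems for one <entry> element."""
    problems = []
    if not entry.get("id"):
        problems.append("missing id attribute")
    for field in ("name", "description"):
        if not (entry.findtext(field) or "").strip():
            problems.append(f"missing {field}")
    dates = [d for d in entry.findall("./dates/date")
             if d.get("type") in DATE_TYPES and d.get("value")]
    if not dates:
        problems.append("no date of a recognised type")
    return problems


def validate(xml_text):
    """Map each entry id to its list of problems (empty list means it passed)."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): entry_problems(e) for e in root.iter("entry")}


if __name__ == "__main__":
    doc = """<entries>
      <entry id="ST000004">
        <name>Lipidomics studies on NIDDK / NIST human plasma samples</name>
        <description>Human plasma standard reference material.</description>
        <dates><date type="creation" value="2013-02-01"/></dates>
      </entry>
    </entries>"""
    print(validate(doc))
```

Returning every problem per entry, instead of stopping at the first one, lets a provider fix a whole file in a single pass.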