NOMAD Meta Info

NOMAD wanted to make data produced by others available for big data analysis. The NOMAD meta info is a cornerstone of our approach to achieving this. It is an extensible, language-independent, and format-agnostic description of atomistic simulation data.

Names with Context

Unlike a plain dictionary, a meta info dictionary is not just a collection of names and descriptions, but a collection of meta_info_entry objects. Each meta_info_entry has, along with meta_name and meta_description, a meta_type and several other attributes. The most crucial of these is meta_parent_section, which gives the context of that value.

The meaning of a value can often be understood only together with other data: its context. meta_parent_section provides this context in the form of a section, a composite type described by a meta_info_entry with meta_type='type-section'.

The context helps users to understand the values. Section names are globally unique, which also simplifies referring to them.

In NOMAD we further decided that all meta_name values should be unique.
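The structure described above can be sketched in Python. This is only an illustration under assumptions, not the actual NOMAD tooling: the class MetaInfoEntry, the helpers build_dictionary and context_of, the example entries, and the 'type-value' meta_type are hypothetical. Only the attribute names and the 'type-section' value come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MetaInfoEntry:
    """Hypothetical stand-in for a meta_info_entry (only four attributes shown)."""
    meta_name: str
    meta_description: str
    meta_type: str  # 'type-section' is from the text; 'type-value' is assumed
    meta_parent_section: Optional[str] = None


def build_dictionary(entries: List[MetaInfoEntry]) -> Dict[str, MetaInfoEntry]:
    """Index entries by meta_name, enforcing unique names and valid parents."""
    by_name: Dict[str, MetaInfoEntry] = {}
    for entry in entries:
        if entry.meta_name in by_name:
            raise ValueError(f"duplicate meta_name: {entry.meta_name}")
        by_name[entry.meta_name] = entry
    for entry in by_name.values():
        parent = entry.meta_parent_section
        if parent is None:
            continue
        if parent not in by_name or by_name[parent].meta_type != "type-section":
            raise ValueError(f"{entry.meta_name}: {parent!r} is not a known section")
    return by_name


def context_of(name: str, by_name: Dict[str, MetaInfoEntry]) -> List[str]:
    """Return the enclosing sections of an entry, innermost first.

    Assumes the parent links contain no cycles.
    """
    chain: List[str] = []
    parent = by_name[name].meta_parent_section
    while parent is not None:
        chain.append(parent)
        parent = by_name[parent].meta_parent_section
    return chain


# Illustrative entries; names and descriptions are examples only.
entries = [
    MetaInfoEntry("section_run", "One run of a simulation code.", "type-section"),
    MetaInfoEntry("section_system", "One atomic configuration.", "type-section",
                  "section_run"),
    MetaInfoEntry("atom_positions", "Positions of the atoms.", "type-value",
                  "section_system"),
]
meta = build_dictionary(entries)
print(context_of("atom_positions", meta))
```

Because every meta_name is unique, a single flat lookup table is enough to find an entry, and the chain of parent sections gives the context in which its value should be read.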

Extensibility

Schemas are often used to describe the structure of data (and thus the context of a value). The main purpose of a schema is validation, so schemas normally adopt a top-down approach: one starts by describing the outermost value and then, recursively, all its components. While this approach works well for validation, it makes extending the schema cumbersome.

The meta info, on the other hand, was developed to describe data structure, and it easily allows another dictionary to define meta_info_entry objects with any meta_parent_section, thus attaching them to an existing section (and extending it).

This open-world approach was crucial in NOMAD, where every simulation code might have code-specific extensions.
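A minimal sketch of this kind of extension, using plain Python dicts. The helper children_by_section, the dictionary contents, the x_examplecode_ prefix, and the 'type-value' meta_type are assumptions made for illustration. They are not taken from the NOMAD tools.

```python
from collections import defaultdict

# A "common" dictionary defining a section and one value inside it.
common = [
    {"meta_name": "section_run", "meta_type": "type-section"},
    {"meta_name": "program_name", "meta_type": "type-value",
     "meta_parent_section": "section_run"},
]

# A separate, code-specific dictionary. It does not modify `common`;
# it only names an existing section as the parent of its new entry.
code_specific = [
    {"meta_name": "x_examplecode_input_flag", "meta_type": "type-value",
     "meta_parent_section": "section_run"},
]


def children_by_section(*dictionaries):
    """Merge dictionaries and list the entries attached to each section."""
    seen, sections = set(), set()
    for dictionary in dictionaries:
        for entry in dictionary:
            name = entry["meta_name"]
            if name in seen:
                raise ValueError(f"duplicate meta_name: {name}")
            seen.add(name)
            if entry["meta_type"] == "type-section":
                sections.add(name)
    children = defaultdict(list)
    for dictionary in dictionaries:
        for entry in dictionary:
            parent = entry.get("meta_parent_section")
            if parent is None:
                continue
            if parent not in sections:
                raise ValueError(f"unknown parent section: {parent}")
            children[parent].append(entry["meta_name"])
    return dict(children)


print(children_by_section(common))
print(children_by_section(common, code_specific))
```

The point of the design is that the extension lives entirely in the second dictionary: the section gains a new member without the definition of the section itself being touched.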

Map to concrete data formats

The goal of the meta info was not to define a schema or a specific format, but rather to describe data with enough structure to be useful. The meta info tries to avoid the complexity that comes from describing all the details of a specific format by abstracting the most useful and common properties of data formats, and it tries to ensure that data conforming to the meta info can be mapped naturally to several data formats.

We defined mappings to JSON, HDF5, and Parquet.
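To illustrate what mapping the same structured data to more than one format can look like, here is a small sketch. The layout is an assumption chosen for the example, not the mapping actually defined by NOMAD: sections are represented as lists of objects, the nested form is serialized as JSON, and the flattened path form is the kind of layout a hierarchical store such as HDF5 could use. The function flatten and all names in the data are hypothetical.

```python
import json

# Nested representation: each section is a list of objects, each value a key.
data = {
    "section_run": [
        {
            "program_name": "examplecode",
            "section_system": [
                {"atom_labels": ["H", "H", "O"]},
            ],
        }
    ]
}


def flatten(node, prefix=""):
    """Yield (path, value) pairs; section instances become path components."""
    for key, value in node.items():
        path = f"{prefix}/{key}"
        is_section = (
            isinstance(value, list)
            and bool(value)
            and all(isinstance(item, dict) for item in value)
        )
        if is_section:
            for index, child in enumerate(value):
                yield from flatten(child, f"{path}/{index}")
        else:
            yield path, value


as_json = json.dumps(data, sort_keys=True)  # JSON-style mapping
as_paths = dict(flatten(data))              # path-style mapping

for path, value in as_paths.items():
    print(path, value)
```

Both representations carry the same information, because the section hierarchy, rather than any particular file format, determines where each value lives.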

Tools

Recently I rewrote the tools to manage the meta info in Python 3, improved them, and generally tried to take advantage of the lessons learned. They are available on GitHub.

With them, it is possible to:

  • generate deeply interlinked documentation that consists only of static HTML files, which can be served in any way. The pages use JavaScript judiciously to filter the meta_info_entry list and display mathematical formulas. The documentation of the current version is available here.
  • reformat dictionaries, with support for the new git-friendly exploded format, which uses one file per meta_info_entry, grouped by parent section, and simplifies reviewing the history in git.
  • perform several consistency checks on the meta info.
  • generate JSON Schema compatible schemas for values corresponding to the meta info in JSON format.
  • validate data, described by the meta info, that is stored in the JSON format.
  • run a cascade command that automatically reformats, transforms the exploded format to a single file, documents, and generates JSON Schema files for a group of meta info dictionaries, and thus keeps a git repository current. nomad-meta-info, for example, is kept up to date using it.

NOMAD meta info

In NOMAD the meta info has been used to describe atomistic simulations. The concrete structuring of the data, the names to use in practice, and all the small concrete details need as much thinking as the general structure of the meta info, if not more.

The latest version of the common “code independent” parts of the meta info can be browsed here, or with the old interactive browser.

To understand the idea behind it, it is useful to look at the main sections used, which are represented in the image below:

Logical relationships between the various sections representing an atomistic calculation (MD,...).
