Wim Hugo1*, Donald Hobern2, Urmas Kõljalg3, Éamonn O’Tuama2, Hannu Saarenmaa4

* Coordinating Lead Author

1 South African Environmental Observation Network, PO Box 2600, Pretoria 0001, South Africa. wim@saeon.ac.za

2 Global Biodiversity Information Facility, Universitetsparken 15, DK-2100 Copenhagen, Denmark. dhobern@gbif.org

3 Institute of Ecology and Earth Sciences, University of Tartu, Ülikooli 18, Tartu 50090, Estonia. urmas.koljalg@ut.ee

4 Digitarium/ University of Eastern Finland, P.O. Box 111, FI-80101 Joensuu, Finland. hannu.saarenmaa@helsinki.fi

Review of Standards in Biodiversity Informatics
Measurements and Sensors

In the longer term, for describing observations and measurements associated with biological sampling, the biodiversity community[1] will benefit from adoption of a comprehensive conceptual model, such as the OGC and ISO O&M[2] [3]. The model is an essential underpinning for the related OGC Sensor Observation Service. The model defines an observation as an activity that results in a measurement, obtained using a particular procedure, of the value of a property associated with a feature-of-interest. A sampling feature can be, e.g., a station, transect or specimen, and a set of related observations can be grouped together in the same sampling event. The model is, by its nature, high level and abstract, and although an XML implementation[4] exists, the challenge remains[5] for any community of practice to develop community-based vocabularies and content standards by identifying the important features and their properties within a particular domain and expressing these using GML application schemas.
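The relationships described above can be sketched as a small data structure. This is only an illustration of the concepts (feature-of-interest, observed property, procedure, result, sampling event); the class and attribute names are assumptions chosen for readability and are not taken from the O&M schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SamplingFeature:
    """Feature of interest, e.g. a station, transect or specimen."""
    identifier: str
    feature_type: str


@dataclass
class Observation:
    """One activity yielding a value for one property of one feature."""
    feature_of_interest: SamplingFeature
    observed_property: str
    procedure: str
    result: float
    unit: str
    phenomenon_time: str  # ISO 8601 timestamp kept as plain text


@dataclass
class SamplingEvent:
    """Groups related observations made during the same event."""
    identifier: str
    observations: List[Observation] = field(default_factory=list)


# Hypothetical example values, for illustration only.
transect = SamplingFeature("transect-01", "transect")
event = SamplingEvent("event-2013-03-01")
event.observations.append(
    Observation(transect, "canopy cover", "densiometer reading",
                62.5, "%", "2013-03-01T09:30:00Z"))
event.observations.append(
    Observation(transect, "air temperature", "handheld thermometer",
                18.2, "Cel", "2013-03-01T09:35:00Z"))

properties = [obs.observed_property for obs in event.observations]
print(properties)
```

A community vocabulary would then constrain the allowed values of fields such as the observed property and the procedure, which is the part the abstract model leaves open.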

Occurrences, Observations, and Monitoring

Darwin Core[6] (DwC) (Wieczorek et al. 2012) is a TDWG standard designed for sharing data about biodiversity. It is a glossary of terms, which can be seen as an extension of the Dublin Core[7] metadata standard for the biodiversity domain. It is amongst the most widely deployed of biodiversity vocabularies (e.g. on the GBIF network), and while its main use is for publishing specimen and observation records, it continues to evolve to meet the needs for sharing more complex sample-based and ecological data.
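As a minimal sketch of what sharing records with Darwin Core terms can look like, the following writes two occurrence records to CSV using DwC term names as column headers. The record values and the identifier scheme are invented for illustration; the choice of columns is an assumption, not a recommended minimum.

```python
import csv
import io

# Column headers are Darwin Core term names; the selection is illustrative.
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName", "eventDate",
    "decimalLatitude", "decimalLongitude", "countryCode",
]

# Hypothetical records, not real data.
records = [
    {"occurrenceID": "urn:example:occ:1", "basisOfRecord": "HumanObservation",
     "scientificName": "Panthera leo", "eventDate": "2012-06-14",
     "decimalLatitude": "-24.99", "decimalLongitude": "31.59",
     "countryCode": "ZA"},
    {"occurrenceID": "urn:example:occ:2", "basisOfRecord": "PreservedSpecimen",
     "scientificName": "Protea cynaroides", "eventDate": "1998-11-02",
     "decimalLatitude": "-34.08", "decimalLongitude": "18.42",
     "countryCode": "ZA"},
]


def to_dwc_csv(rows):
    """Serialise records as CSV text with one header line of DwC terms."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dwc_csv = to_dwc_csv(records)
print(dwc_csv)
```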

Access to Biological Collection Data[8] (ABCD) is also a TDWG standard. It uses a more comprehensive model than DwC and is thus more expressive. ABCD covers metadata (data set descriptions), everything related to the collecting or observing event (who, why, where, when, how), everything related to identifications (who, when, as what, according to, etc.), biological observations (pathogen, pollinator, parasitic and other relationships, sex, stage, etc.), and freely chosen measurements and categorised observations and their methodology. ABCD is used in the BioCASE network and readily integrates with the GBIF network through a mapping of ABCD to DwC. Significantly, DwC has adopted some properties from ABCD. ABCD also provides extensions for Earth/Geosciences (Access to Biological Collection Data Extended for Geosciences, ABCDEFG[9]) and genomic data (ABCDDNA).

Ecological Metadata Language[10] (EML) is a metadata specification developed by the ecology discipline and for the ecology discipline. It is based on prior work done by the Ecological Society of America and associated efforts (Michener et al., 1997). EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document ecological data. Each EML module is designed to describe one logical part of the total metadata that should be included with any ecological data set.

Unifying data standards of biodiversity and ecological monitoring is a challenge for GEO BON. The occurrence domain has developed the ABCD and Darwin Core standards, but the ecosystem monitoring domain has nothing similar, probably due to the perceived complexity of the task.

Common ground has partly been found in recent years through adoption of Ecological Metadata Language (EML) by the occurrence domain. EML originates in the ecosystem monitoring domain, and is suitable for describing sampling protocols. Through development of a profile[11], it has now been extended to also cover descriptions of museum collections and databases. However, this still needs to be directly supported in EML through a dedicated class of elements.

Sites and Habitats

A site is a facility, such as an experimental station, with certain capabilities for carrying out projects. A biological collection is perhaps a corresponding concept. Progress has already been made by ALTER-net[12] to describe sites in a standardised way.

Much work is also being done in habitat classification. Darwin Core currently has just one property (habitat) for describing the habitat in which an event occurred but a proposal[13] has recently been submitted to expand Darwin Core by adding three new properties (environmental material, environmental feature, and biome), with the recommended values for these and habitat drawn from the equivalent Environment Ontology (EnvO)[14] classes.

Genetic Data

Generally, the term genetics is used for the study of single genes, while genomics covers the study of all genes and their interactions, including environmental ones. Standards developed for this subject cover both areas and there is no clear line between them. We therefore use the term genomic sensu lato here, which also covers genetics. Metagenomics refers to the study of sequence data derived directly from environmental samples, without first isolating and culturing individual organisms. Such studies are set to revolutionise our understanding of biodiversity by enabling investigation of microbial diversity in relation to community structure, habitat and environment at the fundamental level of the genome (Wooley, 2010).

The International Nucleotide Sequence Database Collaboration[15] (INSDC) includes three collaborating partners, viz. National Center for Biotechnology Information[16] (NCBI), European Nucleotide Archive[17] (ENA) and DNA Database of Japan[18] (DDBJ) who, together, developed a common standard and exchange format for genomic data. Documentation includes a feature definition table[19] and sample record[20].

The Genomic Standards Consortium[21] (GSC) is the principal organisation for the development of genomic standards. Founded in 2005, its mission is the implementation of new genomic standards as well as methods to capture and exchange associated metadata. GSC collaborates with the INSDC in order to implement genomic standards in their system. The GSC standard “Minimum Information about any (x) Sequence” (MIxS) (Yilmaz et al. 2011) includes three separate checklists, which are sometimes also called standards: MIGS for genomes (“minimum information about a genome sequence”), MIMS for metagenomes (“minimum information about a metagenome sequence”) and MIMARKS for marker genes (“minimum information about a marker gene sequence”). MIxS also includes so-called environmental packages for describing the environment from which the organism(s) or DNA sample was taken. There are currently 14 environmental packages, with additional packages under development. The list of environmental packages as well as shared and specific descriptors in the checklists are shown in Fig. 12.A.1.

Figure 12.A.1 Overview of the MIxS checklists and environmental packages.
Source: http://gensc.org/index.php?title=File:Fig1.png
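The idea of a minimum-information checklist can be illustrated with a simple completeness check. The sketch below is not a MIxS validator: the set of required fields is an assumed, illustrative subset, and the field names should be verified against the published checklists before any real use.

```python
# Assumed, illustrative subset of descriptors; the authoritative lists are
# the MIxS checklists and environmental packages themselves.
ASSUMED_REQUIRED = {
    "project_name", "investigation_type", "collection_date",
    "lat_lon", "geo_loc_name", "env_biome",
}


def missing_fields(record):
    """Return the assumed-required fields that are absent or blank, sorted."""
    return sorted(
        name for name in ASSUMED_REQUIRED
        if not str(record.get(name, "")).strip()
    )


# Hypothetical sample description with two gaps.
sample = {
    "project_name": "Example soil survey",
    "investigation_type": "metagenome",
    "collection_date": "2012-02-20",
    "geo_loc_name": "South Africa: Kruger National Park",
    "lat_lon": "",
}

gaps = missing_fields(sample)
print(gaps)
```

A checklist expressed this way makes it easy to report to a submitter exactly which descriptors still need to be supplied.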

The Genomic Biodiversity Working Group[22] (GBWG) of the GSC was formed to review existing biodiversity standards and bridge the gaps between researchers working in molecular biology, taxonomy, ecology, and biodiversity informatics. The GBWG collaborates with TDWG through annual meetings at TDWG conferences. In addition, a series of workshops funded largely through the US National Science Foundation and GBIF brought together experts from the genomics and traditional biodiversity communities to address the alignment of their respective standards. In February 2012, a hackathon brought together several experts to continue the alignment of the Darwin Core and MIxS standards (Ó Tuama 2012). In May 2012, at the Semantics of Biodiversity meeting[23], term definitions in biodiversity informatics were addressed and, in September 2012, at the bioCollections Ontology Hackathon[24], a prototype bioCollections Ontology was developed. These workshops gave significant input to three initiatives:

  1. Darwin Core DNA and Tissue Extension, which aims to track DNA extracts and any biological samples as they relate to occurrence records harvested by GBIF. Two primary use cases were proposed for this extension – a) barcoding, producing a 1:1 mapping between sample and taxonomy, and b) metagenomics / molecular community ecology, typically giving a 1-to-many mapping between sample and taxonomy.
  2. BiSciCol[25], a linked data project with a goal of tracking biological collection objects and their derivatives, across distributed databases, multiple domains and information standards. BiSciCol provides a method for determining allowable relationships and traversing graph-based data derived from multiple standards for biological collections.
  3. Development of BCO[26] (Biodiversity Collections Ontology) and PCO[27] (Population and Community Ontology) that serve to fill the gap in the formal description of biodiversity observation (collections, specimens), and to formulate more complex relationships between primary data elements such as evolutionary processes, organismal interactions, and ecological experiments. These ontologies are in development (PCO) or have just recently been formally published (BCO) but would form useful starting points for community adoption of standards that are sorely needed.

These initiatives led to the creation of two extensions to the Darwin Core Standard (DwC), viz. MIxS sample[28] and Taxon Abundance[29], which are still under development. There are many adopters of the MIxS standards, including INSDC, the Quantitative Insights Into Microbial Ecology[30] (QIIME) software package, EBI Metagenomics Portal, Genomes Online Database (GOLD), etc., and the number continues to grow. The GSC also has its own journal, “Standards in Genomic Sciences”[31], and several core projects:

  • GCDML[32] – Genomic Contextual Data Markup Language, an XML Schema for generating MIxS compliant reports for data entry, exchange and storage. This sample-centric, strongly-typed schema provides a diverse set of descriptors for describing the exact origin and processing of a biological sample, from sampling to sequencing and subsequent analysis;
  • Genomic Rosetta Stone – a registry of identifiers describing complete genomes across a wide range of relevant databases (Genome Catalogue) and allowing to automatically track down all related metadata for these published genomes. Their end goal is to make this physical mapping available in multiple formats (e.g. relational schema / spreadsheet / webservices) to facilitate the discovery of genomic information on the web, comparative genomic studies, and the population of databases with hyperlinks and metadata;
  • Habitat-Lite, which is a light-weight, easy-to-use set of terms that captures high-level information about habitat while preserving a mapping to the existing Environment Ontology (EnvO). The main motivation is to meet the needs of the majority of users by generating an enhanced list of terms based on existing data submitted to INSDC. EnvO terms are used in the MIxS specification. GSC also participates in many projects at the community level.

The Barcode of Life[33] (BOL) includes three major consortia, viz. the International Barcode of Life Project[34] (iBOL), the Consortium for the Barcode of Life[35] (CBOL) and the European Consortium for the Barcode of Life[36] (ECBOL). Their Database Working Group (DBWG) published the BARCODE data standard “Data Standards for BARCODE Records in INSDC (BRIs)”[37]. The standard sets out five major components that secure integration between DNA barcode sequences and other biodiversity information (data on specimens, taxonomy, biogeography, etc.).

The Global Genome Biodiversity Network[38] (GGBN) is a global network of well-managed collections of genomic tissue samples across the Tree of Life, which also develops standards for sharing DNA and tissue information. The DNA Bank Network[39], initiated by GBIF Germany in 2007, is one of the founding organisations of the GGBN. It maintains a central web portal, upon which the GGBN portal will be built, providing DNA samples of complementary collections. It has also developed, and uses in its network, ABCDDNA[40], a DNA extension for the ABCD standard, which it has submitted to TDWG for ratification. GGBN is also involved in creating and testing the DNA and tissue extension for Darwin Core Archive[41], and plans to use it in parallel with the ABCDDNA schema.

Genomic data is one type of many which are used to study taxa and their function in different environments. Other major data types include morphological/anatomical, physiological, chemical, environmental, etc. Exhaustive understanding of taxa, their function and distribution related to the environment and climate change is possible if all data types are stored and managed in conjunction. This is now a major driving force and most organisations developing biodiversity standards are trying to merge or link standards developed originally for a specific data type only.

Species Traits and Taxonomic Data

For species level data exchange there are several standards. Taxonomic Concept Schema (TCS) is a TDWG standard for exchanging information about biological taxa, such as their names, publications, authorship, synonyms, and concept definitions. TCS is not used widely, and most data exchange needs are much simpler and can be fulfilled with DwC.

The TDWG standard Structured Descriptive Data (SDD) is designed for the expression and transport of descriptive information about biological specimens, taxa, and similar entities such as diseases or ecosystems. However, SDD does not currently accommodate certain types of data and is thus not suitable for ecological measurements. For instance the following cannot be included:

  • Molecular sequence and other genetic data,
  • Occurrence and specimen data (e. g., distribution maps),
  • Complex ecological data such as models and ecological observations,
  • Organism interactions (host-parasite, plant-pollinator, predator-prey, etc.),
  • Nomenclatural and formal systematic (rank) information.

Character data can also be exchanged in the DELTA format, an older industry and TDWG standard.

Plinian Core[42] is a data exchange format to share species level information. Its hierarchical schema allows developing species data sheets that can be shown in websites. Plinian Core aims to be a standard for sharing information mainly at the species level. It was conceived as a way to publish species information and to make it interoperable. This refers to all kinds of properties and traits related to taxa (of any rank), including descriptions, nomenclature, conservation status, management, natural history, etc.

However, the use of these standards is not widespread. This is probably because there is no global species traits database or portal. Encyclopedia of Life, FishBase, and other similar species level portals partly fulfil that function, but do not share their traits data in standard formats. Most taxonomic data is distributed in basic forms such as CSV by the main aggregators, which are listed below.

The Catalogue of Life[43] (CoL), a product developed through the partnership of Species 2000 and the Integrated Taxonomic Information System (ITIS), is the most comprehensive and authoritative global index of species currently available. It consists of a single integrated species checklist and taxonomic hierarchy. The catalogue holds essential information on the names, relationships and distributions of over 1.4 million species, and this number continues to rise. The key features of the CoL are the species checklist, management classification, and integration of global species databases. It provides critical species information on synonymy, higher taxa and distribution. Two versions of the checklist are available. The Dynamic Checklist is always up to date, while the Annual Checklist is a snapshot of the entire catalogue. CoL provides a web service[44] for retrieving data from both versions of the checklist.

Species 2000[45] is a network of database organisations that engages with taxonomists around the world in order to develop a uniform and validated index of the world’s species (plants, animals, fungi and microbes) by integrating several global databases that deal with the major groups of organisms.

The Integrated Taxonomic Information System[46] (ITIS) provides authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world. ITIS is meant to serve as a standard to enable the comparison of biodiversity data sets, and therefore aims to incorporate classifications that have gained broad acceptance in the taxonomic literature and by professionals who work with the taxa concerned. Data conform to the International Code of Botanical Nomenclature and the International Code of Zoological Nomenclature. Ranks in the animal kingdom below subspecies are not included as these ranks are not allowed in the zoological code. The botanical code allows the ranks variety, subvariety, forma, and subforma. ITIS adopted a five kingdom system – Monera, Protista, Plantae, Fungi, Animalia. ITIS makes practical decisions as to the placement of protists within the five kingdom framework. The ITIS SOAP web service[47] provides 51 functions to retrieve data. These include common functions like getting full record, accepted name, or hierarchy, as well as uncommon functions like getting credibility rating, taxon currency, or jurisdiction values. JSON[48] and JSON-P[49] based services are also available.

The Pan-European Species directories Infrastructure[50] (PESI) provides standardised and authoritative taxonomic information by integrating and securing Europe’s taxonomically authoritative species name registers and nomenclators (name databases) and associated expert networks that underpin the management of biodiversity in Europe. PESI integrates three European Focal Points Networks: Fauna Europaea, European Register of Marine Species (ERMS), and Euro+Med PlantBase and is now part of the broader initiative on taxonomic data standards known as EU-nomen. PESI offers web services[51] based on the platform-independent SOAP/WSDL standard. Every record retrieved from PESI has a Globally Unique Identifier (GUID). Many records in PESI have an LSID[52] (Life Sciences Identifier). However, it is now accepted that HTTP URIs can perform a similar naming task, are less technically complex to set up, and follow W3C architecture best practices.

The Global Names Architecture[53] (GNA) is a system of databases, programs, and web services – a cyberinfrastructure – that will be used to discover, index, organise and interconnect on-line information about organisms and their names. It is a communal open environment that manages names so that we can manage information about organisms and serve the needs of biologists. The main component of the GNA is the Global Names Index[54] (GNI) that provides a list of all names that have been used for organisms. Within this list lie all of the nomenclaturally correct names, all of the names that are accepted as tokens for taxa, and all of the taxonomic metadata for biodiversity informaticians.

GNA offers access to various services and tools either as web services or Ruby implementations. The Global Names Recognition and Discovery (GNRD) service accepts text documents, images, and other files, performs OCR and discovers names in these files. The Global Names Index, as a service, resolves names against known sources. It uses exact or fuzzy matching as required. Its second version is in development. The Biblio service is a parser for discovery of bibliographic citations. All these services provide their output in JSON or XML format.
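The difference between exact and fuzzy resolution can be sketched with the Python standard library. This is not the matching algorithm used by the Global Names Index; the reference list, the similarity measure (difflib) and the cutoff value are all assumptions made for illustration.

```python
import difflib

# Hypothetical reference list standing in for a names source.
KNOWN_NAMES = [
    "Panthera leo", "Panthera pardus",
    "Loxodonta africana", "Protea cynaroides",
]


def resolve_name(query, known=KNOWN_NAMES, cutoff=0.85):
    """Return (matched_name, match_type).

    match_type is 'exact', 'fuzzy' or 'none'. Whitespace is normalised
    before matching; the cutoff is an arbitrary illustrative threshold.
    """
    normalised = " ".join(query.split())
    if normalised in known:
        return normalised, "exact"
    candidates = difflib.get_close_matches(normalised, known, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "none"


for name in ["Panthera  leo", "Pantera leo", "Quercus robur"]:
    print(name, "->", resolve_name(name))
```

Exact matching is tried first so that a correctly spelled name is never replaced by a near neighbour; fuzzy matching is only a fallback for misspellings and OCR errors.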

Geospatial Data

Traditional geospatial data are overwhelmingly used by individuals as local vector or raster data sets, either in file systems or, increasingly, stored in spatially aware databases – which include open source (SpatiaLite, variants of PostgreSQL and MySQL) and licensed software (Oracle, DB2, MS SQL Server, and more).

Making these data sets accessible as standardised services on the web is increasingly simple, based either on open-source software such as GeoServer, MapServer, and GeoNode, or on server functions included in licensed software such as ArcGIS, Intergraph, and others.

The major standards for serving data, as managed by the OGC, are the following:

  • Web Map Services (WMS): These are used to serve vector or raster data sets over the web as images – corresponding to a request filtered primarily on the bounding box. These services are efficient and well established, and suitable for data visualisation and context.
  • Web Feature Services (WFS): These services offer vector data to the client in encodings of GML (Geography Markup Language) – typically as XML. The geometry of features as well as their attributes are transferred to the client. Data sets can be large, hence the services are efficient only if filters in respect of geographic area and other attributes are employed. For web-based processing and client-based querying, WFS is usually required. Additionally, WFS-T allows transactions, enabling updates from a client application.
  • Web Coverage Services (WCS): WCS are designed for multidimensional data sets in time and space, of which two-dimensional raster images are a special case. As such, WCS is best suited to typical data sets associated with continuous monitoring in both time and space. It duplicates the capabilities of NetCDF to some degree, and NetCDF is in the process of being aligned and integrated into an extended WCS standard. WCS supports two fundamental filtering processes: slicing, which reduces the dimensionality of the data, and trimming, which retains dimensionality. WCS-T supports transactions and allows for updates.
  • Web Map Tile Services (WMTS): WMTS supports tiling services over the web. These services are very well suited to fast, contextual applications such as backdrop maps. The speed is achieved by pre-processing tiles for all zoom levels for the entire coverage – which can then be served quickly on demand. It is usually unsuitable for dynamic data.
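To make the bounding-box request pattern of WMS concrete, the sketch below assembles a GetMap URL from the standard WMS 1.3.0 query parameters. The server address and layer name are hypothetical, and CRS:84 (longitude, latitude order) is chosen here only to keep the axis order of the bounding box unambiguous.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wms_getmap_url(base_url, layer, bbox, width=800, height=600):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url(
    "https://example.org/geoserver/wms",
    "demo:vegetation",
    (16.0, -35.0, 33.0, -22.0),
)
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

The response to such a request is a rendered image rather than the underlying features, which is why WMS suits visualisation and context while WFS or WCS is needed for processing.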

The traditional encoding and protocol for these services have been XML over HTTP, but new developments include GeoJSON[55] – which is less verbose than XML and better suited to client-side parsing. OGC standards are being extended to accommodate GeoJSON.
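As a brief illustration of the GeoJSON encoding, the following converts two point occurrences into a FeatureCollection. The coordinates and property names are invented for the example; note that GeoJSON positions are written in longitude, latitude order.

```python
import json


def occurrence_to_feature(scientific_name, lon, lat, event_date):
    """Wrap one point occurrence as a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "scientificName": scientific_name,
            "eventDate": event_date,
        },
    }


# Hypothetical occurrences, not real data.
collection = {
    "type": "FeatureCollection",
    "features": [
        occurrence_to_feature("Panthera leo", 31.59, -24.99, "2012-06-14"),
        occurrence_to_feature("Protea cynaroides", 18.42, -34.08, "1998-11-02"),
    ],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```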

Finally, NetCDF was developed independently of the OGC as a community standard and, together with HDF-4 and HDF-5, is widely used for data exchange in the climate, weather, and ocean-observing communities. While these formats are peripheral to biodiversity observation, an increased focus on composite indicators will require integration of biophysical data sources to an increasing degree.

Interoperability Protocols and Catalogue Services

Accessing the data can be achieved in several ways. The simplest solution is just to create an archive file in any of the above-mentioned data standards, and use HTTP to expose the data. Darwin Core Archive is a format that has been created for packaging the EML and DwC files for such distribution.
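The packaging step can be sketched as follows: a data file, a metadata document and a small descriptor are zipped together. The descriptor layout follows the Darwin Core text guidelines as understood here and should be checked against that guide; the EML content is a placeholder stub, not a schema-valid EML document, and all record values are invented.

```python
import io
import zipfile

# Hypothetical occurrence data with a header line.
OCCURRENCE_TXT = (
    "occurrenceID,scientificName,eventDate\n"
    "urn:example:occ:1,Panthera leo,2012-06-14\n"
    "urn:example:occ:2,Protea cynaroides,1998-11-02\n"
)

# Descriptor mapping file columns to Darwin Core terms.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>
"""

# Placeholder only: a real archive carries a complete EML document here.
EML_STUB = "<eml><dataset><title>Example occurrence data set</title></dataset></eml>\n"


def build_archive():
    """Return the bytes of a zip file holding data, descriptor and metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", OCCURRENCE_TXT)
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", EML_STUB)
    return buffer.getvalue()


archive_bytes = build_archive()
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    names = sorted(archive.namelist())
print(names)
```

Because the result is a single static file, it can be exposed at any HTTP address and harvested without a query protocol.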

The biodiversity community has also developed three interoperability protocols for distributed queries (methods to encode SQL queries in XML). The first such protocol was DiGIR, which was modelled after the Z39.50 protocol, but was based on XML. DiGIR was pioneered by VertNet and MaNIS, and was deployed by OBIS and GBIF worldwide. At the same time in Europe, the BioCASe protocol was developed for exchanging ABCD data. The best features of each of these protocols were later included in the TAPIR protocol. All these protocols are still widely used, although their role is diminishing with the wide adoption of Darwin Core Archive. It turned out that distributed queries were not used often, and a simpler harvesting protocol is sufficient for most uses.

Standards-based catalogue services such as CSW and OAI-PMH have not yet found their way into the biodiversity informatics domain. This may change in future as the community integrates more diverse data families (OGC WxS, NetCDF, SensorWeb) and shares metadata with global registries such as ICSU-WDS and GEOSS.

[1] “Biodiversity community” used in the sense of providers, users, and processors of biodiversity data.

[2] http://www.opengeospatial.org/standards/om

[3] http://portal.opengeospatial.org/files/?artifact_id=41579

[4] http://portal.opengeospatial.org/files/?artifact_id=41510

[5] https://teamwork.niwa.co.nz/display/NZEIIF/Biodiversity+Interoperability+through+Open+Geospatial+Standards

[6] http://rs.tdwg.org/dwc/

[7] http://dublincore.org/

[8] http://wiki.tdwg.org/ABCD/

[9] http://www.geocase.eu/efg

[10] http://knb.ecoinformatics.org/software/eml/

[11] http://rs.gbif.org/schema/eml-gbif-profile/

[12] http://data.lter-europe.net/deims/

[13] http://www.gbif.org/orc/?doc_id=5424

[14] http://environmentontology.org/

[15] http://www.insdc.org

[16] http://www.ncbi.nlm.nih.gov/

[17] http://www.ebi.ac.uk/ena/

[18] http://www.ddbj.nig.ac.jp

[19] http://www.insdc.org/documents/feature-table

[20] http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

[21] http://gensc.org

[22] http://gensc.org/index.php?title=Biodiversity_Working_Group

[23] http://biocodecommons.org/workshops/sob.html

[24] http://biocodecommons.org/workshops/bioCollections/

[25] http://biscicol.org

[26] http://bioportal.bioontology.org/ontologies/BCO

[27] http://bioportal.bioontology.org/ontologies/PCO

[28] http://tools.gbif.org/dwca-validator/extension.do?id=http://gensc.org/ns/mixs/terms/Sample

[29] http://tools.gbif.org/dwca-validator/extension.do?id=http://rs.gbif.org/terms/1.0/TaxonAbundance

[30] http://qiime.org/

[31] http://www.standardsingenomics.org

[32] http://gensc.org/projects/gcdml/

[33] http://www.barcodeoflife.or

[34] http://ibol.org

[35] http://www.barcodeoflife.org/content/about/what-cbol

[36] http://www.ecbol.org

[37] http://www.barcodeoflife.org/sites/default/files/legacy/pdf/DWG_data_standards-Final.pdf

[38] http://www.ggbn.org/

[39] http://www.dnabank-network.org/

[40] http://wiki.bgbm.org/dnabankwiki/index.php/ABCDDNA

[41] http://rs.tdwg.org/dwc/terms/guides/text/

[42] http://code.google.com/p/pliniancore/

[43] http://www.catalogueoflife.org/

[44] http://www.catalogueoflife.org/content/web-services

[45] http://www.sp2000.org

[46] http://www.itis.gov/

[47] http://www.itis.gov/ws_develop.html

[48] http://json.org/

[49] http://json-p.org/

[50] http://www.eu-nomen.eu/

[51] http://www.eu-nomen.eu/portal/webservices.php

[52] http://www.ipni.org/lsids.html

[53] http://www.globalnames.org/

[54] http://gni.globalnames.org/

[55] http://geojson.org/