Global Species Databases in the Catalogue of Life

Biodiversity knowledge network

Benefitting from the expertise of the whole global community

The GBIO provided the following summary:

Researchers in biodiversity have long had a culture of curating and annotating data — from identifying specimens to correcting and cleaning up entire downloaded datasets. These efforts are a key part of the data validation process: even with the best automated tools, identifying and correcting most errors still requires an expert, human eye. Yet these annotations are not always made available to the original data owners, and even when they are, there may be neither the resources nor the mechanisms in place to incorporate them. As a result, mistakes get replicated or have to be repeatedly corrected, duplicating effort, while there is little incentive for researchers to continue to correct and annotate records more widely.

Data aggregators generally encourage users to report mistakes; several GBIF national nodes have developed systems of data curation, including amateur networks to curate citizens’ observations, while the EU OpenUp! project includes a data quality toolkit for GBIF data. Some projects are already using expert curation for aggregated data, for example the Encyclopedia of Life and the Fish Barcode of Life Initiative (FISH-BOL). However, too often these rely on ad hoc systems and require extra effort from contributors, especially if they want to make corrections on many different sites, while data providers or publishers may not feel confident in trusting changes submitted over the Internet. The next step will be to agree with individual institutions and projects how data cleanup efforts can be recognized and valued, putting the incentives in place to ensure that annotations are made and fed back into the system. In combination with the fitness-for-use and annotation component – which considers the systems needed to enable annotations to be integrated into the data – this will be the first step towards making distributed data curation the norm.

In the short term, the priority should be developing a shared identity management system for contributors, whether professionals or citizen scientists, so that they can have a common identity and contribution history across platforms — particularly the key data networks and publishers. In the medium term, key data networks will be able to trace back any changes to the original contributor and over time it will be possible to use metrics to value contributions automatically, based on the contributor’s past history. In the long term, annotating data will become the norm and the curation of data will come to be considered a shared responsibility among the biodiversity community.
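As a rough illustration of the short- and medium-term goals above, the sketch below shows how a shared contributor identifier could tie annotations made on different platforms into one contribution history, and how a simple metric could be derived from that history. Everything in it is an assumption made for illustration: the `Annotation` fields, the platform names, the identifier format and the acceptance-rate metric are hypothetical and are not taken from the GBIO summary or from any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One correction or annotation submitted to a data platform (hypothetical model)."""
    contributor_id: str  # shared identifier used on every platform (assumed format)
    platform: str        # where the annotation was submitted (hypothetical names)
    record_id: str       # the record that was annotated
    accepted: bool       # whether the data owner incorporated the change


def contribution_history(annotations):
    """Group annotations by contributor, regardless of platform."""
    history = defaultdict(list)
    for annotation in annotations:
        history[annotation.contributor_id].append(annotation)
    return dict(history)


def acceptance_rate(history, contributor_id):
    """Share of a contributor's annotations that were accepted (0.0 if none exist)."""
    items = history.get(contributor_id, [])
    if not items:
        return 0.0
    return sum(a.accepted for a in items) / len(items)


# Illustrative data only: two hypothetical contributors on two hypothetical platforms.
SAMPLE = [
    Annotation("contributor-001", "aggregator-a", "rec-1", True),
    Annotation("contributor-001", "aggregator-b", "rec-7", True),
    Annotation("contributor-001", "aggregator-b", "rec-9", False),
    Annotation("contributor-002", "aggregator-a", "rec-2", True),
]

HISTORY = contribution_history(SAMPLE)
```

Because every annotation carries the same identifier, a change on any platform can be traced back to its contributor, and a call such as `acceptance_rate(HISTORY, "contributor-001")` gives one possible, deliberately simple, measure of past reliability. A real metric would need to be agreed by the institutions and projects involved.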

In recent years, social networking approaches have been adopted by many research networks and web platforms, including offerings from academic publishers and researcher tools such as ResearchGate, DataCite and ORCID. These have stimulated efforts to standardise researcher identifiers and to graph the relationships between researchers, institutions, publications, datasets, projects and funding bodies. Wikipedia and other Wikimedia Foundation projects, as well as OpenStreetMap, also demonstrate the possibilities arising from coordinated international efforts.

More specifically, some major biodiversity informatics resources are the product of networks of experts. The Catalogue of Life and the World Register of Marine Species (WoRMS) support a network of Global Species Databases, each produced by an expert community and coming together to form globally important datasets. Citizen science networks and volunteer digitisation activities form other important networks, often with significant inputs from knowledgeable experts. Many other expert networks operate to produce biodiversity knowledge products, including Red List groups and communities working on issues around invasive species and biosecurity, although there are often only weak linkages between such work and the products of biodiversity informatics.

There are massive remaining challenges that restrict the involvement of the global expert community as owners, curators and beneficiaries of work to improve digital access to biodiversity knowledge. A common pattern shown by many biodiversity informatics infrastructures is to aggregate data from a broad range of experts, organise these data through largely automated processes, and deliver products which may be of primary benefit to third parties rather than to the original contributors. Taxonomists and other expert groups have difficulty contributing to the overall quality of the resulting data products and may receive little direct benefit from spending time working on digital resources. Career incentives are also often biased against work which does not lead directly to published papers.

Our vision remains that digital biodiversity knowledge should be organised in ways that support all researchers and other stakeholders, and that long-term management and improvement of this knowledge should be possible online under the oversight of experts regardless of nationality or language. Progress has been slow in this area and there are clearly many factors which need to be addressed to achieve these goals.

GBIC2 will include a working session to explore the challenges which may impede progress in these areas. The goal will not be to develop a multi-year roadmap and set of fully refined priorities for bridging between the Biodiversity Knowledge Network and biodiversity informatics infrastructures, although recommendations from the session will be taken forward for this purpose. The key goal in the context of GBIC2 is to understand the nature of the impediments which limit progress towards full ownership of digital knowledge by the international research community. This will enable the GBIC2 workshop to consider the best approach to address these impediments within the governance and planning processes of an international coordination mechanism.