Optical character recognition, Transactions of the Entomological Society of London (1910), via Biodiversity Heritage Library

Published materials

Using data mining and semantic tools to turn unstructured and inaccessible data into information

The GBIO provided the following summary:

Published materials – primarily printed literature, but also images, videos and other multimedia forms such as sound recordings – have long served as the primary means for disseminating biodiversity knowledge. Along with collections and specimens, they also form the primary source of species-level trait and descriptive data, vital for identifications and taxonomic research. Much progress has been made by research institutions and by the Biodiversity Heritage Library (BHL), scanning historical materials into digital formats, while new materials are almost exclusively developed as digital objects. Nevertheless the information in these resources remains largely inaccessible to automated processing due to a lack of internal structure and mark-up and, for older literature, errors introduced during the scanning process. Multimedia objects need consistent indexing to make them properly discoverable. Consistent standards, including text recognition standards, keywording and indexing, data mining techniques and crowd sourcing will enable the community to step up the rate at which such data are made fully accessible. The resulting information, and the tools to generate it, also need to be made freely available to all. Despite progress in recent years, the scale of the task and its importance to the framework means this component requires continued and long-term investment and it will be urgent to find ways to accelerate and streamline the process.

There are already a number of initiatives working on the relevant tools and techniques with research projects investigating automated image extraction and crowd-sourced tagging, data mining, semi-automatic and automatic markup of taxonomic descriptions, and handling multiple languages as well as using structured data in taxonomic publications. The next steps will be to catalogue and define the types of unstructured biodiversity data available and understand their particular challenges, and to agree standards for future publication in a form that makes the data immediately available not just to experts but to searches and automated processing. In the short term, the priority will be to build on some of the existing pilots and implement some full-scale crowd sourcing and automated data mining projects. In the medium term, the software behind such projects should be made available as open source tools. Countries will start to establish their own bibliographies of national biodiversity. New publications will increasingly come in an enhanced, semantically structured form. In the long term, such enhanced publication will be the norm and automated and semi-automated data mining tools will be freely available for unstructured biodiversity data. As a result, complete bodies of thematic or geographical information will be progressively made available as linked datasets.

In recent years, a trend towards open science and open data has stimulated publishers, societies and authors to increase open access both for historical and new publications, and many journals are promoting the importance of sharing underlying data either as supplementary materials or through data papers, with new journals and repositories being developed for this purpose. A range of publishers and academic infrastructures are building knowledge graphs (of variable openness) linking authors, publications, datasets, funding bodies, etc.

More specifically, the Biodiversity Heritage Library continues to promote and support the digitization and free and open sharing of historical literature. Pensoft is a leading example of an academic publisher which is not only supporting open access but also working with biodiversity informatics platforms and researchers to facilitate direct access to structured data underlying taxonomic and ecological papers. Plazi has developed a range of tools and services to parse and expose structured data from a broader range of taxonomic literature.

Many challenges remain. Access to recent literature is still often limited, and supplementary materials are often in semi-structured or non-standard formats. Available optical character recognition tools have systemic problems handling key features of taxonomic literature, especially in older published materials (italicized text, male/female symbols, fractions, ligatures, etc.). Efforts to extract, structure and normalize data from published materials are fragmentary and do not yet combine to produce an integrated knowledge repository spanning both historical and recent publications. There are serious shortcomings in the existing vocabularies, directories, taxonomies and gazetteers that could support more rigorous normalization of extracted data. In the opposite direction, existing biodiversity infrastructures do not supply aggregated data in forms that fully support reference and reuse and allow new research to build seamlessly on a solid foundation of linked open data.

GBIC2 will include a working session to explore the challenges which may impede progress in these areas. The goal will not be to develop a multi-year roadmap and set of fully refined priorities for Published Materials as a component of the biodiversity informatics landscape, although recommendations from the session will be taken forward for this purpose. The key goal in the context of GBIC2 is to understand the nature of the impediments which limit progress in this area towards seamless linkages between historical and new literature and well-organised digital knowledge of the world’s biodiversity. This will enable the GBIC2 workshop to consider the best approach to address these impediments within the governance and the planning processes of an international coordination mechanism.