Structuring Descriptive Data of Organisms — Requirement Analysis and Information Models
- Data that describe organisms in a structured form are indispensable not only for taxonomic and identification purposes, but also for many phylogenetic, genetic, or ecological analyses. By analyzing existing information models and performing selected fundamental requirement analyses, the present work contributes to a broader understanding of these forms of data. It falls into an interdisciplinary area between biology and information science. The term “descriptive data” is understood here in a broad sense: as descriptions of individuals, populations, or taxa, intended for various purposes (e.g., genetic, phylogenetic, diagnostic, taxonomic, or ecological), and covering a wide array of observation methods and data types (e.g., morphological, anatomical, genetic, physiological, molecular, or behavioral data). The position of descriptive data in the context of biodiversity framework concepts (covering, e.g., nomenclatural data, specimen collection data, or resource management) is discussed. A number of fundamental problems arise when modeling biological descriptive data. The ways in which existing data exchange formats, information models, and software applications address them are studied, and possible future solutions are outlined. One such solution, the information model for the software “DiversityDescriptions (DeltaAccess)”, is one of the results of this thesis and is fully documented (Ch. 7). This entity-relationship model fully supports the concepts of the traditional DELTA data exchange format (Description Language for Taxonomy; TDWG standard since 1986).
It further improves on DELTA by introducing “modifiers” as a new terminology class, by introducing a more flexible system for handling statistical measures, by improving the handling of multilingual data sets, by supporting subset and filter features for concurrent collaborative editing (instead of supporting these for report-generation purposes alone), by supporting improved character attributes for creating natural language descriptions from structured descriptions, and by adding metadata for a data set so that data can be exchanged without external documentation. In preparation for a future improved information model for descriptive data, the results of three requirement analyses are presented: a data-centric analysis of general concepts, a process-centric analysis of identification tools, and a high-level use case analysis. The first analysis (Ch. 4) is a structured inventory of fundamental approaches and problems involved in collecting and summarizing scientific descriptions of organisms. It is informed in part by current practices in information science, comparative data analysis, statistical, descriptive, or phylogenetic software applications, and data exchange formats in biodiversity informatics. At the end, three topics are discussed in particular detail (“Federation and modularization of terminology”, “Modifiers”, and “Secondary classification resulting in description scopes”). Apart from phylogenetic analysis, identification is the most common use of descriptive data. The second analysis (Ch. 5) therefore studies the processes, data structures, and presentational and user interface requirements for printable and computer-aided identification tools (“keys”). Finally, a general use case analysis is performed with the goal of creating a framework of high-level use cases into which present as well as future requirements may be integrated (Ch. 6). All three requirement analyses are exploratory and do not fulfill the formal criteria of software engineering.
They identify many requirements not addressed by the relational DiversityDescriptions model. Some of these could only be explored and await future solutions. For others, solutions are proposed (some of which could already be incorporated into the design of SDD, an XML-based TDWG standard since 2005): the traditional data types are changed into an extensible character type model. Data aggregation concepts were recognized as fundamental. Complementary to data aggregation, the present and potential future use of data inheritance along the lines of the taxonomic hierarchy is briefly studied. The concept of calculated characters could be addressed only insofar as the mapping between values can potentially be generalized. Character decomposition models are studied, but ultimately the traditional character concept, supplemented with a forest of ontologies for compositional and generalization concept hierarchies, is preferred as a more general concept. Both the traditional character subset and character applicability models can be integrated into concept hierarchies.
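
The complementary roles of data aggregation and data inheritance mentioned above can be illustrated with a minimal sketch. It is not the DiversityDescriptions or SDD model: the `Taxon` class, the `aggregate` and `inherit` functions, and the example taxa and characters are all hypothetical, and only categorical character states are considered. Aggregation is assumed here to be a simple union of states moving up the hierarchy, and inheritance a fallback to the nearest ancestor moving down.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """Hypothetical node in a taxonomic hierarchy."""
    name: str
    # character name -> categorical states recorded directly for this taxon
    states: dict[str, set[str]] = field(default_factory=dict)
    children: list[Taxon] = field(default_factory=list)


def aggregate(taxon: Taxon) -> dict[str, set[str]]:
    """Aggregation (upwards): union a taxon's states with those of all descendants."""
    result = {ch: set(vals) for ch, vals in taxon.states.items()}
    for child in taxon.children:
        for ch, vals in aggregate(child).items():
            result.setdefault(ch, set()).update(vals)
    return result


def inherit(taxon: Taxon, inherited: dict[str, set[str]] | None = None) -> dict[str, dict[str, set[str]]]:
    """Inheritance (downwards): a taxon lacking a character takes it from its nearest ancestor."""
    effective = {**(inherited or {}), **taxon.states}
    out = {taxon.name: effective}
    for child in taxon.children:
        out.update(inherit(child, effective))
    return out


# Invented example data: one genus with two species.
species_a = Taxon("Species a", {"petal color": {"white"}})
species_b = Taxon("Species b", {"petal color": {"yellow"}})
genus = Taxon("Genus", {"leaf arrangement": {"opposite"}}, [species_a, species_b])

genus_summary = aggregate(genus)
effective_descriptions = inherit(genus)
```

In this sketch the genus summary contains both petal colors although neither was recorded at genus level, while each species description gains the leaf arrangement recorded only for the genus. A real model would also have to handle numeric and statistical data, modifiers, and conflicts between inherited and directly recorded values.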