Science-Ready Data for Discovery
Session: Transforming Earth and Planetary Science Through Data and Data Management: In Honor of MSA Distinguished Public Service Medal Awardee, Kerstin Lehnert
Presenting Author:
Kerstin Lehnert
Author:
Lehnert, Kerstin Annette (1)
(1) Lamont-Doherty Earth Observatory, Columbia University, Palisades, NY, USA
Abstract:
Data wrangling – the process of cleaning, harmonizing, and integrating data into a usable form – is still needed for much of our shared research data. It represents a major bottleneck to the application of AI (artificial intelligence), ML (machine learning), and statistical methods to extracting new knowledge from geoscience data. Two problems persist: data, even data of similar type, are fragmented across different data systems, and the community-governed data standards that would make data machine-readable and interoperable are lacking. While the FAIR Principles (Wilkinson et al. 2016) have helped to improve data management practices, they are not the data standards needed to make data analysis-ready or science-ready. The term ‘science-ready’ has been applied to satellite data to describe "data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort" [1]. Large synthesis databases such as PetDB, the Astromat Synthesis, GEOROC, and the LEPR/traceDs (Library of Experimental Phase Relations) are delivering science-ready geochemical and petrological data to the community, but they currently perform data wrangling tasks manually, which is time-consuming and not sustainable in the long term. EarthChem and the Astromaterials Data System (Astromat) are pursuing several approaches to overcome barriers to science-ready data:
(1) Both EarthChem and Astromat helped found, and participate in, the global OneGeochemistry initiative, which works to develop and promote data standards for laboratory analytical data, with the goal of harmonizing data structures, metadata, and vocabularies so that data can be integrated across a global network of databases.
(2) EarthChem and Astromat have developed the concept for a new tool called EDIT (Enhanced Data Ingestion Tool) that will allow researchers to compile and store their lab analytical data, together with the metadata about analytical procedures and samples, and export them as consistently structured files (JSON schemas) that can be archived and also machine-read into the synthesis databases. EDIT will build on a data entry tool that curators currently use to enter data into PetDB and the Astromat Synthesis.
(3) EarthChem and Astromat will explore the use of ML for automating data validation and data harmonization to ensure data quality and minimize the manual effort of data curators.
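The export step described in (2) can be illustrated with a minimal Python sketch. It is not the EDIT design, which the abstract does not specify: the record fields, the list of required fields, the allowed units, and the function names are all hypothetical and chosen only to show how a consistently structured JSON file could be written once and then checked by machine before ingestion.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical field names and unit vocabulary; none are taken from EDIT,
# PetDB, or the Astromat Synthesis.
REQUIRED_FIELDS = ("sample_id", "material", "method", "analyte", "value", "unit")
ALLOWED_UNITS = {"wt%", "ppm", "ppb"}


@dataclass
class AnalyticalRecord:
    """One analytical measurement plus the metadata needed to interpret it."""
    sample_id: str
    material: str
    method: str
    analyte: str
    value: float
    unit: str


def export_records(records: List[AnalyticalRecord]) -> str:
    """Serialize records to a JSON string with a stable key order."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


def validate(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_FIELDS
        if record.get(name) in (None, "")
    ]
    value = record.get("value")
    if value not in (None, "") and (
        isinstance(value, bool) or not isinstance(value, (int, float))
    ):
        problems.append("value is not numeric")
    unit = record.get("unit")
    if unit not in (None, "") and unit not in ALLOWED_UNITS:
        problems.append(f"unknown unit: {unit}")
    return problems


# Made-up example record, for illustration only.
example = AnalyticalRecord("S-1", "basalt glass", "EMPA", "SiO2", 50.2, "wt%")
exported = export_records([example])
for item in json.loads(exported):
    print(item["sample_id"], validate(item))
```

In this sketch the same structure serves both purposes named in the abstract: the JSON text can be archived as a file, and the parsed records can be checked by rule before they are loaded into a database. A rule-based check of this kind is only a stand-in for the validation and harmonization described in (3), which the authors propose to explore with ML.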
[1] https://www.spectralreflectance.space/p/the-challenge-of-analysis-ready-data-in-earth-observation-d978ea1df97
Category
Topical Sessions
Description
Preferred Presentation Format: Oral
Categories: Geoinformatics and Data Science; Geochemistry; Petrology, Igneous