An LLM-based Workflow for Extracting Data and Knowledge from Geoscience Literature
Session: Transforming Earth and Planetary Science Through Data and Data Management: In Honor of MSA Distinguished Public Service Medal Awardee, Kerstin Lehnert
Presenting Author:
Prof. Xiaogang Ma
Authors:
Ma, Xiaogang (1); Zhang, Jiyin (2); Chen, Weilin (3)
(1) University of Idaho, Moscow, Idaho, USA; (2) University of Idaho, Moscow, Idaho, USA; (3) University of Idaho, Moscow, Idaho, USA
Abstract:
In our recent work with the Mindat open data service, we have developed a modular, multi-agent system to automate the extraction of structured data and knowledge from Earth science documents. This process has traditionally relied on expert curation; our system streamlines it using Large Language Models (LLMs) within a Model Context Protocol framework. The system consists of four agent teams that together convert raw PDFs into structured outputs. The key steps are designed to be vocabulary-agnostic, allowing easy adaptation to different controlled vocabularies with minimal human effort. To ensure quality and reliability, the system incorporates mechanisms such as multi-version review, error reflection, and term-matching validation. A case study using the Mineral Deposit Models demonstrates the system’s effectiveness in transforming legacy documents into machine-readable formats. By combining automated workflows, flexible vocabulary alignment, and robust quality checks, our work offers a scalable solution for data and knowledge extraction across Earth science domains, and we welcome reuse by the community.
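The term matching validation mentioned in the abstract could look roughly like the Python sketch below. This is an illustration only, not the authors' implementation: the function names `normalize` and `validate_terms`, the 0.85 similarity cutoff, and the three-mineral vocabulary are all assumptions made for this example. It checks each extracted term against a controlled vocabulary, first by exact match and then by approximate string similarity, and flags anything left over for human review. Because the vocabulary is passed in as a plain list, the same check can be reused with a different controlled vocabulary.

```python
from difflib import get_close_matches


def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def validate_terms(extracted, vocabulary, cutoff=0.85):
    """Map each extracted term to a (status, vocabulary_term) pair.

    Status is "exact", "fuzzy", or "unmatched". The cutoff is an assumed
    value for illustration, not a figure taken from the abstract.
    """
    lookup = {normalize(v): v for v in vocabulary}
    report = {}
    for term in extracted:
        key = normalize(term)
        if key in lookup:
            report[term] = ("exact", lookup[key])
            continue
        # Fall back to approximate matching against the normalized vocabulary.
        close = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        if close:
            report[term] = ("fuzzy", lookup[close[0]])
        else:
            # No acceptable candidate: leave the term for human review.
            report[term] = ("unmatched", None)
    return report


# Hypothetical vocabulary and LLM output, for illustration only.
vocabulary = ["Quartz", "Pyrite", "Chalcopyrite"]
extracted = ["quartz", "Pyrit", "feldspar"]
for term, (status, match) in validate_terms(extracted, vocabulary).items():
    print(term, status, match)
```

Terms reported as "fuzzy" or "unmatched" are natural candidates for a second pass such as the multi-version review or error reflection steps named in the abstract.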
Category
Topical Sessions
Description
Preferred Presentation Format: Oral
Categories: Geoinformatics and Data Science; Mineralogy/Crystallography; Geochemistry