209-4 Conversion of the Treatise on Invertebrate Paleontology volumes into a FAIR database
Session: Deep-Time Earth and the AI Revolution
Presenting Author:
Bruce Lieberman
Authors:
Lieberman, Bruce S. (1), López Carranza, Natalia (2), Ogg, James (3), Sivathanu, Aditya (4), Chang, Kevin (5), Ye, Jieping (6), Xiang, Zhongyuan (7), Wei, Juye (8), Du, Wen (9)
(1) Treatise on Invertebrate Paleontology, University of Kansas, Lawrence, KS, USA
(2) Biodiversity Institute, University of Kansas, Lawrence, KS, USA
(3) Key Lab of Deep-time Geography and Environment Reconstruction, Chengdu University of Technology, Chengdu, Sichuan, China; Deep-Time Digital Earth, Purdue University, West Lafayette, Indiana, USA
(4) School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA
(5) School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA
(6) Zhejiang Lab, Hangzhou, China
(7) GeoGPT, Zhejiang Lab, Hangzhou, China
(8) GeoGPT, Zhejiang Lab, Hangzhou, China
(9) Geologic TimeScale Foundation, Purdue University, West Lafayette, Indiana, USA
Abstract:
A fruitful collaboration is ongoing between the Treatise on Invertebrate Paleontology (TIP), the Deep-Time Digital Earth project (DDE), and the GeoGPT-AI team. The TIP is an encyclopedic resource providing expert-vetted data on taxonomy, stratigraphy, biogeography, morphology, ecology and evolution of all major fossil invertebrate and microfossil phyla. More than 50 volumes containing > 30,000 pages and 12,000 figures produced by hundreds of paleontologists have been published since 1953. These are now available as open access PDFs and house a tremendous wealth of data relevant for scientific studies. However, data in PDFs are hard to extract, so the TIP and GeoGPT/DDE are working to make them more FAIR – Findable, Accessible, Interoperable and Reusable. Computational approaches are needed.
To accomplish this goal, two approaches have been applied. In the first, Python scripts convert TIP PDFs into text files, which are then curated and standardized manually to create consistent text blocks for each taxon. Using regular expressions, paleobiologically relevant terms and data are extracted and compiled into a spreadsheet. Each row represents a taxon and contains information such as taxon author, type specimen, description, age span, geographic distribution, etc. In the second, a deep learning (GeoGPT) workflow is used. The Python-based approach is slower and more labor-intensive, but the resulting data are accurate. The GeoGPT AI approach is faster, but idiosyncratic formatting of PDFs, such as when text is interrupted by figures, creates difficulties and errors with extraction. However, improvements in AI will likely mean this faster approach will yield more successful results in the near future.
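The regular-expression step can be illustrated with a minimal Python sketch. The entry layout, the pattern, the field names, and the functions `parse_taxon_block` and `blocks_to_csv` are all assumptions made for illustration; they are not the project's actual scripts, and real TIP entries are considerably more varied. The sample genus is invented.

```python
import csv
import io
import re

# Hypothetical, simplified entry layout (not the actual TIP format):
#   Genus AUTHOR, year [*type species; designation]. Description. Age: region.
TAXON_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<author>[A-Z][A-Za-z&.\s]+?),\s+(?P<year>\d{4})\s+"
    r"\[\*(?P<type_species>[^;\]]+);\s*(?P<designation>[^\]]+)\]\.\s+"
    r"(?P<description>.+)\.\s+"
    r"(?P<age>[^.:]+):\s*(?P<distribution>[^.]+)\.$"
)

FIELDS = ["genus", "author", "year", "type_species", "designation",
          "description", "age", "distribution"]


def parse_taxon_block(block):
    """Return a dict of fields for one curated taxon block, or None if it does not match."""
    # Collapse line breaks and repeated spaces left over from PDF conversion.
    normalized = " ".join(block.split())
    match = TAXON_PATTERN.match(normalized)
    return match.groupdict() if match else None


def blocks_to_csv(blocks):
    """Write one spreadsheet row per parseable taxon block and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for block in blocks:
        row = parse_taxon_block(block)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()


# Invented example entry, used only to exercise the pattern.
SAMPLE = ("Examplus SMITH, 1900 [*Examplus typicus; OD]. "
          "Shell small,\nsmooth. Upper Ordovician: Europe.")
```

Blocks that fail to match return `None` and are skipped here; in a curated workflow they would more plausibly be flagged for manual review than silently dropped.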
There are challenges, including non-standardized geologic age assignments in the TIP. To address this, lookup tables automatically assign the corresponding international chronostratigraphic age and its associated numerical age (in Myr). This enables diversity curves for thousands of genera in clades, such as brachiopods, graptolites and trilobites (and smaller subclades within each), to be accessed by anyone. The DDE has created open-access relational databases on global geologic formations and on chronostratigraphy that automatically link their fossil fields with the detailed information and imagery on genera in the TIP.
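A lookup table of this kind, and the diversity count it makes possible, might look like the following Python sketch. The legacy labels, the table `AGE_LOOKUP`, and the function names are hypothetical, and the boundary ages are rounded placeholders rather than values from the International Chronostratigraphic Chart; a real table would use the current chart values and far finer subdivisions.

```python
# Illustrative lookup: legacy age label -> (standard unit, base in Ma, top in Ma).
# Boundary ages are rounded placeholders, NOT authoritative chart values.
AGE_LOOKUP = {
    "Cam.": ("Cambrian", 539.0, 485.0),
    "Ord.": ("Ordovician", 485.0, 444.0),
    "Sil.": ("Silurian", 444.0, 419.0),
}


def standardize_range(first_label, last_label):
    """Convert a genus's first and last legacy age labels to (base_ma, top_ma)."""
    base_ma = AGE_LOOKUP[first_label][1]
    top_ma = AGE_LOOKUP[last_label][2]
    return base_ma, top_ma


def diversity_per_unit(genus_ranges):
    """Count genera whose standardized range overlaps each unit in the table.

    genus_ranges maps a genus name to its (first_label, last_label) pair.
    """
    counts = {}
    for unit, unit_base, unit_top in AGE_LOOKUP.values():
        total = 0
        for first_label, last_label in genus_ranges.values():
            base_ma, top_ma = standardize_range(first_label, last_label)
            # Strict inequalities: a range that only touches a boundary
            # is not counted in the neighbouring unit.
            if base_ma > unit_top and top_ma < unit_base:
                total += 1
        counts[unit] = total
    return counts


# Invented genus ranges, used only to exercise the functions.
EXAMPLE_RANGES = {
    "Genus A": ("Cam.", "Ord."),
    "Genus B": ("Ord.", "Ord."),
    "Genus C": ("Ord.", "Sil."),
}
```

Counting range-through occurrences per unit is one simple way to build a diversity curve; it is offered here as an assumption about the general idea, not as the method used by the TIP or DDE teams.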
Once these methods have been applied to all TIP volumes, the entire community will benefit from the open digital data.
Geological Society of America Abstracts with Programs. Vol. 57, No. 6, 2025
doi: 10.1130/abs/2025AM-8306
© Copyright 2025 The Geological Society of America (GSA), all rights reserved.
Category
Topical Sessions
Description
Session Format: Oral
Presentation Date: 10/21/2025
Presentation Start Time: 02:25 PM
Presentation Room: HBGCC, 301C