Published: 2 June 2026
As a software company, EarthScience Information Systems (EScIS), like many other groups, uses AI extensively in its internal workflows. A key finding from our work is that, when AI is used for data manipulation, the instructions sent to the AI engine must be highly specific to achieve reliable, repeatable results. Validation of the output by a supervising person is an essential part of the process, and that validation can be time-consuming and complex.
We feel that broad and reliable application of the technology depends on packaging AI processes into a product that already contains the specific instructions needed to complete the task, and that guides users through validation in a methodical and efficient manner. In this article, we describe how we are applying this philosophy to a specific need raised by a section of our ESdat users.
A recurring challenge we hear from organizations in our sector is that they have received data for a site of interest, but it is only available as a PDF report and not in a format intended for data exchange.
Extracting data from PDFs has traditionally been labor-intensive, and conventionally coded software struggles to automate much of it. AI can excel in this area; however, because it can also introduce errors, it needs to sit within a consistent, methodical quality control process in which knowledgeable people pay close attention to detail at each step.
The best results come from a hybrid system that integrates AI with coded software. Coded software provides consistent manipulation and processing of the AI-extracted data, which helps people validate the results and import them into the destination database.
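As an illustration of the role coded software can play, the sketch below runs deterministic checks over records returned by an AI extraction step and flags anything a reviewer should look at. It is a minimal example under stated assumptions, not a description of how ESdat is implemented: the `ExtractedResult` fields, the `ALLOWED_UNITS` set, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of units accepted by the destination database.
ALLOWED_UNITS = {"mg/L", "ug/L", "mg/kg"}


@dataclass
class ExtractedResult:
    """One analytical result as returned by the AI step (assumed shape)."""
    sample_id: str
    analyte: str
    value: str  # raw text, e.g. "0.05" or "<0.001"
    unit: str
    issues: list = field(default_factory=list)


def parse_value(text):
    """Return (number, below_detection_limit); raise ValueError if not numeric."""
    cleaned = text.strip()
    below_limit = cleaned.startswith("<")
    if below_limit:
        cleaned = cleaned[1:]
    return float(cleaned), below_limit


def check_results(results):
    """Attach issues to each record and return only the records needing review."""
    seen = set()
    for r in results:
        try:
            parse_value(r.value)
        except ValueError:
            r.issues.append(f"value {r.value!r} is not numeric")
        if r.unit not in ALLOWED_UNITS:
            r.issues.append(f"unit {r.unit!r} is not recognised")
        key = (r.sample_id, r.analyte)
        if key in seen:
            r.issues.append("duplicate sample/analyte pair")
        seen.add(key)
    return [r for r in results if r.issues]
```

Because the checks are ordinary code, they behave the same way on every run, so the reviewer's attention goes to the flagged records rather than to re-reading every value.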
EScIS is developing tools embedded in ESdat that guide users through that process. The aims are to make it streamlined and repeatable; to integrate the key strengths of AI, coded software, and human oversight; and to automate and track all data manipulation tasks so that the people supervising can focus their attention where human input is critical: on validation.
The process to support extracting Analytical Results from a PDF Laboratory Certificate of Analysis is summarized in Figure 1. Validation steps are in blue, while the AI step is in orange, and coded automation steps are in black.
Figure 1: Key steps in the use of AI to extract and validate data from PDF reports.
The key principles of this workflow are:
- The process should be guided by the software as a series of logical, repeatable steps.
- Anyone with the attention to detail and the knowledge of the workflow needed for the validation steps should be able to perform the process.
- AI plays a key role, but the surrounding software-enabled workflow makes it truly usable.
- All data extraction and manipulation steps should be automated, traceable, and reviewed.
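One way to picture the "automated, traceable, and reviewed" principle is a small lineage log that records every step applied to the data and whether a person has signed it off. The sketch below is only an assumption about how such tracking could look; the `Lineage` class, its method names, and the record fields are hypothetical and are not taken from ESdat.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data):
    """Short, stable hash of JSON-serialisable data, used to link steps together."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class Lineage:
    """Records each processing step so that it can be traced and reviewed later."""

    def __init__(self):
        self.entries = []

    def run(self, name, kind, func, data):
        """Apply func to data and log the step; kind is 'ai', 'code' or 'human'."""
        result = func(data)
        self.entries.append({
            "step": name,
            "kind": kind,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "at": datetime.now(timezone.utc).isoformat(),
            "reviewed_by": None,
        })
        return result

    def sign_off(self, name, reviewer):
        """Mark every logged step with this name as reviewed."""
        for entry in self.entries:
            if entry["step"] == name:
                entry["reviewed_by"] = reviewer

    def unreviewed(self):
        """Names of steps that still need a human sign-off."""
        return [e["step"] for e in self.entries if e["reviewed_by"] is None]
```

In this sketch an import into the destination database could be blocked until `unreviewed()` returns an empty list, which keeps the human validation steps from being skipped.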
The above reflects the approach to AI that we have adopted at EarthScience Information Systems: AI can be incorporated to provide software solutions that were not previously feasible, and in this hybrid process traditional coded software provides the surrounding workflow, making it as intuitive and easy as possible for the people supervising the process to validate the data, perform quality control, and track data lineage.
EarthScience Information Systems anticipates this will be available for extracting data from Certificates of Analysis issued by some key laboratories by late 2026, with other laboratories and other data tables (such as those included in the body of a report) to follow.
Categories
Data Management
Keywords
Environmental Data Management, Environmental Data