Please use this identifier to cite or link to this item: https://accedacris.ulpgc.es/jspui/handle/10553/165190
Title: Information Extraction from Electricity Invoices with General-Purpose Large Language Models
Authors: Gómez, Javier; Sánchez, Javier
UNESCO Classification: 1203 Computer science
Issue Date: 2026
Journal: arXiv.org
Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
URI: https://accedacris.ulpgc.es/jspui/handle/10553/165190
DOI: 10.48550/arXiv.2604.25927
Appears in Collections: Preprint
Items in accedaCRIS are protected by copyright, with all rights reserved, unless otherwise indicated.