Persistent identifier to cite or link this item: http://hdl.handle.net/10553/134771
Title: A Bag-of-Words Approach for Information Extraction from Electricity Invoices
Authors: Sánchez, Javier
Cuervo-Londoño, Giovanny A.
UNESCO classification: Research
Keywords: electricity invoice
machine learning
support vector machine
Publication date: 2024
Journal: AI
Abstract: In the context of digitization and automation, extracting relevant information from business documents remains a significant challenge. It is typical to rely on machine-learning techniques to automate the process, reduce manual labor, and minimize errors. This work introduces a new model for extracting key values from electricity invoices, including customer data, bill breakdown, electricity consumption, and marketer data. We evaluate several machine learning techniques: Naive Bayes, Logistic Regression, Random Forests, and Support Vector Machines. Our approach relies on a bag-of-words strategy and custom-designed features tailored for electricity data. We validate our method on the IDSEM dataset, which includes 75,000 electricity invoices with eighty-six fields. The model converts PDF invoices into text and processes each word separately using a context of eleven words. The results of our experiments indicate that Support Vector Machines and Random Forests perform exceptionally well in capturing numerous values with high precision. The study also explores the advantages of our custom features and evaluates performance on unseen documents. The precision obtained with Support Vector Machines is 91.86% on average, peaking at 98.47% for one document template. These results demonstrate the effectiveness of our method in accurately extracting key values from invoices.
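The abstract describes processing each word separately with a context of eleven words and a bag-of-words representation. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the window is centered on the target word (five words on each side), which the abstract does not state. The padding token, the function names, the sample text, and the tiny vocabulary are all hypothetical and chosen only for illustration.

```python
from collections import Counter

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def context_windows(words, size=11):
    """Yield (target, window) pairs, one per word.

    Assumption: the window is centered on the target word, so size=11
    means five words before and five words after it.
    """
    half = size // 2
    padded = [PAD] * half + list(words) + [PAD] * half
    for i, target in enumerate(words):
        yield target, padded[i:i + size]


def bag_of_words(window, vocabulary):
    """Count how often each vocabulary term occurs in the window, ignoring order."""
    counts = Counter(w.lower() for w in window)
    return [counts[term] for term in vocabulary]


# Illustrative invoice text and vocabulary (invented for this example).
text = "Total amount due 45.30 EUR for the billing period"
words = text.split()
vocabulary = ["total", "amount", "due", "eur", "period"]

# One feature vector per word of the invoice text.
features = [(target, bag_of_words(window, vocabulary))
            for target, window in context_windows(words)]
```

In a setup like the one the abstract outlines, each feature vector would presumably be passed to a classifier, such as a Support Vector Machine, that predicts which invoice field, if any, the target word belongs to.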
URI: http://hdl.handle.net/10553/134771
ISSN: 2673-2688
DOI: 10.3390/ai5040091
Source: AI 2024, 5(4), 1837-1857
Collection: Articles
Items in ULPGC accedaCRIS are protected by copyright, with all rights reserved, unless otherwise indicated.