Privacy-Aware Synthetic Tabular Data Generation for Healthcare: Application to Sepsis Detection

Macias-Fassio, Eric; Morales Moreno, Aythami; Pruenza, Cristina; Fierrez, Julian; Espósito, Carlos

Title:	Privacy-Aware Synthetic Tabular Data Generation for Healthcare: Application to Sepsis Detection
Authors:	Macias-Fassio, Eric; Morales Moreno, Aythami; Pruenza, Cristina; Fierrez, Julian; Espósito, Carlos
UNESCO classification:	33 Technological sciences
Keywords:	Machine Learning; Sepsis Detection; Synthetic Data
Publication date:	2026
Journal:	Bioengineering
Abstract:	Background: Machine learning-based Artificial Intelligence (AI) models have shown significant potential in the biomedical field, offering promising advances in diagnostics, personalized medicine, and patient care. Building these models, however, raises two important challenges: (1) the scarcity and low quality of available datasets in many important applications and (2) privacy concerns associated with sensitive patient data. Synthetic data (SD) generation has emerged as a promising strategy to address these challenges, yet many existing approaches struggle to simultaneously preserve privacy and accurately model tabular data, the predominant format in healthcare. Methods: We propose Kernel Density Estimation–K-Nearest Neighbors (KDE-KNN), a privacy-aware tabular data generation method, and evaluate its performance against state-of-the-art techniques. Using sepsis detection as a real-world case study, we assess both data utility and privacy protection. Results: Models trained on KDE-KNN-generated SD outperformed those trained on real data in both internal testing and external validation. In particular, a support vector machine achieved superior performance when trained on SD relative to real data. This gain is likely driven by the balanced class distribution of the synthetic dataset, underscoring KDE-KNN's utility as an effective data balancing strategy. Consistent performance in external validation further supports the robustness and generalizability of the proposed approach. Privacy evaluation indicated a lower re-identification risk, with a mean distance to closest record of 4.971 between synthetic and real samples, compared with 2.715 among real samples. Conclusions: KDE-KNN effectively captures underlying population distributions while generating high-quality SD that preserve statistical fidelity and protect sensitive information. By balancing the trade-off between utility and privacy, the method produces representative datasets without exposing individual records. These findings position KDE-KNN as a valuable tool for data-scarce and privacy-sensitive applications, with broad potential across healthcare and other data-driven domains.
URI:	https://accedacris.ulpgc.es/jspui/handle/10553/168365
DOI:	10.3390/bioengineering13050511
Source:	Bioengineering [EISSN 2306-5354], v. 13 (5), (May 2026)
Collection:	Articles
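
The abstract reports privacy as a mean distance to closest record (DCR): 4.971 from synthetic to real samples versus 2.715 among real samples. The sketch below shows one common way such a figure can be computed. It is illustrative only: the Euclidean metric, the absence of feature scaling, the function name `mean_dcr`, and the toy arrays are all assumptions, not details taken from the paper.

```python
import numpy as np

def mean_dcr(queries, reference, exclude_self=False):
    """Mean distance from each query row to its closest reference row.

    Assumption: Euclidean distance on already-numeric features.
    The paper's actual metric and preprocessing are not stated in the abstract.
    """
    queries = np.asarray(queries, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise distances, shape (n_queries, n_reference).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if exclude_self:
        # Only meaningful when queries and reference are the same array:
        # mask each record's zero distance to itself.
        np.fill_diagonal(dists, np.inf)
    return float(dists.min(axis=1).mean())

# Hypothetical toy data, two features per record.
real = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
synthetic = np.array([[3.0, 4.0], [1.0, 1.0]])

syn_to_real = mean_dcr(synthetic, real)                  # synthetic vs. real
real_to_real = mean_dcr(real, real, exclude_self=True)   # baseline among real records
print(f"synthetic-to-real DCR: {syn_to_real:.3f}")
print(f"real-to-real DCR:      {real_to_real:.3f}")
```

Under this reading, a synthetic-to-real DCR that exceeds the real-to-real baseline suggests that synthetic records are not near-copies of individual real records, which is the comparison the abstract draws.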
