Persistent identifier to cite or link to this item:
https://accedacris.ulpgc.es/jspui/handle/10553/168365
| Title: | Privacy-Aware Synthetic Tabular Data Generation for Healthcare: Application to Sepsis Detection | Authors: | Macias-Fassio, Eric; Morales Moreno, Aythami; Pruenza, Cristina; Fierrez, Julian; Espósito, Carlos |
UNESCO classification: | 33 Technological sciences | Keywords: | Machine Learning; Sepsis Detection; Synthetic Data |
Publication date: | 2026 | Journal: | Bioengineering | Abstract: | Background: Machine learning-based Artificial Intelligence (AI) models have shown significant potential in the biomedical field, offering promising advances in diagnostics, personalized medicine, and patient care. However, to build these models, we have to deal with important challenges, including (1) the scarcity and low quality of available datasets in many important applications and (2) privacy concerns associated with sensitive patient data. Synthetic data (SD) generation has emerged as a promising strategy to address these challenges, yet many existing approaches struggle to simultaneously preserve privacy and accurately model tabular data, the predominant format in healthcare. Methods: We propose Kernel Density Estimation–K-Nearest Neighbors (KDE-KNN), a privacy-aware tabular data generation method, and evaluate its performance against state-of-the-art techniques. Using sepsis detection as a real-world case study, we assess both data utility and privacy protection. Results: Models trained on KDE-KNN-generated SD outperformed those trained on real data across both internal testing and external validation. In particular, a support vector machine achieved superior performance when trained on SD relative to real data. This gain is likely driven by the balanced class distribution of the synthetic dataset, underscoring KDE-KNN’s utility as an effective data balancing strategy. Consistent performance in external validation further supports the robustness and generalizability of the proposed approach. Privacy evaluation indicated a lower re-identification risk, with a mean distance to closest record of 4.971 between synthetic and real samples, compared with 2.715 among real samples. Conclusions: KDE-KNN effectively captures underlying population distributions while generating high-quality SD that preserve statistical fidelity and protect sensitive information. By balancing the trade-off between utility and privacy, the method produces representative datasets without exposing individual records. These findings position KDE-KNN as a valuable tool for data-scarce and privacy-sensitive applications, with broad potential across healthcare and other data-driven domains. | URI: | https://accedacris.ulpgc.es/jspui/handle/10553/168365 | DOI: | 10.3390/bioengineering13050511 | Source: | Bioengineering [EISSN 2306-5354], v. 13 (5), (May 2026) |
| Collection: | Articles |
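The abstract reports a mean distance to closest record (DCR) of 4.971 between synthetic and real samples, compared with 2.715 among real samples. The following is a minimal sketch of how a metric of this kind can be computed. It assumes Euclidean distance on numeric features that have already been scaled; the abstract does not state which distance function or preprocessing the authors used. The function names and the toy arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Return the (n, m) matrix of Euclidean distances between rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def mean_dcr_between(synthetic, real):
    """Mean distance from each synthetic row to its nearest real row."""
    distances = pairwise_euclidean(synthetic, real)
    return float(distances.min(axis=1).mean())


def mean_dcr_within(real):
    """Mean distance from each real row to its nearest *other* real row."""
    distances = pairwise_euclidean(real, real)
    np.fill_diagonal(distances, np.inf)  # exclude each row's zero distance to itself
    return float(distances.min(axis=1).mean())


# Toy data for illustration only (not from the paper).
real = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
synthetic = np.array([[0.0, 3.0], [4.0, 3.0]])

syn_to_real = mean_dcr_between(synthetic, real)
real_to_real = mean_dcr_within(real)
print(syn_to_real, real_to_real)
```

Under this reading, a synthetic-to-real mean DCR that is larger than the real-to-real baseline suggests that synthetic rows are not near-copies of individual real records, which is the comparison the abstract describes.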
Items in ULPGC accedaCRIS are protected by copyright, with all rights reserved, unless otherwise indicated.