E-mail spam filter based on unsupervised neural architectures and thematic categories: design and analysis

Cabrera-León, Ylermi; García Báez, Patricio; Suárez-Araujo, Carmen Paz

Title:	E-mail spam filter based on unsupervised neural architectures and thematic categories: design and analysis
Authors:	Cabrera-León, Ylermi; García Báez, Patricio; Suárez-Araujo, Carmen Paz
UNESCO classification:	3325 Telecommunications technology; 120304 Artificial intelligence
Keywords:	Spam filtering; Artificial neural networks; Self-organizing maps; Thematic category; Term frequency; et al.
Publication date:	2019
Publisher:	1860-949X
Serial publication:	Studies in Computational Intelligence
Abstract:	Spam, or unsolicited messages sent massively, is one of the threats that affect email and other media. Its huge quantity generates considerable economic and time losses. A solution to this issue is presented: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two steps, preprocessing and processing, each based on a different computation model: programmed and neural (using a Kohonen SOM), respectively. The system was optimized using a dataset built with ham from “Enron Email” and spam from two different sources: traditional (user’s inbox) and spamtrap-honeypot. The preprocessing was based on 13 thematic categories found in spam and ham, Term Frequency (TF) and three versions of Inverse Category Frequency (ICF). 1260 system configurations were analyzed with the most widely used performance measures, with the optimal ones achieving AUC > 0.95. Results were similar to those of other researchers on the same corpus, although they used different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested with different datasets, characterized by heterogeneous origins, dates, users and types, including samples of image spam. In these new tests the filter obtained 0.75 < AUC < 0.96. The degradation in system performance can be explained by the differences in the characteristics of the datasets, particularly their dates. This phenomenon is called “topic drift”; it commonly affects all classifiers and, to a larger extent, those that use offline learning, as is the case here, especially in adversarial ML problems such as spam filtering.
URI:	https://accedacris.ulpgc.es/handle/10553/42220
ISBN:	978-3-319-99282-2
ISSN:	1860-949X
DOI:	10.1007/978-3-319-99283-9_12
Source:	Studies in Computational Intelligence [ISSN 1860-949X], v. 792, p. 239-262
Collection:	Book chapter
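
The abstract mentions a preprocessing stage that combines Term Frequency (TF) with Inverse Category Frequency (ICF) over thematic categories. The following is only a minimal sketch of that general idea, not the authors' implementation: this record does not define the three ICF versions used in the chapter, so the logarithmic form ICF(t) = log(C / cf(t)) is an assumption, and the function names and the toy category vocabularies are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within a single message."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()} if total else {}


def inverse_category_frequency(category_vocab):
    """ICF(t) = log(C / cf(t)), where C is the number of categories and
    cf(t) is the number of categories whose vocabulary contains t.
    This particular formula is an assumption made for illustration."""
    num_categories = len(category_vocab)
    cf = Counter()
    for vocab in category_vocab.values():
        cf.update(set(vocab))
    return {term: math.log(num_categories / n) for term, n in cf.items()}


def tf_icf(tokens, icf):
    """Weight each term of a message by TF * ICF; unknown terms get 0."""
    tf = term_frequency(tokens)
    return {term: weight * icf.get(term, 0.0) for term, weight in tf.items()}


# Hypothetical toy data: three categories instead of the 13 in the chapter.
category_vocab = {
    "finance": {"loan", "credit", "free"},
    "health": {"pills", "free"},
    "work": {"meeting", "report"},
}
icf = inverse_category_frequency(category_vocab)
weights = tf_icf(["free", "loan", "loan", "meeting"], icf)
```

In this sketch a term that appears in several categories, such as "free", receives a lower ICF than a term confined to one category, so it contributes less to the message's feature vector. A vector of such weights could then be fed to a classifier such as a SOM.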
