SFERA Archivio dei prodotti della Ricerca dell'Università di Ferrara

Imbalanced datasets can impair the learning performance of many Machine Learning techniques. Nevertheless, many real-world datasets, especially in the healthcare field, are inherently imbalanced. For instance, in the medical domain, the classes representing a specific disease are typically the minority of the total cases. This challenge justifies the substantial research effort spent in the past decades to tackle data imbalance at the data and algorithm levels. In this paper, we describe the strategies we used to deal with an imbalanced classification task on data extracted from a database generated from the Electronic Health Records of the Mental Health Service of the Ferrara Province, Italy. In particular, we applied balancing techniques to the original data, such as random undersampling and oversampling, and Synthetic Minority Oversampling Technique for Nominal and Continuous (SMOTE-NC). In order to assess the effectiveness of the balancing techniques on the classification task at hand, we applied different Machine Learning algorithms. We employed cost-sensitive learning as well and compared its results with those of the balancing methods. Furthermore, a feature selection analysis was conducted to investigate the relevance of each feature. Results show that balancing can help find the best setting to accomplish classification tasks. Since real-world imbalanced datasets are increasingly becoming the core of scientific research, further studies are needed to improve already existing technique

Machine learning from real data: A mental health registry case study

Gentili, Elisabetta^Primo;Franchini, Giorgia^Secondo;Zese, Riccardo;Alberti, Marco;Ferrara, Maria^{Funding Acquisition};Domenicano, Ilaria;Grassi, Luigi^Ultimo

2024

Abstract

Imbalanced datasets can impair the learning performance of many Machine Learning techniques. Nevertheless, many real-world datasets, especially in the healthcare field, are inherently imbalanced. For instance, in the medical domain, the classes representing a specific disease are typically the minority of the total cases. This challenge justifies the substantial research effort spent in the past decades to tackle data imbalance at the data and algorithm levels. In this paper, we describe the strategies we used to deal with an imbalanced classification task on data extracted from a database generated from the Electronic Health Records of the Mental Health Service of the Ferrara Province, Italy. In particular, we applied balancing techniques to the original data, such as random undersampling and oversampling, and Synthetic Minority Oversampling Technique for Nominal and Continuous (SMOTE-NC). In order to assess the effectiveness of the balancing techniques on the classification task at hand, we applied different Machine Learning algorithms. We employed cost-sensitive learning as well and compared its results with those of the balancing methods. Furthermore, a feature selection analysis was conducted to investigate the relevance of each feature. Results show that balancing can help find the best setting to accomplish classification tasks. Since real-world imbalanced datasets are increasingly becoming the core of scientific research, further studies are needed to improve already existing technique

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2024
			
	DOI
	
				https://dx.doi.org/10.1016/j.cmpbup.2023.100132
			
	Titolo della Rivista
	
				COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE UPDATE
			
	Tutti gli autori
	
						Gentili, Elisabetta; Franchini, Giorgia; Zese, Riccardo; Alberti, Marco; Ferrara, Maria; Domenicano, Ilaria; Grassi, Luigi
					
	Appare nelle tipologie:
	
				03.1 Articolo su rivista

File in questo prodotto:

File	Dimensione	Formato
gentili ML real data 2024.pdf accesso aperto Descrizione: versione editoriale Tipologia: Full text (versione editoriale) Licenza: Creative commons Dimensione 711.24 kB Formato Adobe PDF Visualizza/Apri	711.24 kB	Adobe PDF	Visualizza/Apri

I documenti in SFERA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11392/2533631

Citazioni

ND

14

ND

social impact