Host genetics and clinical settings in COVID-19: predictive models by statistical tools and machine learning algorithms (The FeMiNa study)

Antonica, Bianca

COVID-19 has had a tremendous impact worldwide, straining healthcare systems. Frontline workers had to manage many patients in a short time, with no tools to prioritise those at high risk. Age, male sex, and specific comorbidities appear to increase the severity and mortality of COVID-19. Nevertheless, low-risk individuals may also develop severe disease requiring hospitalisation. For these reasons, considerable efforts have been made since the beginning of the pandemic to identify informative indicators and risk factors influencing SARS-CoV-2 susceptibility and COVID-19 prognosis. Moreover, the unpredictable course of the disease has shown the importance of individual genetics in determining clinical phenotypes. In this context, analytical models such as statistical tools and Artificial Intelligence (AI) approaches have been used to optimise patient management. In a retrospective multicentre (Ferrara-Milano-Napoli) study, we refined the best predictive model for COVID-19 mortality and length of hospital stay, integrating 20 genetic and 12 clinical features. We performed PCA and binary logistic regression, and we trained three machine learning (ML) models (GBM, XGB, RF) on a dataset of 532 Italian patients hospitalised with COVID-19 and recruited before vaccines became available. Comparing the two analytical approaches, the genetic data emerge more clearly in the ML approach. The ML approach allows a better interpretation of the results and, given its more complex nature, highlights hidden relationships among the variables. For mortality, all the models reached high values for the accuracy, AUROC, F1, F2 and PR-AUC metrics. Although optimising GBM for PR-AUC gave better performance (0.473±0.21), we chose to analyse in depth the GBM model optimised for F1, which yields fewer false negatives (27 with F1 optimisation versus 46 with PR-AUC optimisation), since reducing false negatives was our main goal.
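The trade-off between F1, F2 and false negatives mentioned above can be illustrated with a minimal sketch. The function name `f_beta` and all counts below are hypothetical and chosen only for illustration; they are not taken from the study. The sketch computes the F-beta score directly from confusion-matrix counts and shows that a larger beta penalises false negatives more heavily.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion-matrix counts.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    beta = 1 weighs precision and recall equally; beta = 2 weighs recall more.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # With no true positives, false positives or false negatives the score
    # is undefined; returning 0.0 here is a convention, not the only choice.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts (NOT from the study): model A misses fewer positive
# cases, model B raises fewer false alarms.
model_a = {"tp": 80, "fp": 40, "fn": 20}
model_b = {"tp": 60, "fp": 10, "fn": 40}

f1_a, f1_b = f_beta(**model_a, beta=1.0), f_beta(**model_b, beta=1.0)
f2_a, f2_b = f_beta(**model_a, beta=2.0), f_beta(**model_b, beta=2.0)

if __name__ == "__main__":
    print(f"F1: A={f1_a:.3f}  B={f1_b:.3f}")
    print(f"F2: A={f2_a:.3f}  B={f2_b:.3f}")
```

In this toy comparison the two models are close on F1, while F2 separates them more clearly in favour of the model with fewer false negatives, which is the property that matters when a missed high-risk patient is the costliest error.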
We then examined feature importance to understand which features drive the model's decisions: among the top 10 features, more than 50% of the importance was attributable to age, and the remainder was almost equally divided between genetic and clinical features. HLA-DRA rs3135363, IL6 rs1800795, ACE2 rs2285666 and CRP rs2808635 together accounted for 20%, showing that the genetic data did not confound the model but enhanced it. In particular, the HLA-DRA rs3135363 GG homozygous genotype was significantly associated with death both in the whole population (OR=2.03, 95% CI 1.13-3.64; P=0.01) and in the female subgroup (♀OR=2.66, 95% CI 1.16-6.10; P=0.02), roughly doubling the odds of death. We also applied the same workflow to predict the length of hospital stay among survivors in our dataset. Using the GBM and RF algorithms, we did not find any meaningful relationship between the dependent and independent variables (R² = -0.736 and -0.039, respectively): the predictions were no better than always predicting the mean. ML algorithms have created remarkable opportunities to develop better models for risk analysis, reducing uncertainty and ambiguity. In particular, integrating genetics into intelligent systems is crucial for identifying high-risk cases, enhancing the management of hospitalised patients, and preventing severe progression, increasingly enabling P4 medicine. This field of study is essential for applying ML algorithms effectively in clinical practice, including to counter future outbreaks.
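The negative R² values reported for hospital stay can be read with the help of a small sketch. It assumes the usual definition of the coefficient of determination, R² = 1 - SS_res / SS_tot; the function name `r_squared` and the toy lengths of stay are hypothetical and are not study data. Under this definition, always predicting the mean gives R² = 0, and any model that fits worse than the mean gives a negative value.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Hypothetical lengths of stay in days (illustrative only).
stays = [5, 8, 12, 20, 30]
mean_stay = sum(stays) / len(stays)

r2_mean_baseline = r_squared(stays, [mean_stay] * len(stays))  # mean predictor
r2_poor_model = r_squared(stays, [25, 20, 15, 10, 5])          # worse than mean
r2_perfect = r_squared(stays, stays)                           # exact fit

if __name__ == "__main__":
    print(r2_mean_baseline, r2_poor_model, r2_perfect)
```

Read this way, a value such as -0.039 sits just below the mean-predictor baseline, while -0.736 indicates predictions clearly worse than the mean.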
