
### Machine learning in clinical and epidemiological research: Isn't it time for biostatisticians to work on it?

##### *Azzolina D.; Berchialla P.; Giudici F.; Gregorio C.; Milanese A.;*

##### 2019

#### Abstract

In recent years, there has been widespread cross-fertilization between Medical Statistics and Machine Learning (ML) techniques (1). A broad range of ML methods are increasingly being used in many medical fields, such as oncology, internal medicine, cardiology, pediatrics, and genetics (2–4), with a particular focus on the development of prediction tools. For example, in personalized medicine, ML techniques have been used to derive the probability of treatment response for each patient (5). In oncology and cardiology, the ML approach has focused on prognosis and risk estimation (6,7). Moreover, ML approaches have often been applied to guide treatment decisions, to counsel patients, and to address the critical steps of clinical trial design (8). Despite its popularity, it is difficult to find a universally agreed-upon definition of ML. It is widely recognized that the major difference between ML and a traditional statistical approach lies in their purpose: ML methods focus on making predictions as accurate as possible, whereas statistical models aim at inferring relationships between variables. However, many statistical models can make predictions too. Conversely, ML techniques can provide different degrees of interpretability, from neural networks, which sacrifice interpretability for predictive power, to the highly interpretable lasso regression approach. Traditional approaches for developing predictive models are mainly regression-based, such as linear or logistic regression. Such models often use a small number of variables to predict the value of an outcome or the probability of an event, and they are ubiquitous in clinical research because they estimate easy-to-interpret parameters (e.g. odds ratios, relative risks, and hazard ratios).
However, traditional regression-based models rely on strong assumptions, such as additivity, linearity, and distributional assumptions, that may be unrealistic in clinical practice, where the relationships between a subject's characteristics and a clinical endpoint are likely to be complex. ML methods possess several attractive properties, such as flexibility and freedom from assumptions, that make them valuable alternatives to traditional statistical approaches in medical research. As identified by Goldstein and colleagues (6), ML methods can address modeling challenges that are often difficult to handle with traditional statistical models. The most important are:

1. Non-linearities. Traditional regression models assume that the effect of a predictor on the endpoint increases (or decreases) uniformly throughout the range of the predictor. This assumption does not always hold in practice, where the relationships between covariates and the outcome may be non-linear. For example, the risk of death is likely to increase sharply with increasing age.
2. Effect modification or statistical interaction. This occurs when the effect of a predictor changes depending on the value of another variable. For example, it has been observed that air pollution may have a differential effect on adverse cardiovascular events depending on genotype (9). ML techniques can automatically detect such heterogeneity of effects, which, in contrast, must be specified a priori through interaction terms in classical statistical approaches.
3. Few observations and many predictors. Datasets in clinical research are often characterized by a small number of patients/observations and many predictors. In such settings, it is crucial to develop predictive tools that provide robust estimates. Traditional regression methods are known to have several limitations in these situations, especially when the aim is to select the most relevant risk factors. Although the rise of ML has been associated with an unprecedented wealth of data, several strategies can be adopted to overcome the problems of small datasets when building ML predictive models (10).
4. Multiplicity of models. Many models with different sets of features have nearly the same predictive accuracy. This is partly because, in real-world data, it is very common to have some degree of correlation between features. Building a single data model means focusing on only one possible representation of the mapping from features to outcome (11).

In 2001, in his seminal paper “Statistical Modeling: The Two Cultures” (12), Leo Breiman described data modeling and algorithmic modeling as two contrasting cultures. Over the last two decades, medical statistics and ML have blended more and more. ML techniques hold several potential benefits that are increasing their popularity in clinical research. Their rapid spread highlights their crucial role in integrating complex biomedical and healthcare data in scenarios where traditional statistical methods show limitations (13). However, ML models must be appropriately developed, evaluated, and eventually tailored to different situations (14). The growing trend in their application in clinical and epidemiological research requires that they be assessed against the established methodological standards for traditional prediction model research. This makes ML techniques another effective and powerful tool for carrying out data analysis. They can also be used to complement traditional statistical methods. The two approaches should be considered complementary rather than competitive. A proper blending may provide a wide variety of statistical and computational tools for theory testing, knowledge discovery, prediction, and decision making.
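Challenges 1 and 2 above can be illustrated with a minimal simulation sketch: a main-effects linear regression cannot represent a threshold effect of age or a pollution-by-genotype interaction, while a tree ensemble picks both up without either being specified in advance. This is not from the paper; the data, variable names, and effect sizes are all hypothetical, and scikit-learn is assumed to be available.

```python
# Hypothetical illustration of non-linearity and interaction (challenges 1-2).
# All data are simulated; variable names and coefficients are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(40, 90, n)
pollution = rng.uniform(0, 1, n)
genotype = rng.integers(0, 2, n)          # 0/1 hypothetical risk allele

# Risk rises sharply only after age 75 (non-linearity), and pollution
# matters only for carriers of the risk allele (interaction).
risk = 0.02 * np.maximum(age - 75, 0) + 1.5 * pollution * genotype
y = risk + rng.normal(scale=0.1, size=n)

X = np.column_stack([age, pollution, genotype])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)           # main effects only
forest = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

print("linear R^2:", round(linear.score(X_te, y_te), 2))
print("forest R^2:", round(forest.score(X_te, y_te), 2))
```

On held-out data the forest explains substantially more variance than the main-effects model, even though neither the threshold nor the interaction term was ever written down explicitly.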
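Challenge 3 (few observations, many predictors) can likewise be sketched with the lasso mentioned earlier: with more candidate predictors than patients, an L1 penalty shrinks most coefficients exactly to zero and returns a sparse, interpretable model. Again, this is a simulated toy example under assumed settings, not a method from the paper.

```python
# Hypothetical p >> n setting: 60 "patients", 200 candidate predictors,
# of which only the first 3 truly affect the outcome. Simulated data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]               # the only real signals
y = X @ beta + rng.normal(scale=0.5, size=n)

# Cross-validated lasso: the L1 penalty performs variable selection,
# which ordinary least squares cannot even be fit here (p > n).
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected predictors:", selected)
```

With a reasonable signal-to-noise ratio the three true predictors are recovered, typically along with a handful of noise variables, which is the usual behavior of cross-validated lasso selection.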
In the end, both ML and medical statistics are concerned with the same question: how do we learn from data?