Machine learning in perioperative medicine: a systematic review

Background Risk stratification plays a central role in anesthetic evaluation. The use of Big Data and machine learning (ML) offers considerable advantages for the collection and evaluation of large amounts of complex health-care data. We conducted a systematic review to understand the role of ML in the development of predictive post-surgical outcome models and risk stratification. Methods Following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines, we searched for studies published from 1 January 2015 to 30 March 2021. A systematic search of Scopus, CINAHL, the Cochrane Library, PubMed, and MeSH databases was performed; the search strings included different combinations of the keywords “risk prediction,” “surgery,” “machine learning,” “intensive care unit (ICU),” “anesthesia,” and “perioperative.” We identified 36 eligible studies. The quality of reporting of the prediction models was evaluated using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist. Results The most frequently considered outcomes were mortality risk, systemic complications (pulmonary, cardiovascular, acute kidney injury (AKI), etc.), ICU admission, anesthesiologic risk, and prolonged length of hospital stay. Not all the studies completely followed the TRIPOD checklist, but the quality was overall acceptable, with 75% of studies showing an adherence rate to TRIPOD of more than 60%. The most frequently used algorithms were gradient boosting (n = 13), random forest (n = 10), logistic regression (LR; n = 7), artificial neural networks (ANNs; n = 6), and support vector machines (SVM; n = 6). The best-performing models were random forest and gradient boosting, with AUC > 0.90. Conclusions The application of ML in medicine appears to have great potential. 
From our analysis, depending on the input features considered and on the specific prediction task, ML algorithms seem to predict outcomes more accurately than validated prognostic scores and traditional statistics. Thus, our review encourages the healthcare domain and artificial intelligence (AI) developers to adopt an interdisciplinary and systemic approach to evaluate the overall impact of AI on perioperative risk assessment and on other health care settings as well.


Background
Risk stratification is a central part of the anesthetic evaluation. By identifying high-risk patients, it is possible to conduct a specific risk/benefit analysis, to reduce the risk of unexpected complications, to achieve a targeted perioperative optimization, to carefully plan the anesthesiologic management, and to obtain accurate and precise informed consent [1][2][3].
Over time, several scores have been published, from the most generic, like the American Society of Anesthesiologists Physical Status (ASA-PS) [4], to the most specific ones, such as the European System for Cardiac Operative Risk Evaluation (EuroSCORE) [5] or the General Surgery Acute Kidney Injury Risk Index Classification System [6]. Unfortunately, these scores have some limits, mainly due to the lack of tailored predictions.
In the last decade, interest in artificial intelligence (AI), including machine learning (ML) methods, has increased exponentially [2]. Considered an extension of traditional statistics, AI differs from standard approaches in its ability to learn from examples and mistakes, to improve continuously with the introduction of new data, and to create a model for individualized patient care [7].
Thanks to the growing digitalization of health systems, large amounts of data have become available. The implementation of new technologies and the development of prediction algorithms have paved the way for novel possibilities to exploit these huge data collections. Among the several branches of healthcare in which ML has aroused enthusiasm, perioperative medicine is showing promising results. Given its specific characteristics, this analytical technique is suitable for the creation of predictive models, specifically concerning the optimization of resources and the development of warning score systems [8,9]. The application of these algorithms allows early detection and prediction of acute critical illness, facilitating the management of high-risk patients [10].
More recently, the COVID-19 pandemic highlighted the importance of AI-based models for the fast development of algorithms that could integrate readily available data, helping hospital systems and clinicians deliver optimal patient care [11].
The use of ML techniques for the creation of predictive models of perioperative complications is in continuous expansion.
The aim of our review is to clarify the role of ML in perioperative settings, evaluating currently available predictive outcome models, the types of ML algorithms used more frequently, and their proved efficacy.

Literature search
This systematic review was conducted according to Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines (http://prismastatement.org/documents/PRISMA_2020_checklist.pdf).
In the last 10 years, there has been an exponential increase in the literature concerning the application of AI in medicine. Therefore, we decided to perform the search in this time frame to include more homogeneous and easily comparable studies. We included studies if they evaluated ML models in surgical settings for the prediction of perioperative risk. Both prospective and retrospective studies were eligible for inclusion. The following types of study were excluded: papers published prior to 2015, papers concerning outpatient settings, animal studies, studies on pediatric populations, and studies written in languages other than English. Furthermore, primary studies evaluating strictly surgical outcomes and systematic reviews were considered ineligible.

Data extraction and quality assessment
The primary aim of our study was to assess the main perioperative outcomes in which ML methods are used, and their efficacy among different algorithms.
Two reviewers independently screened the selected articles, and a third reviewer resolved any discrepancies.
To assess the reporting quality of all included studies, we used the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) checklist [12], which provides guidance for extracting relevant information and calculating summary scores to determine the adherence of a primary prediction model study to TRIPOD. Two independent reviewers assessed the compliance of each selected study with the items described in the checklist. Moreover, to facilitate data extraction and scoring, the studies were analyzed according to study design, predictor selection, outcome assessment, applied model, and its validation. The checklist includes 22 main items, of which ten are divided into sub-items, all with four potential answer options: "yes," "no," "referenced," and "not applicable." Once each item of the checklist has been completed, the adherence to TRIPOD is automatically calculated. We expressed adherence to TRIPOD on a scale from 0 to 100%, assuming that higher adherence to the TRIPOD checklist indicated a more accurate study.
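How such an adherence percentage might be computed can be sketched in a few lines of Python. This is a minimal illustration, not the official TRIPOD scoring tool: it assumes that an item counts as adhered when answered "yes" or "referenced", that "not applicable" items are dropped from the denominator, and that the negative option is spelled "no". The function names and the example answers are hypothetical.

```python
# Sketch of a TRIPOD adherence score under the assumptions stated above.
ADHERED = {"yes", "referenced"}
VALID = ADHERED | {"no", "not applicable"}


def tripod_adherence(answers):
    """Return the adherence (0-100%) for one study.

    `answers` maps a checklist item id to one of the four answer options.
    Items answered "not applicable" are excluded from the denominator
    (an assumption of this sketch).
    """
    unknown = set(answers.values()) - VALID
    if unknown:
        raise ValueError(f"unexpected answer option(s): {sorted(unknown)}")
    applicable = [a for a in answers.values() if a != "not applicable"]
    if not applicable:
        raise ValueError("no applicable items to score")
    adhered = sum(a in ADHERED for a in applicable)
    return 100.0 * adhered / len(applicable)


def share_above(rates, threshold=60.0):
    """Return the percentage of studies whose adherence exceeds `threshold`."""
    if not rates:
        raise ValueError("no adherence rates given")
    return 100.0 * sum(r > threshold for r in rates) / len(rates)


# Hypothetical study with five scored items (not data from the review).
example_study = {
    "1": "yes",
    "2": "yes",
    "3a": "no",
    "3b": "referenced",
    "4a": "not applicable",
}
example_rate = tripod_adherence(example_study)  # 3 of 4 applicable items
print(f"adherence: {example_rate:.1f}%")
```

The second helper mirrors the way the review summarizes quality, as the share of studies whose adherence exceeds 60%.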

Results
One hundred forty-seven papers were identified through database searching. After the removal of duplicates, 89 articles were screened, and 43 were found to be ineligible after reading the abstracts. Of the 46 articles reviewed in full text, 10 were excluded because of an inadequate clinical setting or because they concerned a pediatric population. Finally, 36 articles were included in the review (Fig. 1).
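The selection flow above is simple arithmetic and can be checked with a short Python sketch. The four input counts are taken from the text; the number of duplicates is not reported and is derived here under the assumption that duplicates were the only records removed before screening. Variable names are illustrative.

```python
# Counts reported in the text.
identified = 147            # records found through database searching
screened = 89               # records screened after duplicate removal
excluded_on_abstract = 43   # ineligible after reading the abstract
excluded_on_full_text = 10  # inadequate setting or pediatric population

# Derived counts (duplicates_removed is an assumption, see lead-in).
duplicates_removed = identified - screened
full_text_reviewed = screened - excluded_on_abstract
included = full_text_reviewed - excluded_on_full_text

print(f"duplicates removed: {duplicates_removed}")
print(f"full-text articles reviewed: {full_text_reviewed}")
print(f"articles included: {included}")
```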
Table 1 outlines the characteristics of the final selected articles, including the design, cohort, and objective of each study, as well as the ML methods used and the best performance.
Our analyses showed that more than 95% of the included studies were published after 2018 and were almost entirely performed in the USA and Asia (Fig. 2). The quality of the studies selected for the review was acceptable, with 75% of studies showing an adherence rate to TRIPOD of more than 60% (Fig. 3). Specifically, in the first section of the checklist (Title and Abstract), a mean of 42% of studies adhered to the TRIPOD items. Concerning the methods section, all the articles defined the study design or the source of data, while 53% of papers described the handling of missing data. In the results section, the measures applied and the models used were not always appropriately reported in the included studies: 8% of papers presented the full prediction model and explained how to use it, while 19% of studies reported performance measures for the prediction model.
Nearly all manuscripts discussed the limitations of the study and gave an overall interpretation of the results.
Supervised models were used in most cases (Fig. 5). The most frequently used algorithms were gradient boosting (n = 13), random forest (n = 10), logistic regression (LR; n = 7), artificial neural networks (ANNs; n = 6), and support vector machines (SVM; n = 6). Deep learning, decision trees, and Naïve Bayes were other models commonly applied in the included manuscripts.
In all of the reviewed papers, ML algorithms proved to be effective in outcome prediction. Half of the selected studies compared different types of ML to identify the best performing method. Gradient boosting and random forest were found to be the models with the highest accuracy, achieving an area under the curve (AUC) greater than 0.90 in most cases. Moreover, a few studies compared ML-derived models with conventional scores and found that the ML models outperformed them [25].

Discussion
The number of manuscripts regarding ML implementation in health care settings has steadily increased over the last few years, as clearly suggested by a recently published review on the utility of AI in providing decision support to clinicians in the ICU setting [49,50].
In fact, the availability of electronic health records, and the diffusion of Big Data systems have enabled new possibilities in data collection and storage. The interpretation of this amount of data with traditional methods could not only be extremely complicated, but even reductive. In this regard, the advent of AI-based technologies has opened up new perspectives, providing a different form of research [51].
Anesthesia and the assessment of perioperative risk appear to be excellent fields in which to develop and apply ML systems, as reported in the literature [52,53] and confirmed by our research. The identification of modifiable risk factors and the subsequent optimization of the preoperative phase appear to be crucial to decrease the incidence of post-operative complications [54]. Furthermore, risk stratification allows adequate informed consent to be obtained and accurate anesthesiologic planning, tailored to each patient. ML systems are well suited to this context, where the possibility of collecting a large amount of data, together with the selection of variables by the model itself, allows the discovery of new factors and a different interpretation of already known items. Thus, the availability of interpretations and predictions in real time could usher in a new era of anesthesia.
From a practical point of view, the method starts with the extraction and collection of multi-source data; subsequently, these data are fed into ML systems that return interpretative and predictive models, providing tools suitable for daily use that can be compared with validated scores. Among conventional scores, the one most frequently used for comparison is the ASA-PS Classification System, which has been in use for over 60 years. Comparing existing scores with new models is an essential step to understand whether this investment of time and resources could ultimately improve perioperative risk stratification. Moreover, in addition to the risk of post-operative complications, ML would also be able to answer more complex questions and create models capable of providing early predictions of adverse events, thus enabling perioperative optimization.
The results that emerge from this systematic analysis are promising. Most of the studies that compared ML models with traditional scores confirmed that the ML models outperformed them. In particular, the use of AI-based technologies provided excellent results regarding events of great interest in the field of anesthesia, such as post-induction hypotension and post-intubation hypoxia [13], or the risk of AKI or delirium after surgery [19,27,55].
Finally, it is worth underlining that not only clinical outcomes are relevant, but also administrative ones, such as length of hospital stay or the need for admission to intensive care, which may have a great impact on hospital logistics and economic strategies (Fig. 6). A systematic use of AI might allow innovative results to be achieved in other fields as well, such as scientific research and health organization, especially when associated with other data management technologies such as Big Data and Blockchain.
Among the several ML algorithms currently applied, gradient boosting and random forest were found to be the models with the best performance and the highest accuracy, achieving an area under the curve (AUC) greater than 0.90. Still, it is not possible to make a uniform evaluation and draw conclusions about the best algorithm for predictive models of perioperative complications, because of the heterogeneity of settings and the differences in the algorithms evaluated. The lack of uniformity of the included studies prevented us from performing a meta-analysis using univariate and multivariate random effect models. Moreover, the models in most of the studies lack external validation.
Further, even though we use AUC as an evaluation criterion in practice, we acknowledge its limits in the setting of AI, especially in the case of unbalanced datasets. Note that other criteria can also be used to evaluate ML models, such as model relevance, efficiency, and interpretability [56]. However, to achieve data sets of high quality and size, screening each step of the process, from data collection to the selection of the ML model and its algorithm, is of paramount importance. Despite its growing diffusion, the use of these technologies in perioperative medicine faces limitations and challenges. Along with technological progress, data quality will inevitably become increasingly important. A viable choice could be blockchain technology, to ensure adequate quality and enable secure data sharing. Its implementation could allow the safe management of large files and consequently the approval of algorithms that are progressively developed [57].
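The limits of AUC on unbalanced datasets noted above can be illustrated with a small synthetic example. The Python sketch below is not drawn from any included study: the 100 labels and scores are invented, and the two helper functions are plain-Python stand-ins for the standard pairwise definition of ROC AUC and for average precision (a summary of the precision-recall curve). With 2 positive cases among 100, a model that ranks 5 negatives above both positives still reaches an AUC of about 0.95, while average precision is only about 0.23.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive,
    with cases ranked by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive case")
    return total / hits


# Synthetic, unbalanced data: 2 events among 100 patients (invented values).
labels = [1, 1] + [0] * 98
scores = [0.8, 0.8] + [0.9] * 5 + [0.1] * 93

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC: {auc:.3f}")
print(f"Average precision: {ap:.3f}")
```

In this toy case, five of the seven highest-ranked patients are false alarms, which the AUC alone does not reveal.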
Furthermore, as recently reported for the ICU setting [50], despite the potential role of AI in improving clinical outcomes, the vast majority of developed models remain within the testing and prototyping environment. A uniform and structured approach could enable the implementation and safe delivery of AI technologies in the ICU and in health care settings overall.
Finally, the creation of predictive scores should follow precise rules. Unfortunately, these technologies are so innovative that the evaluation of their performance is not always straightforward. Therefore, a new version of the TRIPOD statement specific for AI/ML systems (TRIPOD-ML) is currently under development. It will focus on the introduction of ML prediction algorithms to establish methodological and reporting standards for ML studies in health care [58].
Technologies are becoming more and more present in health-care settings. Both clinical and organizational decision-making processes can take advantage of these technologies. Nevertheless, high-quality studies are needed to demonstrate the real impact of ML in this context.
Our research group is starting a study that aims to validate a safe discharge score from the PACU (postanesthesia care unit) using AI techniques; the score will no longer be generic, but based on the local clinical reality and on the specific population. Similarly, we are working on the application of AI algorithms in OR (operating room) management settings, developing a prospective trial "Bloc-op" (NCT 05106621), in collaboration with the engineering department, to optimize OR organization and resource allocation. We believe that multidisciplinary collaboration is essential to integrate AI technologies into routine clinical practice, thus leading to a great improvement in the quality of care.
We propose that AI should become an essential technical and non-technical skill for future anesthesiologists. To achieve this goal, a primary focus should be the education and training of physicians and researchers, who need to be adequately prepared on the uses and limitations of AI-based technologies.

Conclusions
This systematic review shows the potential role of ML in perioperative medicine, and particularly in the creation of models for the prediction of perioperative risk. Our results are encouraging.
Undoubtedly, the exploitation of a large amount of data is possible solely thanks to the application of AI. ML algorithms offer increasingly precise solutions in terms of optimization of the perioperative risk. A personalized risk/benefit analysis can result in an accurate prediction in terms of length of hospital stay and ICU recovery, thus positively influencing patient management and health costs.
Further research is needed to develop a framework standardizing AI evaluation measures. This will be possible with interdisciplinary approaches, allowing constant improvement in the quality of care.