A machine learning-based approach for individualized prediction of short-term outcomes after anterior cervical corpectomy

Article information

Asian Spine J. 2024;18(4):541-549
Publication date (electronic) : 2024 August 8
doi : https://doi.org/10.31616/asj.2024.0048
Department of Neurosurgery, Mount Sinai Health System, New York, NY, USA
Corresponding Author: Konstantinos Margetis, Department of Neurosurgery, Mount Sinai Health System, New York, NY, USA, Tel: +1-212-241-3649, Fax: +1-212-410-0603, E-mail: konstantinos.margetis@mountsinai.org
Received 2024 February 5; Revised 2024 March 20; Accepted 2024 April 15.

Abstract

Study Design

A retrospective machine learning (ML) classification study for prognostic modeling after anterior cervical corpectomy (ACC).

Purpose

To evaluate the effectiveness of ML in predicting ACC outcomes and develop an accessible, user-friendly tool for this purpose.

Overview of Literature

Based on our literature review, no study has examined the capability of ML algorithms to predict major short-term ACC outcomes, such as prolonged length of hospital stay (LOS), non-home discharge, and major complications.

Methods

The American College of Surgeons’ National Surgical Quality Improvement Program database was used to identify patients who underwent ACC. Prolonged LOS, non-home discharges, and major complications were assessed as the outcomes of interest. ML models were developed with the TabPFN algorithm and integrated into an open-access website to predict these outcomes.

Results

The models for predicting prolonged LOS, non-home discharges, and major complications demonstrated mean areas under the receiver operating characteristic curve (AUROC) of 0.802, 0.816, and 0.702, respectively. These findings highlight the discriminatory capacities of the models: fair (AUROC >0.7) for differentiating patients with major complications from those without, and good (AUROC >0.8) for distinguishing between those with and without prolonged LOS and non-home discharges. According to the SHapley Additive Explanations analysis, single- versus multiple-level surgery, age, body mass index, preoperative hematocrit, and American Society of Anesthesiologists physical status repetitively emerged as the most important variables for each outcome.

Conclusions

This study has considerably enhanced the prediction of postoperative results after ACC surgery by implementing advanced ML techniques. A major contribution is the creation of an accessible web application, highlighting the practical value of the developed models. Our findings imply that ML can serve as an invaluable supplementary tool to stratify patient risk for this procedure and can predict diverse postoperative adverse outcomes.

Introduction

Anterior cervical corpectomy (ACC) with fusion is performed to treat various cervical spine disorders, including degenerative diseases, trauma, neoplasms, and infections. Aside from ACC with fusion, anterior cervical spine surgery approaches also include anterior cervical discectomy and fusion (ACDF), and both have been associated with favorable clinical outcomes [1,2]. Compared with ACC, ACDF can better preserve the biomechanical stability of the cervical spine while decreasing the likelihood of complications, such as implant displacement [3]. However, potential limitations of ACDF include incomplete decompression, spinal cord injury, limited surgical exposure, and an increased incidence of pseudarthrosis [4,5]. Conversely, ACC is more prone to dural tears, cerebrospinal fluid leakage, bone graft displacement, and other complications [6,7]. In degenerative diseases, corpectomy is indicated when compression extends behind the vertebral body beyond the disc space level, as it permits adequate spinal cord decompression in such cases. ACC allows for direct and adequate decompression of the anterior spinal cord, resection of lesions involving the cervical vertebral bodies, and enhancement of cervical lordosis [8]. Given the potential benefits and drawbacks of this technique, preoperatively identifying patients likely to experience post-ACC adverse events is critical.

The monitoring and generation of risk-adjusted estimates for adverse postoperative outcomes aim to curb healthcare costs. Consequently, clinicians must process increasing volumes of complex data, necessitating more advanced analytical techniques [9]. Machine learning (ML)-based predictive models can effectively leverage high-dimensional clinical data to create precise patient risk assessment tools. ML has been successfully utilized to predict short-term outcomes after various spinal conditions and procedures [10–13]. However, based on our literature review, no study has examined the capability of ML algorithms to predict major short-term adverse events after ACC, such as prolonged length of hospital stay (LOS), non-home discharges, and major complications. Thus, this study aimed to evaluate the effectiveness of ML in predicting ACC outcomes and develop an accessible, user-friendly tool for this purpose.

Materials and Methods

Ethical approval

This study was conducted in compliance with the principles of the Declaration of Helsinki. This study was deemed exempt from review by the institutional review board of Icahn School of Medicine at Mount Sinai (STUDY-22-01302-MOD002) because it involved analysis of deidentified patient data. Informed consent was not required given the retrospective analysis of previously collected deidentified information.

Study design and guidelines

This retrospective ML classification study (with binary categorical outcomes) for prognostic modeling after ACC followed the Transparent Reporting of Multivariable Prediction Models for Individual Prognosis or Diagnosis [14] and Journal of Medical Internet Research Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [15] guidelines.

Data source

The American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database was examined to identify patients undergoing ACC between 2014 and 2020. ACS-NSQIP is a national registry comprising data on major surgeries across all specialties performed on adult patients at >700 participating medical centers in the United States [16,17]. The structure and data collection methods of the database have been discussed extensively elsewhere [18].

Study population

The NSQIP database was queried to identify patients who met the following inclusion criteria: (1) Common Procedural Terminology (CPT) code 63081: vertebral corpectomy (vertebral body resection), partial or complete, anterior approach with decompression of the spinal cord and/or nerve root(s), (2) elective surgery, (3) surgery under general anesthesia, and (4) surgical subspecialty of neurosurgery or orthopedics. Patients were excluded based on the following criteria: (1) emergency surgery, (2) unclean wounds (wound classes 2–4), (3) patients with sepsis/shock/systemic inflammatory response syndrome 48 hours before surgery, (4) American Society of Anesthesiologists (ASA) physical status classification score ≥4 or unassigned, (5) LOS exceeding the 30-day postoperative period captured in the database, and (6) patients who died in hospital, left against medical advice, or discharged to hospice. Those undergoing concurrent cervical surgeries, thoracic and lumbar fusions, revision procedures, or intraspinal lesion resections were also excluded. Supplement 1 provides the CPT codes for these exclusionary surgeries.

Predictor variables

The predictor variables included in the analysis consisted of data that would have been known before the outcomes of interest following surgery. The predictors fell into four categories: (1) demographics, including age, sex, race, Hispanic ethnicity, body mass index (BMI), and transfer status; (2) comorbidities, namely, diabetes mellitus, smoking within a year, dyspnea, severe chronic obstructive pulmonary disease history, congestive heart failure, hypertension, acute kidney injury, currently requiring or on dialysis, disseminated cancer, steroids or immunosuppressants for a chronic disease, >10% weight loss over 6 months, bleeding disorders, ≥1 unit of red blood cell transfusion 72 hours before surgery, ASA physical status classification, and preoperative functional status; (3) preoperative laboratory values, including hematocrit, white blood cell count, platelet count, prothrombin time, international normalized ratio (INR), partial thromboplastin time (PTT), serum sodium, blood urea nitrogen, serum creatinine, serum albumin, total bilirubin, serum glutamic-oxaloacetic transaminase (SGOT), and serum alkaline phosphatase (ALP); and (4) operative details, such as surgical specialty, single- versus multiple-level surgery (CPT code 63082). Definitions for predictor variables are provided in the ACS-NSQIP participant user data file guides (www.facs.org/quality-programs/data-and-registries/acs-nsqip/participant-use-data-file).

Three outcomes were analyzed, namely, prolonged LOS, non-home discharge, and occurrence of any major complications. Prolonged LOS was defined as LOS exceeding 80% of the total patient cohort, corresponding to ≥4 days. Discharge destinations were classified as home versus non-home. Non-home discharges included facilities requiring post-hospital care such as rehabilitation, skilled nursing, separate acute inpatient stay, unskilled care, or senior living communities. Home discharges included those discharged to a facility serving as the residence or actual home of the patients. Major postoperative complications comprised one or more of the following: deep surgical site infections, wound dehiscence, need for reintubation, pulmonary embolism, prolonged mechanical ventilation >48 hours, renal insufficiency or failure requiring dialysis, cardiac arrest, myocardial infarction, bleeding requiring ≥1 unit of red blood cell transfusion, deep venous thrombosis, sepsis, and septic shock. The database also recorded less severe complications such as superficial surgical site infections, pneumonia, and urinary tract infections. However, these were not classified as major for this analysis. Any patients with missing data for prolonged LOS, non-home discharge, or major complications were excluded when analyzing that outcome.
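The prolonged-LOS cutoff described above (the value exceeded by roughly the top 20% of the cohort, here ≥4 days) can be derived from raw LOS values. The following sketch is illustrative only, with a hypothetical toy cohort, and is not the study's code:

```python
# Illustrative sketch: derive a "prolonged LOS" cutoff as the 80th-percentile
# length of stay, so stays at or above the cutoff form roughly the top 20%.
def prolonged_los_cutoff(los_days, pct=0.8):
    """Return the LOS value at the pct quantile of the sorted values."""
    ranked = sorted(los_days)
    idx = int(pct * len(ranked))          # rank index of the pct quantile
    return ranked[min(idx, len(ranked) - 1)]

# Hypothetical toy cohort: most stays are short, a few are long.
cohort = [1, 1, 2, 2, 2, 3, 3, 3, 4, 6]
cutoff = prolonged_los_cutoff(cohort)      # 4 days for this toy cohort
labels = [los >= cutoff for los in cohort]  # binary outcome per patient
```

In the actual cohort this procedure yields the ≥4-day threshold reported above.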

Data preprocessing and partition

Imputation methods were used to substitute missing values and prevent potential bias from excluding patients with incomplete data. For continuous variables, missing values were imputed using the k-nearest neighbor algorithm when <25% of data were missing [19]. Continuous variables with >25% missing values were excluded (prothrombin time, INR, PTT, serum albumin, total bilirubin, SGOT, and serum ALP). For categorical variables, missing values were filled as “unknown” or “unknown/other”; categorical variables with >25% missing values were likewise excluded (smoking within a year).
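The core idea of k-nearest-neighbor imputation is that a missing value is replaced by the average of that feature among the k rows most similar on the observed features. The study used an established implementation [19]; this is only a minimal plain-Python sketch of the idea, assuming every feature has at least k observed values:

```python
# Minimal k-nearest-neighbor imputation sketch (illustrative, not the
# study's code). Missing entries are None; distance is the mean squared
# difference over features that both rows have observed.
def knn_impute(rows, k=2):
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum((x - y) ** 2 for x, y in shared) / len(shared)

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            if val is None:
                # Rank other rows that observed feature j by distance, keep k.
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled

# Toy example: the missing value is filled from the two nearest rows.
rows = [[1.0, 2.0], [1.0, None], [3.0, 4.0], [1.0, 2.2]]
filled = knn_impute(rows)
```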

To address potential class imbalance in the training data, the Synthetic Minority Over-sampling Technique (SMOTE) was utilized before model training. SMOTE resolves skewed class distributions by synthetically generating new cases belonging to the minority class rather than duplicating available samples. This technique expands the number of cases from under-represented groups and improves model performance compared with simply replicating minority observations [20]. Applying SMOTE ensured adequate instances of all classes and prevented learning bias during training that tends to favor majority groups.
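The interpolation mechanism that distinguishes SMOTE from simple duplication can be shown compactly. The study used the published SMOTE algorithm [20]; the reduced sketch below only illustrates the core step, with a hypothetical minority class: each synthetic sample lies on the segment between a minority point and one of its minority-class nearest neighbors.

```python
import random

# Reduced SMOTE-style oversampling sketch (illustrative, not the published
# algorithm in full): synthesize minority samples by interpolating between
# a minority point and one of its k nearest minority neighbors.
def smote_like(minority, n_new, k=2, seed=0):
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

# Hypothetical minority class of three points in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote_like(minority, 5)
```

Because each new point is a convex combination of two existing minority points, the synthetic samples stay within the region the minority class already occupies rather than copying it verbatim.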

Model development and evaluation

TabPFN, a modified prior-data fitted network architecture, was used to develop the prediction models [21,22]. Prior-data fitted networks, including TabPFN, are pre-trained on synthetic data to emulate Bayesian inference on real-world information. TabPFN uses a meta-learning approach to adapt to new, unseen data by training across different datasets [22,23]. This pre-training enables TabPFN to learn complex patterns in tabular data and smoothly transition when applied to novel datasets.

A fivefold stratified cross-validation approach was used to evaluate model performance. The data were divided into five roughly equal partitions, balancing the ratio of outcome classes across folds (stratification) to ensure consistent class representation. In each fold, the initial training set (80% of total data) was further split into final training (70% of total data) and validation (10% of total data) subsets, producing a 70:10:20 ratio of training to validation to holdout testing. The validation subsets enabled sigmoid (Platt) calibration to match predicted risks with actual outcomes [24]. scikit-learn’s “CalibratedClassifierCV” class handled this calibration fitting on the validation data [25].
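The fold construction above can be sketched as follows. This is a hypothetical stdlib-only helper (not the study's code, which used scikit-learn): classes are dealt round-robin into five folds so each fold preserves the outcome ratio, and each fold's 80% training pool is then split 70:10 for final training versus calibration.

```python
import random

# Illustrative stratified 5-fold assignment (hypothetical helper).
def stratified_folds(labels, n_folds=5, seed=0):
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):      # deal class indices round-robin
            folds[pos % n_folds].append(i)
    return folds

# Toy imbalanced cohort: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels)

holdout = folds[0]                              # 20% test fold
train_pool = [i for f in folds[1:] for i in f]  # remaining 80%
cut = int(len(train_pool) * 7 / 8)              # 70:10 split of that 80%
train, calib = train_pool[:cut], train_pool[cut:]
```

For 100 patients this yields 70 training, 10 calibration, and 20 holdout indices per fold, with the 20% positive rate preserved in every fold.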

The calibrated TabPFN model generated predictions and probability estimates on each test fold. Overall performance was evaluated by aggregating the results across all folds because cross-validation allows reliable assessment of generalizable predictive ability. For enhanced interpretability, SHapley Additive Explanation (SHAP) values were used to determine the relative importance of features. The SHAP plot displayed the variables chosen hierarchically, with the most impactful at the top. To enable full transparency, the model code is publicly available in the study GitHub repository (https://github.com/mertkarabacak/NSQIP-ACC).

Model performance was assessed visually using a receiver operating characteristic (ROC) curve showing the tradeoff between true- and false-positive rates across thresholds and a precision–recall curve (PRC) illustrating the precision–recall balance. Several metrics were calculated: weighted precision (ratio of true positives to total positive predictions, weighted by class frequencies), weighted recall (ratio of true positives to total actual positives, weighted by class frequencies), F1 score (harmonic mean of precision and recall), area under the ROC curve (AUROC, evaluating class discrimination across thresholds), area under the PRC (AUPRC, reflecting the precision–recall tradeoff across thresholds), and Brier score (mean squared difference between predicted probabilities and actual outcomes). A 95% bootstrap confidence interval was calculated for each metric by resampling with replacement to generate 1,000 new datasets, with the 2.5th and 97.5th percentiles of the resulting bootstrap distribution taken as the confidence bounds.
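Two of the evaluation pieces above, the Brier score and the percentile bootstrap confidence interval, are simple enough to show directly. The sketch below uses hypothetical toy predictions, not the study's data or code:

```python
import random

# Brier score: mean squared difference between predicted probabilities
# and actual binary outcomes (lower is better).
def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Percentile bootstrap 95% CI: resample (prob, outcome) pairs with
# replacement, recompute the metric, take the 2.5th/97.5th percentiles.
def bootstrap_ci(probs, outcomes, n_boot=1000, seed=0):
    rng = random.Random(seed)
    n = len(probs)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        scores.append(brier([probs[i] for i in idx],
                            [outcomes[i] for i in idx]))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot)]

# Toy predictions for six patients.
probs = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3]
outcomes = [1, 0, 1, 0, 1, 0]
point = brier(probs, outcomes)
lo, hi = bootstrap_ci(probs, outcomes)
```

The same resampling scheme applies unchanged to AUROC, AUPRC, or any of the other reported metrics by swapping the metric function.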

Web application

A web application was developed to facilitate patient-level outcome predictions for each outcome. The application integrated these trained models with the implementation source code shared publicly on Hugging Face, which is a platform enabling model distribution. The application can be accessed at https://huggingface.co/spaces/MSHS-Neurosurgery-Research/NSQIP-ACC.

Descriptive statistics

For continuous variables following a normal distribution, means±standard deviations were reported, whereas medians (interquartile ranges) were presented for non-normally distributed variables. Categorical variables are presented as percentages.

Results

Initially, 5,363 patients were identified with the CPT code 63081 and the other inclusion criteria. The exclusion criteria were then applied sequentially, excluding 2,451 patients (Fig. 1). The analyses therefore included 2,909 patients for prolonged LOS (outcome incidence, 23.7%), 2,908 for non-home discharges (5.5%), and 2,912 for major complications (2.8%). The characteristics of the 2,912 patients before outcome-specific exclusions are presented in Table 1.

Fig. 1

Patient selection flowchart. ASA, American Society of Anesthesiologists; LOS, length of stay.

Patient characteristics

The models for predicting prolonged LOS, non-home discharges, and major complications demonstrated mean AUROCs of 0.802, 0.816, and 0.702, respectively. These findings highlight the models’ discriminatory capacities: fair (AUROC >0.7) for differentiating patients with major complications from those without and good (AUROC >0.8) for distinguishing between those with and without prolonged LOS and non-home discharges [26]. A comprehensive analysis of model performance metrics is presented in Table 2.

Models’ performance metrics

Figs. 2 and 3 illustrate the ROC curves and PRCs for the three outcomes of interest, respectively. Fig. 4 displays the SHAP bar plots for the models, demonstrating the collective contribution of the top 10 features to the predictions for each outcome. The bar length signifies the mean SHAP value, reflecting an attribute’s influence strength on the predicted outcome, with features organized by significance order from top to bottom. Single- versus multiple-level surgery, age, BMI, preoperative hematocrit, and ASA physical status emerged repetitively as the most important variables for each outcome.

Fig. 2

Models’ receiver operating characteristic (ROC) curves for the outcome of prolonged length of stay (A), non-home discharges (B), and major complications (C). AUROC, area under ROC curve.

Fig. 3

Models’ precision–recall curves (PRCs) for the outcome of prolonged length of stay (A), non-home discharges (B), and major complications (C). AUPRC, area under PRC.

Fig. 4

The ten most important features and their mean SHapley Additive Explanations (SHAP) values for the model predicting prolonged length of stay (A), non-home discharges (B), and major complications (C). BMI, body mass index; ASA, American Society of Anesthesiologists; WBC, white blood cell.

Discussion

This study aimed to develop ML models for predicting adverse short-term ACC outcomes and improve their accessibility by creating a web application for healthcare professionals. In this tool, patient information is entered to generate personalized risk estimates for each outcome, potentially offering an advantage over traditional methods, often relying on qualitative risk communication or quantitative estimates based on clinicians’ experiences and population statistics. By providing patient-specific predictions, our models can identify patients at increased risk for postoperative adverse events, facilitating more informed discussions of risks during preoperative counseling. In addition, the web application has potential implications for quality assurance by allowing the review of cases with adverse events in patients previously deemed at low risk, potentially uncovering institutional deficiencies and guiding policies to optimize resource allocation and improve patient outcomes.

The approach described herein serves as a robust tool to estimate outcome probabilities for patients undergoing ACC while improving the interpretability of these predictions. As presented in Fig. 4, the SHAP bar charts provide a global perspective on how the models arrive at their predictions overall. Rather than focusing on individual predictions, this holistic view examines patterns, trends, and relationships identified by the models across the entire dataset. In contrast, the SHAP visualizations incorporated into our web application offer granular local interpretations tailored to each patient. This feature ensures that users gain clear insight into how predictions relate to particular variable values for a patient. Therefore, our methodology enables a level of understanding of the factors influencing outcomes that was not feasible with previous models or platforms.

Although the current web application offers a useful interface to predict adverse ACC outcomes, it is intended as a research tool and should not presently direct clinical decisions. Further validation across varied patient groups and institutions is imperative to confirm its predictive precision. We hope this tool provides an initial foundation for more exhaustive models integrating extra details such as imaging results and granular clinical information to further improve accuracy and clinical utility. As with any predictive method, the generated estimates must be interpreted within the whole clinical context of each patient to enable personalized surgical counseling and planning.

The limitations share similarities with those associated with existing online prognostic models. First, the patient population within the ACS-NSQIP database may not fully reflect real-world patients with ACC. Biases could arise relating to the hospitals because participating institutions potentially possess superior infrastructures and resources to average facilities. In addition, patients in the database may have different health states, age distribution, or socioeconomic backgrounds from the general population. Huffman et al. [27] validated the use of ACS-NSQIP for examining postoperative outcomes, and such factors affect the generalizability of findings despite the overall dependability of the data source. Second, database studies are invariably prone to inaccuracies from coding errors or other misclassifications. Despite the widespread adoption of the ACS-NSQIP, few studies have investigated its coding precision for specific domains. For example, Rolston et al. [28] identified multiple inconsistencies in neurosurgical CPT designations. Finally, we did not establish causal connections between patient attributes and outcomes. Our models are not suited for causal inference and do not provide insight into mechanisms behind the observed predictor–outcome associations. We strongly caution against causal interpretations stemming from our results.

Conclusions

This study has considerably enhanced the prediction of ACC outcomes by implementing advanced ML techniques. A major contribution is the creation of an accessible web application, highlighting the practical value of the developed models. Our findings imply that ML can serve as an invaluable supplementary tool to stratify patient risk for this procedure and can predict diverse postoperative complications. This methodology has the potential to transform patient counseling discussions toward a more personalized, patient-centered, data-driven approach. Overall, this work signifies a meaningful step forward in enabling precision medicine for spinal procedures.

Key Points

  • The study demonstrated the effectiveness of machine learning models in predicting outcomes of anterior cervical corpectomy, with models achieving good discriminatory ability in predicting prolonged length of hospital stay and non-home discharges and fair discriminatory ability in predicting major complications.

  • Important variables influencing postoperative outcomes included single- versus multiple-level surgery, age, body mass index, preoperative hematocrit levels, and American Society of Anesthesiologists physical status.

  • An accessible web application was created so that healthcare professionals can use these machine learning models to predict each outcome at the patient level and to assist in preoperative counseling and risk stratification.

  • The study highlights the potential of machine learning to contribute to precision medicine in spinal surgery by enabling personalized, data-driven outcome prediction.

Acknowledgments

Data for this study were obtained from the American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP). ACS NSQIP participant use file access is a benefit of NSQIP participation and is reserved for staff at participating and active ACS NSQIP hospitals. ACS policies do not allow access or sale to nonparticipants. Additional information is provided in https://www.facs.org/quality-programs/data-and-registries/acs-nsqip/participant-use-data-file/. The American College of Surgeons National Surgical Quality Improvement Program and the hospitals participating in the American College of Surgeons National Surgical Quality Improvement Program are the source of the data used herein; they have not verified and are not responsible for the statistical validity of the data analysis or the conclusions derived by the authors. The source code for preprocessing and analyzing the data is available on GitHub (https://github.com/mertkarabacak/NSQIP-ACC).

Notes

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Author Contributions

Conceptualization: MK, AJS, MTC, KM. Methodology: MK, KM. Data curation: MK. Formal analysis: MK. Visualization: MK. Project administration: MK, KM. Writing–original draft preparation: MK. Writing–review and editing: AJS, MTC, KM. Supervision: KM. Final approval of the manuscript: all authors.

Supplementary Materials

Supplementary materials are available from https://doi.org/10.31616/asj.2024.0048.

Supplement 1. CPT codes used for exclusion.

asj-2024-0048-Supplementary-Table-1.pdf

References

1. Fehlings MG, Wilson JR, Kopjar B, et al. Efficacy and safety of surgical decompression in patients with cervical spondylotic myelopathy: results of the AOSpine North America prospective multi-center study. J Bone Joint Surg Am 2013;95:1651–8.
2. Al-Tamimi YZ, Guilfoyle M, Seeley H, Laing RJ. Measurement of long-term outcome in patients with cervical spondylotic myelopathy treated surgically. Eur Spine J 2013;22:2552–7.
3. Lau D, Chou D, Mummaneni PV. Two-level corpectomy versus three-level discectomy for cervical spondylotic myelopathy: a comparison of perioperative, radiographic, and clinical outcomes. J Neurosurg Spine 2015;23:280–9.
4. Fountas KN, Kapsalaki EZ, Nikolakakos LG, et al. Anterior cervical discectomy and fusion associated complications. Spine (Phila Pa 1976) 2007;32:2310–7.
5. Oh MC, Zhang HY, Park JY, Kim KS. Two-level anterior cervical discectomy versus one-level corpectomy in cervical spondylotic myelopathy. Spine (Phila Pa 1976) 2009;34:692–6.
6. Shamji MF, Massicotte EM, Traynelis VC, Norvell DC, Hermsmeyer JT, Fehlings MG. Comparison of anterior surgical options for the treatment of multilevel cervical spondylotic myelopathy: a systematic review. Spine (Phila Pa 1976) 2013;38(22 Suppl 1):S195–209.
7. Banno F, Zreik J, Alvi MA, Goyal A, Freedman BA, Bydon M. Anterior cervical corpectomy and fusion versus anterior cervical discectomy and fusion for treatment of multilevel cervical spondylotic myelopathy: insights from a national registry. World Neurosurg 2019;132:e852–61.
8. Son S, Lee SG, Yoo CJ, Park CW, Kim WK. Single stage circumferential cervical surgery (selective anterior cervical corpectomy with fusion and laminoplasty) for multilevel ossification of the posterior longitudinal ligament with spinal cord ischemia on MRI. J Korean Neurosurg Soc 2010;48:335–41.
9. Kim JS, Merrill RK, Arvind V, et al. Examining the ability of artificial neural networks machine learning models to accurately predict complications following posterior lumbar spine fusion. Spine (Phila Pa 1976) 2018;43:853–60.
10. Jain D, Durand W, Burch S, Daniels A, Berven S. Machine learning for predictive modeling of 90-day readmission, major medical complication, and discharge to a facility in patients undergoing long segment posterior lumbar spine fusion. Spine (Phila Pa 1976) 2020;45:1151–60.
11. Etzel CM, Veeramani A, Zhang AS, et al. Supervised machine learning for predicting length of stay after lumbar arthrodesis: a comprehensive artificial intelligence approach. J Am Acad Orthop Surg 2022;30:125–32.
12. Zhang AS, Veeramani A, Quinn MS, Alsoof D, Kuris EO, Daniels AH. Machine learning prediction of length of stay in adult spinal deformity patients undergoing posterior spine fusion surgery. J Clin Med 2021;10:4074.
13. Gowd AK, O’Neill CN, Barghi A, O’Gara TJ, Carmouche JJ. Feasibility of machine learning in the prediction of short-term outcomes following anterior cervical discectomy and fusion. World Neurosurg 2022;168:e223–32.
14. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med 2015;13:1.
15. Luo W, Phung D, Tran T, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res 2016;18:e323.
16. Khuri SF, Henderson WG, Daley J, et al. The patient safety in surgery study: background, study design, and patient populations. J Am Coll Surg 2007;204:1089–102.
17. Hall BL, Hamilton BH, Richards K, Bilimoria KY, Cohen ME, Ko CY. Does surgical quality improve in the American College of Surgeons National Surgical Quality Improvement Program: an evaluation of all participating hospitals. Ann Surg 2009;250:363–76.
18. American College of Surgeons. ACS National Surgical Quality Improvement Program [Internet] Chicago (IL): American College of Surgeons; c2024. [cited 2024 Jan 20]. Available from: https://www.facs.org/quality-programs/data-and-registries/acs-nsqip/about-acs-nsqip/.
19. Beretta L, Santaniello A. Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak 2016;16(Suppl 3):74.
20. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002;16:321–57.
21. Schaul T, Schmidhuber J. Metalearning. Scholarpedia [Internet] 2010. [cited 2024 Jan 20]. 54650. Available from: http://www.scholarpedia.org/article/Metalearning.
22. Hollmann N, Muller S, Eggensperger K, Hutter F. Tabpfn: a transformer that solves small tabular classification problems in a second. arXiv [Preprint] 2022. Jul. 5. https://doi.org/10.48550/arXiv.2207.01848.
23. Muller S, Hollmann N, Arango SP, Grabocka J, Hutter F. Transformers can do Bayesian inference. arXiv [Preprint] 2021. Dec. 20. https://doi.org/10.48550/arXiv.2112.10510.
24. Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Larg Margin Classif 1999;10:61–74.
25. CalibratedClassifierCV [Internet] place unknown: scikit-learn; c2024. [cited 2024 Jan 10]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html.
26. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
27. Huffman KM, Cohen ME, Ko CY, Hall BL. A comprehensive evaluation of statistical reliability in ACS NSQIP profiling models. Ann Surg 2015;261:1108–13.
28. Rolston JD, Han SJ, Chang EF. Systemic inaccuracies in the National Surgical Quality Improvement Program database: implications for accuracy and validity for neurosurgery outcomes research. J Clin Neurosci 2017;37:44–7.

Table 1

Patient characteristics

Characteristic Category Value
Age (yr) 57 (IQR, 17)
Sex Female 1,429 (49.0)
Male 1,486 (51.0)
Race White 2,177 (74.7)
Black 431 (14.8)
Asian 69 (2.4)
Other/unknown 238 (8.2)
Hispanic ethnicity No 2,575 (88.3)
Yes 157 (5.4)
Unknown 183 (6.3)
Body mass index (kg/m2) 29.6 (IQR, 7.8)
Transfer status Not transferred 2,878 (98.7)
Transferred 37 (1.3)
Diabetes mellitus No 2,365 (81.1)
Yes 550 (18.9)
Dyspnea No 2,755 (94.5)
Yes 160 (5.5)
Severe chronic obstructive pulmonary disease history No 2,776 (95.2)
Yes 139 (4.8)
Congestive heart failure No 2,912 (99.9)
Yes 3 (0.1)
Hypertension No 1,516 (52.0)
Yes 1,399 (48.0)
Acute kidney injury No 2,914 (>99.9)
Yes 1 (<0.1)
Currently requiring or on dialysis No 2,904 (99.6)
Yes 11 (0.4)
Disseminated cancer No 2,898 (99.4)
Yes 17 (0.6)
Steroids or immunosuppressants for a chronic disease No 2,809 (96.4)
Yes 106 (3.6)
>10% weight loss over 6 months No 2,907 (99.7)
Yes 8 (0.3)
Bleeding disorder No 2,891 (99.2)
Yes 24 (0.8)
≥1 unit of RBC transfusion in the 72 hours preceding surgery No 2,914 (>99.9)
Yes 1 (<0.1)
ASA classification 1 89 (3.0)
2 1,367 (47.0)
3 1,459 (50.0)
Preoperative functional status Independent 2,851 (97.8)
Partially dependent 53 (1.8)
Totally dependent 5 (0.2)
Unknown 6 (0.2)
Preoperative serum sodium 139.8 (IQR, 3)
Preoperative serum blood urea nitrogen 15.0±5.8
Preoperative serum creatinine 0.89±0.26
Preoperative white blood cell count (×1,000) 6.9±2.7
Preoperative hematocrit 41.4 (IQR, 5.2)
Preoperative platelet count (×1,000) 242 (IQR, 84.5)
Surgical specialty Neurosurgery 2,279 (78.2)
Orthopedics 636 (21.8)
Single or multiple level surgery Single 1,406 (48.2)
Multiple 1,509 (51.8)

Values are presented as median (interquartile range), number (%), or mean±standard deviation.

RBC, red blood cell; ASA, American Society of Anesthesiologists.

Table 2

Models’ performance metrics

Weighted precision Weighted recall F1 score Accuracy AUROC AUPRC Brier score
Prolonged LOS 0.822 (0.782–0.861) 0.882 (0.876–0.888) 0.178 (0.032–0.316) 0.882 (0.876–0.888) 0.802 (0.770–0.833) 0.437 (0.409–0.464) 0.091 (0.089–0.093)
Non-home discharge 0.918 (0.900–0.937) 0.946 (0.945–0.949) 0.161 (0.053–0.275) 0.946 (0.945–0.949) 0.816 (0.775–0.857) 0.392 (0.349–0.433) 0.045 (0.042–0.049)
Major complications 0.944 (0.943–0.945) 0.972 (0.971–0.973) 0.144 (0.021–0.179) 0.972 (0.971–0.973) 0.702 (0.610–0.754) 0.214 (0.156–0.273) 0.025 (0.024–0.026)

Values are presented as metric estimate (95% bootstrap confidence interval).

AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; LOS, length of stay.