Performance and clinical implications of machine learning models for detecting cervical ossification of the posterior longitudinal ligament: a systematic review
Article information
Abstract
Ossification of the posterior longitudinal ligament (OPLL) is a significant spinal condition that can lead to severe neurological deficits. Recent advancements in machine learning (ML) and deep learning (DL) have led to the development of promising tools for the early detection and diagnosis of OPLL. This systematic review evaluated the diagnostic performance and clinical implications of ML and DL models in OPLL detection. A systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. PubMed/Medline and Scopus databases were searched for studies published between January 2000 and September 2024. Eligible studies included those utilizing ML or DL models for OPLL detection using imaging data. All studies were assessed for the risk of bias using appropriate tools. The key performance metrics, including accuracy, sensitivity, specificity, and area under the curve (AUC), were analyzed. Eleven studies, comprising a total of 6,031 patients, were included. The ML and DL models demonstrated high diagnostic performance, with accuracy rates ranging from 69.6% to 98.9% and AUC values up to 0.99. Convolutional neural networks and random forest models were the most frequently used approaches. The overall risk of bias was moderate, and concerns were primarily related to participant selection and missing data. In conclusion, ML and DL models show great potential for accurate detection of OPLL, particularly when integrated with imaging techniques. However, to ensure clinical applicability, further research is warranted to validate these findings in more extensive and diverse populations.
Introduction
Ossification of the posterior longitudinal ligament (OPLL) is characterized by abnormal ligament calcification along the spinal column, predominantly affecting the cervical spine [1,2]. This progressive condition can lead to spinal canal stenosis and subsequently myelopathy, resulting in severe neurological deficits such as motor and sensory impairments [3,4]. The prevalence of OPLL varies geographically, with higher incidence rates observed in East Asian populations than in Western countries. Early and accurate diagnosis is crucial to prevent irreversible neurological damage and plan appropriate surgical interventions [1,2,5,6].
Current standard imaging modalities, such as plain radiography, computed tomography (CT), and magnetic resonance imaging (MRI), are commonly used for diagnosing OPLL [2,7,8]. However, these techniques have limited sensitivity and specificity, particularly in detecting early or subtle cases. Advanced imaging techniques such as CT myelography offer better diagnostic accuracy but are invasive and associated with higher radiation exposure [9]. Consequently, noninvasive, automated diagnostic tools are increasingly needed to enhance the accuracy and efficiency of OPLL detection.
Machine learning (ML) and deep learning (DL) models have demonstrated significant potential in medical imaging, enhancing diagnostic capabilities across various clinical areas (Fig. 1) [10–12]. Recently, ML and DL have been increasingly applied to spinal conditions, including OPLL, to improve diagnostic accuracy and reduce the burden on healthcare systems. These models, particularly convolutional neural networks (CNNs), can analyze complex imaging data, identify subtle patterns, and differentiate OPLL from other spinal pathologies accurately [11,13–15]. Several studies have demonstrated the potential of ML and DL models in OPLL detection, for example, neural networks can detect OPLL on plain cervical radiographs, highlighting the utility of the model in clinical screening [16,17]. However, several factors hinder the adoption of ML and DL models in clinical practice, including heterogeneity in study designs, small sample sizes, and the lack of external validation in diverse populations [16–18]. Therefore, the diagnostic performance and clinical utility of these models in OPLL detection must be comprehensively evaluated.
This systematic review aimed to provide an in-depth analysis of the diagnostic performance of ML and DL models in OPLL detection, assessing their accuracy, sensitivity, specificity, and clinical implications and highlighting the strengths and limitations of existing research. The findings will guide future research directions and present the potential clinical applications of ML and DL in the early diagnosis and management of OPLL.
Materials and Methods
This study was conducted in accordance with the Declaration of Helsinki and with approval from the Ethics Committee and Institutional Review Board (IRB) of University of Phayao (IRB approval no., HREC-UP-HSST 1.1/003/68). The data used in this research were acquired from a public resource.
Literature search strategy
A systematic literature search was conducted across PubMed/Medline, Scopus, and Google Scholar databases to identify studies evaluating the performance of ML and DL models in diagnosing OPLL. The search strategy followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines using a combination of Medical Subject Headings terms and relevant keywords [19]. The search terms included the following: “ossification of the posterior longitudinal ligament,” “OPLL,” “cervical OPLL,” “cervical spine ossification,” “machine learning,” “deep learning,” “artificial intelligence,” “convolutional neural network,” “CNN,” “neural network,” “random forest,” “support vector machine,” “radiography,” “X-ray,” “computed tomography,” “CT,” and “magnetic resonance imaging.” The search was restricted to studies published in English between January 2000 and September 2024. Additional relevant studies were identified by manually searching the reference lists of the included articles. Two reviewers independently screened the studies; disagreements were resolved through discussion, and a third reviewer was consulted when necessary. This process ensured consistency and transparency in the study selection. The PRISMA workflow diagram is presented in Fig. 2.
Inclusion and exclusion criteria
Original research articles evaluating the diagnostic performance of ML or DL models in OPLL detection were eligible. Studies were required to involve patients with confirmed OPLL or related spinal conditions, use imaging data (e.g., from radiography, CT, or MRI) for model development and evaluation, and report diagnostic performance metrics, such as accuracy, sensitivity, specificity, and area under the curve (AUC). Reviews, meta-analyses, case reports, conference abstracts, editorials, studies not primarily focused on the diagnostic application of ML or DL models for OPLL detection, and those that did not report relevant diagnostic performance metrics were excluded.
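For reference, the four metrics used as eligibility criteria derive from a binary confusion matrix. The following minimal sketch (with invented counts and scores, purely for illustration) shows how accuracy, sensitivity, specificity, and AUC are computed; the rank-based AUC formulation used here is equivalent to the area under the ROC curve:

```python
# Illustrative computation of the diagnostic metrics named above
# (accuracy, sensitivity, specificity, AUC). All data are hypothetical.

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels (1 = OPLL present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate (recall)
        "specificity": tn / (tn + fp),  # true-negative rate
    }

def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney U) formulation: the probability
    that a randomly chosen positive case scores higher than a randomly
    chosen negative case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 6 patients, model scores thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(metrics(y_true, y_pred))  # accuracy, sensitivity, specificity all 2/3 here
print(auc(y_true, scores))      # 8/9 ~ 0.89
```

In practice, the included studies report these metrics from held-out validation or test sets rather than the training data, which is what makes them comparable across models.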
Data extraction
Data were independently extracted by two reviewers using a standardized extraction form. Extracted information included study characteristics (author, year of publication, country, study design, sample size, and patient demographics), model information (ML or DL model type, imaging modality used, and training/validation methods), diagnostic performance metrics (accuracy, sensitivity, specificity, and AUC), and key findings and limitations. Disagreements were resolved by consensus or involving a third reviewer to ensure data accuracy and completeness.
Assessment of the risk of bias
The risk of bias for each included study was assessed using the Risk of Bias in Nonrandomized Studies of Interventions (ROBINS-I) tool [20]. This tool evaluates the risk of bias across seven domains: biases due to confounding, selection of participants, classification of interventions, deviations from intended interventions, missing data, measurement of outcomes, and selection of reported results. Each domain was rated as “low,” “moderate,” “serious,” or “critical” risk of bias based on predefined criteria. The overall risk of bias for each study was determined by the highest level of risk identified in any single domain. Studies with a “serious” or “critical” risk of bias were considered to have significant limitations that could affect the validity of their findings.
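The "highest rating wins" aggregation rule described above can be expressed compactly. The sketch below is our own illustration of that rule, not part of the ROBINS-I tool itself; the domain names follow the tool, and the severity ordering constant is an assumption of the sketch:

```python
# Sketch of the ROBINS-I aggregation rule described above: the overall
# risk of bias equals the most severe rating across the seven domains.

SEVERITY = ["low", "moderate", "serious", "critical"]  # least -> most severe

def overall_robins_i(domain_ratings):
    """domain_ratings: dict mapping each ROBINS-I domain to a rating."""
    return max(domain_ratings.values(), key=SEVERITY.index)

# Hypothetical study: two 'serious' domains dominate the overall rating.
ratings = {
    "confounding": "moderate",
    "selection of participants": "serious",
    "classification of interventions": "low",
    "deviations from intended interventions": "low",
    "missing data": "serious",
    "measurement of outcomes": "low",
    "selection of reported results": "moderate",
}
print(overall_robins_i(ratings))  # -> serious
```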
Results
Study characteristics
This systematic review included a total of 11 studies involving 6,031 patients, and all focused on the diagnostic performance of ML and DL models for OPLL detection. These studies were conducted between 2021 and 2024 and originated from Japan, South Korea, China, and Israel. Most studies employed a retrospective design, and only one was a prospective multicenter trial. The sample sizes varied from 100 to 901 patients, employing diverse imaging modalities, including plain radiography, CT, and MRI. Demographics and characteristics of the studies are shown in Table 1 [7,16,17,21–28].
Risk of bias analysis
The risk of bias across the included studies was assessed using the ROBINS-I tool [20]. The overall risk of bias was moderate, with common issues related to participant selection, confounding, and missing data. The studies by Murata et al. [16] and Shemesh et al. [7] exhibited serious concerns because of biases in participant selection and missing data, potentially affecting the generalizability of the studies. Although most studies demonstrated a low risk of bias in the intervention classification and outcome measurement, they were limited by small sample sizes and lack of external validation, which may affect the validity of their findings. A detailed assessment of the risk of bias for each study is presented in Fig. 3 and Table 2 [7,16,17,21–28].

Results of the risk of bias analysis using the Risk of Bias in Nonrandomized Studies of Interventions (ROBINS-I) tool.
Performance of ML and DL models in OPLL detection
The included studies highlighted the high diagnostic performance of ML and DL models in OPLL detection (Table 3) [7,16,17,21–28]. The reported accuracy ranged from 69.6% to 98.9%, with AUC values up to 0.99. Murata et al. [16] reported the highest accuracy of 98.9% using a Residual Neural Network (ResNet12) on cervical lateral X-ray images, achieving a sensitivity of 97.0%, specificity of 99.4%, and AUC of 0.99. Tamai et al. [22] demonstrated an AUC of 0.94 with the EfficientNetB2 CNN model, outperforming experienced spine surgeons. Maki et al. [21] utilized various ML models, such as LightGBM and XGBoost, to predict surgical outcomes in patients with OPLL, and the random forest model showed an AUC of 0.75 at the 2-year follow-up. Chae et al. [25] reported that their DL model significantly improved radiologist performance in diagnosing OPLL, achieving a sensitivity of 91% and an AUC of 0.851. Fig. 4 summarizes the diagnostic performance of ML models in detecting cervical OPLL across multiple studies. Fig. 4A highlights the high accuracy of models such as ResNet12 and ResNet101 [7,16,21–24,28]. Fig. 4B illustrates the balance between sensitivity and specificity, and certain models achieved near-perfect specificity [7,16,21–24,28]. Fig. 4C emphasizes the AUC, where models such as ResNet101 demonstrate robust diagnostic performance [7,16,21–24,28]. Fig. 4D provides a comprehensive heatmap summarizing the accuracy, sensitivity, specificity, and AUC, showcasing the relative strengths and limitations of each model [7,16,17,21–28].

Performance of artificial neural networks in detecting cervical ossification of the posterior longitudinal ligament and clinical implications

Performance metrics of machine learning models for ossification of the posterior longitudinal ligament (OPLL) detection. (A) Accuracy across studies, (B) sensitivity and specificity comparison, (C) area under the curve (AUC) of machine learning models, and (D) heatmap summarizing the performance metrics across studies.
Clinical implications of ML and DL models
ML and DL models have shown significant clinical potential for the early detection and management of OPLL (Table 3) [7,16,17,21–28]. High-accuracy models, such as those developed by Murata et al. [16] and Miura et al. [17], could be integrated into clinical workflows for screening, particularly in primary care and emergency settings, reducing the need for invasive and costly imaging modalities such as CT or MRI. Predictive models, such as those developed by Maki et al. [21] and Ito et al. [26], could aid in preoperative planning and risk stratification, enabling clinicians to identify high-risk cases and optimize surgical strategies. In addition, DL models used by Chae et al. [25] and Shemesh et al. [7] enhanced the diagnostic performance of radiologists, particularly in complex cases. They could be employed to support less experienced clinicians, thus improving the overall diagnostic accuracy and patient care.
Limitations and future directions
Despite the promising outcomes, several limitations must be addressed (Table 3) [7,16,17,21–28]. Most studies were conducted in single-center settings with limited sample sizes, primarily involving East Asian populations, which may restrict the generalizability of the findings to other demographics. Moreover, many studies lacked external validation and prospective designs, underscoring the need for larger multicenter trials to verify the clinical utility of these models. Future studies should validate these models in diverse populations and integrate them into clinical practice to assess their actual effect on patient outcomes and healthcare workflows.
Discussion
Recent advancements in artificial intelligence (AI), particularly ML and DL, have greatly improved the diagnostic accuracy of medical imaging. Techniques such as ResNet and gradient-weighted class activation mapping (Grad-CAM) are widely used. ResNet overcomes challenges in training deep models using skip connections, enabling better feature extraction for tasks such as detecting OPLL. Grad-CAM enhances model interpretability by creating heatmaps that highlight important regions in images, offering clinicians valuable insights into AI decision-making. These methods are transforming OPLL detection and diagnosis, bridging gaps in radiological expertise. The findings of this systematic review indicate that ML and DL models demonstrate high diagnostic performance in OPLL detection using various imaging modalities, such as radiography, CT, and MRI. Models such as ResNet12 and EfficientNetB2 have achieved remarkable diagnostic accuracies, often surpassing the performance of experienced spine surgeons [16,22]. Murata et al. [16] reported an accuracy of 98.9% and an AUC of 0.99 using a Residual Neural Network on cervical radiographs, suggesting the potential utility of the model for clinical screening and early detection of OPLL. Tamai et al. [22] reported that the DL model, based on EfficientNetB2, achieved an AUC of 0.94, outperforming spine surgeons in diagnostic accuracy. Such high levels of performance underscore the capability of DL models to detect subtle patterns in imaging data that may be missed by human observers.
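The skip connection that lets ResNet train deep models can be illustrated with a toy residual block: the block's output is relu(F(x) + x), so the identity path carries features (and gradients) past the weighted branch F. The sketch below is a schematic, pure-Python illustration of this idea only, not the actual ResNet12 or ResNet101 architectures used in the reviewed studies:

```python
# Toy residual block: out = relu(W @ x + x). With the residual branch
# contributing nothing (zero weights), the block reduces to relu(x),
# which is why very deep stacks of such blocks remain trainable.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(W, x):
    """The identity path (+ x) skips the weighted branch entirely."""
    fx = matvec(W, x)  # residual branch F(x)
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, -2.0, 3.0]
W_zero = [[0.0] * 3 for _ in range(3)]
# Zero weights: the block passes the input straight through relu.
print(residual_block(W_zero, x))  # -> [1.0, 0.0, 3.0]
```

Grad-CAM, by contrast, operates after training: it weights the feature maps of a convolutional layer by the gradient of the class score to produce the heatmaps the text describes, which is why it pairs naturally with residual CNNs in the reviewed OPLL studies.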
The clinical relevance of these findings is substantial, particularly considering the challenges associated with the diagnosis of OPLL during its early stages. Traditional imaging modalities, such as plain radiographs and MRI, often struggle to detect early or subtle OPLL cases, which can lead to delayed diagnosis and progression of neurological symptoms [1,2,29]. DL models can analyze complex imaging data with high precision, facilitating earlier and more accurate diagnosis. This capability is valuable in primary care and emergency settings, where access to specialized spinal imaging and expertise may be limited. By integrating these models into routine clinical workflows, healthcare providers could improve the accuracy and efficiency of OPLL diagnosis, thereby reducing the need for more invasive and costly imaging techniques, such as CT myelography.
Moreover, several studies in this review explored the use of ML models to predict surgical outcomes and complications in patients with OPLL [22,25,26]. Maki et al. [21] employed a combination of LightGBM, XGBoost, and random forest models to predict clinically significant improvements following surgery for cervical OPLL. The models demonstrated good predictive ability, with an AUC of 0.75 for the random forest model at the 2-year follow-up [21]. Such predictive models could be highly beneficial in clinical practice, aiding surgeons in preoperative planning and patient counseling. By identifying key prognostic factors, these models can help clinicians better stratify surgical risk, optimize patient selection for surgical interventions, and develop personalized treatment plans [21].
Several limitations and challenges must be resolved before these models can be widely adopted in clinical practice. A major drawback of the current literature is the predominance of single-center retrospective studies, which may introduce biases related to patient selection and data heterogeneity. Furthermore, most studies have been conducted within East Asian populations, which limits the generalizability of the findings to other demographic groups. The ML and DL models used, along with the performance metrics reported across studies, vary considerably, which hinders identification of the most effective approach for OPLL detection. This heterogeneity highlights the need for standardized methodologies and performance metrics in future research.
Another critical challenge is the interpretability of ML and DL models. Although these models can achieve high levels of diagnostic accuracy, their decision-making processes are often opaque, making it challenging for clinicians to understand how specific diagnoses are determined. This lack of clarity can hinder the acceptance of these tools in clinical settings, where explainability is essential for ensuring trust and facilitating shared decision-making between clinicians and patients. To improve the clinical applicability of these models, future investigations should focus on developing more interpretable algorithms, possibly by using visual explanation techniques, such as Grad-CAM, or incorporating hybrid models that combine DL with traditional statistical methods [17].
The integration of ML and DL models into clinical practice also poses significant logistical and technical challenges. Their implementation requires a robust infrastructure, including high-quality annotated imaging datasets and computational resources for model training and validation [11,26,30]. Moreover, healthcare providers must be trained for effective use of these tools, and clinical workflows may need to be adapted to incorporate automated diagnostic support. Addressing these challenges will require collaboration among clinicians, data scientists, and policymakers to develop practical strategies for the deployment and maintenance of ML and DL tools in healthcare settings [11]. To facilitate the transition of ML and DL models for OPLL detection into clinical practice, several critical aspects must be addressed, such as acquisition of regulatory approval, integration with hospital systems, and development of practical implementation strategies. Compliance with local and international regulations, such as the U.S. Food and Drug Administration guidelines and the European Union’s Medical Device Regulation, is essential to ensure the safety, performance, and explainability of AI tools. Seamless integration with existing hospital infrastructures, including compatibility with electronic health record (EHR) systems, radiological workflows, and data protection laws, is necessary for successful adoption. Practical strategies, such as phased implementation, clinician training, and continuous feedback mechanisms, can help mitigate implementation challenges. Moreover, cost–benefit analyses and stakeholder engagement are crucial to overcoming barriers such as high initial costs and resistance to new technologies. By addressing these aspects, ML and DL models can significantly enhance the diagnostic accuracy and efficiency of OPLL detection and ensure their safe and reliable deployment in clinical settings.
ML and DL models offer considerable advantages in OPLL diagnosis and management. These models improve diagnostic accuracy by facilitating earlier and more reliable identification of OPLL, which reduces diagnostic errors and improves patient outcomes. They also streamline clinical workflows by automating time-consuming diagnostic processes, thereby shortening the time required for image interpretation and decision-making. Furthermore, ML and DL models improve surgical planning by identifying patient-specific risks and predicting surgical outcomes, leading to better resource allocation and increased patient satisfaction. However, several challenges must be addressed before fully integrating these technologies into clinical practice. ML and DL systems require substantial initial investment, including the costs of acquiring advanced computational hardware, integrating these systems into existing hospital infrastructures, and ensuring compatibility with electronic health record systems. Comprehensive training programs for healthcare professionals, such as radiologists and spine surgeons, are essential to maximize the utility of these models, requiring both time and financial resources. In addition, ongoing costs for system maintenance, software updates, and quality assurance processes add to the long-term financial burden. Despite these challenges, the economic implications of ML and DL integration are encouraging. In high-volume centers, the initial costs can be justified by the long-term benefits, such as improved diagnostic efficiency and optimized surgical planning. Cost–benefit analyses highlight the financial benefits of these systems in institutions attending to large patient caseloads, where decreases in diagnostic errors and resource utilization translate into significant savings. 
Over time, ML and DL models may also reduce overall healthcare costs by optimizing workflows and decreasing reliance on expensive imaging modalities, offering a sustainable solution to enhancing patient care.
Future studies should focus on large-scale multicenter investigations to validate the diagnostic performance of these models across diverse populations and clinical settings. Such studies should strive to standardize imaging protocols and data preprocessing methods to improve the comparability and reproducibility of the results. Moreover, prospective studies are needed to assess the effects of ML and DL models on clinical outcomes, including diagnostic accuracy, treatment decision-making, and patient satisfaction. Evaluating the cost-effectiveness of these models will be crucial for their wider acceptance, as healthcare systems increasingly prioritize interventions that offer high value relative to their costs.
Conclusions
ML and DL models show significant promise for OPLL detection and management, with potential applications ranging from screening to surgical planning. However, various challenges related to data availability, model interpretability, and clinical integration must be addressed before these tools can be widely implemented. Future investigations should prioritize extensive validation studies and the development of clear, user-friendly models that can seamlessly integrate into clinical practice, ultimately improving the care of patients with OPLL.
Key Points
Machine learning (ML) and deep learning (DL) models demonstrated high diagnostic accuracy for detecting cervical ossification of the posterior longitudinal ligament, with accuracy rates ranging from 69.6% to 98.9% and area under the curve values up to 0.99.
Convolutional neural networks and random forest models were the most frequently used ML/DL approaches. These models utilized imaging modalities such as plain radiography, computed tomography, and magnetic resonance imaging.
ML/DL models have the potential to enhance diagnostic accuracy, reduce reliance on invasive imaging methods, and support clinical decision-making in primary care, emergency, and surgical planning settings.
Standardization of methodologies, large-scale multicenter validation studies, and improved model interpretability are critical for the future integration of ML/DL tools into clinical practice.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgments
The authors would like to thank the Thailand Science Research and Innovation Fund (Fundamental Fund 2025, Grant No. 5025/2567) and the School of Medicine, University of Phayao.
Author Contributions
Conceptualization: WL. Methodology: WL. Data curation: WL, STC, PS, WC, NT, PT, IH. Formal analysis: WL, WC. Visualization: WL, STC, PS, WC, NT, PT, IH. Project administration: WL. Writing–original draft preparation: WL. Writing–review and editing: WL. Supervision: WL. Final approval of the manuscript: all authors.