Introduction
Kyphoplasty has become a principal intervention for vertebral compression fractures (VCFs), particularly amid growing concerns over the efficacy and safety of vertebroplasty [1–4]. This minimally invasive procedure involves balloon inflation within the fractured vertebra to restore height, followed by cement augmentation, thereby achieving the dual benefits of vertebral stabilization and deformity correction [5,6]. Between 2007 and 2014, the use of both vertebroplasty and kyphoplasty declined, although vertebroplasty showed a sharper decrease. This trend reflects evolving clinical guidelines, increasing scrutiny of vertebroplasty outcomes, and the perception of kyphoplasty as a safer and more effective alternative [7]. However, debate persists regarding kyphoplasty’s long-term benefits, cost-effectiveness, and comparative efficacy relative to non-surgical and emerging conservative treatments [6,8–10].
Central to this debate is the quality and statistical rigor of randomized controlled trials (RCTs), which form the cornerstone of evidence-based clinical practice [11]. Although p-values remain the conventional measure of statistical significance, excessive reliance on them can obscure the true robustness of trial results. Because p-values are highly sensitive to sample size, data variability, and methodological nuances, a statistically significant result may hinge on only a few event reversals, which challenges the stability of clinical conclusions [12]. Fragility metrics such as the fragility index (FI), reverse fragility index (rFI), and fragility quotient (FQ) provide a more nuanced understanding of the stability of study outcomes [13,14]. These metrics quantify the smallest number of event changes required to alter a result from significant to nonsignificant (or vice versa), thereby offering a tangible measure of a study’s statistical robustness. Although fragility analyses have been applied in select areas of spine surgery, the literature concerning kyphoplasty remains limited [15–17]. Given kyphoplasty’s widespread use in managing VCFs, it is crucial to critically assess the reliability of the RCT evidence that supports its clinical application.
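To make the fragility index concrete, the following is a minimal sketch and not the procedure used in this study. It assumes a two-sided Fisher's exact test at a significance threshold of 0.05 and converts non-events to events, one patient at a time, in the arm with the lower event rate, which is one common formulation of the FI. The function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms; columns are events and non-events.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first arm
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of non-event -> event flips needed to lose significance.

    Assumption of this sketch: flips are made in the arm with the lower
    event rate. Returns 0 when the result is already nonsignificant.
    """
    a, c = events_a, events_b
    flip_a = a * n_b <= c * n_a  # True if arm A has the lower event rate
    flips = 0
    while fisher_exact_two_sided(a, n_a - a, c, n_b - c) < alpha:
        if flip_a and a < n_a:
            a += 1
        elif not flip_a and c < n_b:
            c += 1
        else:
            break  # no patients left to flip in the chosen arm
        flips += 1
    return flips


# Hypothetical trial: 2/50 events in one arm versus 12/50 in the other
print(fragility_index(2, 50, 12, 50))
```

The reverse fragility index mirrors this loop for a nonsignificant result, counting the flips needed to push the p-value below the threshold.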
The objective of this study was to conduct a comprehensive fragility analysis of RCTs evaluating the efficacy and safety of kyphoplasty. By applying fragility metrics, we aimed to assess the robustness of existing trial data and identify potential vulnerabilities in the evidence base that informs clinical practice. We hypothesized that RCTs investigating kyphoplasty would demonstrate considerable statistical fragility, emphasizing the need for cautious interpretation of findings and the development of more rigorous study designs.
Results
The initial database searches yielded 550 studies after duplicate removal. Following title and abstract screening, 150 studies were excluded. The remaining 159 full-text articles were assessed, and 36 RCTs published across 22 journals met the inclusion criteria for final analysis. A PRISMA flow diagram summarizing the screening process and literature selection is displayed in Fig. 2.
Across the 36 included RCTs, a total of 282 dichotomous outcomes were identified (Table 3), comprising 18 statistically significant (p<0.05) and 264 nonsignificant results. For all outcomes combined, the median FI was 4 (IQR, 4–5), and the median FQ was 0.020 (IQR, 0.013–0.040), suggesting that a change in outcomes for approximately 2% of participants could alter the statistical conclusions of these trials. Among the 18 significant outcomes, the median FI was 2 (IQR, 1–4), with a corresponding FQ of 0.015 (IQR, 0.011–0.029). In contrast, the 264 nonsignificant outcomes demonstrated a median FI of 5 (IQR, 4–5) and a median FQ of 0.020 (IQR, 0.013–0.041).
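The link between an FQ and a percentage of participants follows from the definition of the FQ as the FI divided by the trial's total sample size. The short sketch below illustrates that arithmetic with a hypothetical trial (an FI of 4 among 200 participants); the function name and the numbers are assumptions chosen for illustration, not data from the included RCTs.

```python
def fragility_quotient(fi, sample_size):
    """Fragility quotient: FI divided by the total number of participants."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return fi / sample_size


# Hypothetical trial: FI of 4 with 200 participants
fq = fragility_quotient(4, 200)
print(f"FQ = {fq:.3f} ({fq:.1%} of participants)")  # FQ = 0.020 (2.0% of participants)
```

Because each outcome has its own FI and sample size, a median FQ is the median of these per-outcome ratios and will generally differ from the median FI divided by a median sample size.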
Subgroup analyses of the reported outcomes are summarized in Table 4. Trials comparing kyphoplasty with vertebroplasty represented the most fragile subgroup, with a median FI of 5 (IQR, 4–5) and a median FQ of 0.013 (IQR, 0.010–0.029), suggesting that the reversal of just 1.3% of patients would alter the statistical significance of results. The next most fragile subgroup was complications, with a median FI of 5 (IQR, 4–5) and a median FQ of 0.016 (IQR, 0.013–0.020). Fracture-related outcomes had a median FI of 5 (IQR, 4–6) and a median FQ of 0.020 (IQR, 0.013–0.038), while pain outcomes had a median FI of 4 (IQR, 3–5) and a median FQ of 0.025 (IQR, 0.014–0.039). The least fragile subgroup was cement leakage, with a median FI of 4 (IQR, 3–5) and a median FQ of 0.048 (IQR, 0.029–0.061). Complications were the most frequently reported outcomes (n=142), whereas pain-related outcomes were the least reported (n=24) (Table 4). Notably, the number of patients lost to follow-up exceeded the FI in 148 of the 282 outcomes (52.48%) (Table 3), suggesting that the inclusion of these patients could have reversed the statistical significance in more than half of the reported outcomes.
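The attrition comparison reported above amounts to checking, outcome by outcome, whether the number of patients lost to follow-up is greater than the FI. A minimal sketch of that check is shown here; the field names and the four example outcomes are hypothetical, and only the final line uses figures from the text.

```python
def share_exceeding_fi(outcomes):
    """Fraction of outcomes whose loss to follow-up is greater than their FI."""
    flagged = sum(1 for o in outcomes if o["lost_to_follow_up"] > o["fi"])
    return flagged / len(outcomes)


# Hypothetical outcomes, for illustration only
outcomes = [
    {"fi": 4, "lost_to_follow_up": 12},  # flagged
    {"fi": 5, "lost_to_follow_up": 3},
    {"fi": 2, "lost_to_follow_up": 9},   # flagged
    {"fi": 5, "lost_to_follow_up": 5},   # equal to FI, so not flagged
]
print(f"{share_exceeding_fi(outcomes):.0%}")  # 50%

# Arithmetic check of the proportion reported in the text: 148 of 282 outcomes
print(round(100 * 148 / 282, 2))  # 52.48
```

A strict inequality is used so that an outcome is flagged only when the missing patients outnumber the flips needed to change the result.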
Discussion
Recent evaluations of spine surgery literature have revealed a recurring pattern in which statistically significant findings often lack robustness, calling into question the reliability of reported treatment effects [15,20]. Although VCFs are common among older adults and kyphoplasty represents a substantial healthcare expenditure, few studies have examined the statistical fragility of kyphoplasty-related outcomes [7,21]. The present study addresses this gap by systematically assessing the fragility of RCTs that evaluate kyphoplasty outcomes. Among available fragility metrics, the FQ provides a distinct advantage by accounting for sample size, thereby addressing a key limitation of the FI and the rFI, both of which disregard the proportion of affected patients [22]. Accordingly, median FQ was selected as the principal measure of outcome stability in this analysis. Our results demonstrated that kyphoplasty RCTs, particularly those comparing kyphoplasty with vertebroplasty and those assessing complication rates, exhibited considerable statistical fragility, suggesting that even minimal changes in patient outcomes could reverse the reported significance of many findings.
The median FQ across all kyphoplasty RCT outcomes was 0.02 (IQR, 0.013–0.040), indicating that a reversal in outcomes for only 2% of participants would be sufficient to alter statistical significance. This degree of fragility is comparable to that reported in a review of 32 Food and Drug Administration investigational device exemption trials in spine surgery, which demonstrated an average FQ of 0.027 [23]. Similarly, Muthu and Ramakrishnan [15] examined 70 spine surgery RCTs and observed an even lower median FQ of 0.0148 (IQR, 0–0.033). In that study, 27.1% of outcomes lost statistical significance when reanalyzed using two-sided Fisher’s exact tests, resulting in an FI of 0 for those trials. Comparable trends have been identified in orthopedic literature, where even FQs as high as 9.7% in distal radius fracture RCTs have been characterized as fragile [24]. Collectively, these findings suggest that the fragility observed in kyphoplasty RCTs aligns with the broader pattern of statistical instability reported across spine and orthopedic surgery research.
Statistically significant results had a median FQ of 0.015 (IQR, 0.011–0.029), 25% lower than the median FQ of 0.02 (IQR, 0.013–0.041) for nonsignificant outcomes. Moreover, in more than half of all analyzed outcomes, the number of patients lost to follow-up exceeded the number of event reversals required to change the result (the FI). For instance, a multicenter RCT by Beall et al. [25] involving 285 patients reported fewer cardiovascular-related unplanned readmissions for a PEEK (polyether ether ketone) implant compared with balloon kyphoplasty. In that trial, 32 patients were lost to follow-up, whereas the reversal of outcomes in just 1.1 patients (0.39%) would have nullified the statistical significance. This example illustrates the risk of overinterpreting significant findings in studies with high attrition, particularly when statistical fragility is not considered. Such limitations may contribute to overly optimistic conclusions that influence clinical recommendations and promote interventions whose true efficacy or safety remains uncertain.
Our findings also highlight the intrinsic relationship between trial design characteristics and statistical fragility. Smaller sample sizes inherently reduce the number of event reversals needed to alter statistical significance, resulting in low FIs and conclusions that may exaggerate treatment effects. High attrition amplifies this vulnerability. In more than half of the analyzed outcomes, the number of patients lost to follow-up exceeded the FI, indicating that complete follow-up alone could have reversed the study’s conclusion. These methodological limitations (limited sample size and incomplete follow-up) collectively contribute to the fragility of kyphoplasty RCTs and may partially explain the variability observed across outcome subgroups, particularly those related to pain and complications. Future trials should therefore ensure adequate sample sizes, minimize loss to follow-up, and incorporate fragility analysis into the reporting of dichotomous outcomes.
We also observed considerable variability in statistical fragility across different outcome types. Pain and cement leakage outcomes demonstrated the highest median FQs but were associated with the lowest median FIs, indicating that even these relatively stable endpoints were susceptible to reversal with only a few event changes. Complication outcomes, by contrast, exhibited particularly low FQs. Even for cement leakage, the most robust subgroup, the median FQ was only 0.048 (IQR, 0.029–0.061), suggesting persistent vulnerability. Notably, trials comparing kyphoplasty with vertebroplasty showed the lowest FQs overall. This finding is concerning, as many studies advocating for kyphoplasty’s clinical superiority rely heavily on comparative outcomes such as postoperative pain, cement leakage, or fracture incidence [26–28].
An illustrative example is the trial by Vogl et al. [29], which compared a cement-directed kyphoplasty system with vertebroplasty in 77 patients with painful osteoporotic VCFs. The authors reported that kyphoplasty resulted in significantly less cement leakage. However, our analysis indicated that a change in outcomes for only 2.9% of patients would have nullified this finding. This observation raises concern regarding the reliability of frequently cited differences between kyphoplasty and vertebroplasty. Replication of these results in larger, methodologically rigorous trials is essential to determine whether the observed outcome differences are clinically meaningful or merely statistically unstable. Accordingly, there is a pressing need for more robust RCTs comparing kyphoplasty and vertebroplasty, given the significant implications these findings hold for treatment selection, patient safety, and postoperative quality of life.
As investigations continue to explore the comparative merits of nonoperative care, kyphoplasty, and vertebroplasty for managing VCFs, the present findings underscore the need to strengthen the methodological foundation of future research [30]. Including fragility measures such as the FQ alongside conventional p-values, coupled with concerted efforts to minimize attrition, can help reduce the overestimation of treatment effects and enhance the interpretability of outcomes. These methodological improvements will improve the reliability of future kyphoplasty RCTs and contribute to more robust and transparent evidence across procedural outcome research in spine surgery.
This study has several important limitations. First, the analysis was confined to dichotomous outcomes, as fragility metrics are best validated for categorical data. Continuous variables such as pain intensity scores or functional assessments may display different patterns of fragility and should be evaluated in future investigations. Second, there is no universally accepted threshold defining what constitutes a “fragile” FI or FQ within spine surgery, which may limit interpretability. Nonetheless, our emphasis on the FQ and its relationship with patient attrition provides a clinically meaningful context for understanding result stability. Third, because the fragility analysis relied exclusively on p-values, it did not account for other critical aspects of trial quality, including randomization procedures, inclusion criteria, blinding methods, or variability in patient populations, all of which are integral to the broader reliability and validity of study findings.
Another concern relates to follow-up reporting. Although the number of patients lost to follow-up was recorded for each study, few trials detailed the reasons for attrition, making it difficult to assess how missing data may have influenced fragility estimates. This issue reflects the limitations of primary study reporting rather than the fragility analysis itself, but highlights the need for greater transparency in clinical trial methodology. Lastly, by focusing exclusively on dichotomous outcomes, this analysis did not incorporate other data types frequently used to assess kyphoplasty efficacy, which may restrict the generalizability of our conclusions. Future work to adapt fragility methods for continuous outcomes would be valuable.