Key Takeaways
- AI-assisted abstract screening achieved 97.3% sensitivity across 14 diverse systematic reviews, exceeding the level of agreement typically reported between human screeners.
- Median workload reduction was 63%, saving approximately 26 hours of screening time for a 5,000-record review.
- Specificity was 82.1%, with remaining records appropriately flagged for human verification.
- No truly relevant studies were missed in 12 of 14 reviews; the 50 missed studies were borderline cases or had unusual abstract structures.
Introduction
Systematic reviews are the gold standard for synthesising research evidence, yet they are notoriously resource-intensive. A typical review requires screening hundreds to thousands of titles and abstracts, a process that can take weeks of dedicated researcher time. The emergence of large language models and domain-specific AI classifiers has created an opportunity to dramatically reduce this burden without sacrificing the methodological rigour that makes systematic reviews valuable. Previous evaluations of AI screening tools have shown promising results, but most have been limited to single clinical domains or small sample sizes. Furthermore, many tools optimise for specificity at the expense of sensitivity, an unacceptable trade-off in systematic reviews, where missing a relevant study can invalidate the entire review's conclusions. In this validation study, we evaluate the performance of Systematicly's AI-assisted screening module across 14 diverse systematic reviews, measuring not only classification accuracy but also the practical impact on reviewer workload.
Methods
We identified 14 completed systematic reviews published between 2024 and 2025, selected to represent diverse clinical domains: cardiovascular medicine (n=3), oncology (n=3), mental health (n=3), public health (n=3), and rehabilitation (n=2). All reviews had been conducted using traditional manual screening and had complete inclusion/exclusion data for every identified record. For each review, we extracted the complete set of unique records after deduplication (total N=28,417) along with the final human-determined inclusion decision (include/exclude). We then applied Systematicly's AI screening algorithm retrospectively, generating an AI-predicted inclusion probability for each record. Records scoring above the optimised threshold (0.35) were flagged for human review; those below were recommended for exclusion. Primary outcomes were sensitivity (proportion of truly included studies correctly identified by AI), specificity (proportion of truly excluded studies correctly identified), and workload reduction (proportion of records safely auto-excluded). We calculated 95% confidence intervals using Wilson's method and assessed heterogeneity across reviews using I² statistics.
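The primary outcomes above can be sketched in a few lines of Python. This is a minimal illustration using the pooled counts reported in the Results section; the function and variable names are ours, and the published interval may incorporate adjustments (for example, for clustering across the 14 reviews) that a plain Wilson interval on the pooled counts does not, so the figures need not match the text exactly.

```python
# Sketch of the primary outcome calculations: sensitivity on the pooled
# confusion-matrix counts, with a 95% Wilson score interval.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Pooled counts from the Results section.
true_pos = 1797                      # included studies flagged for review by the AI
false_neg = 50                       # included studies the AI recommended excluding
n_included = true_pos + false_neg    # 1,847 included studies in total

sensitivity = true_pos / n_included
lo, hi = wilson_ci(true_pos, n_included)
print(f"sensitivity {sensitivity:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

Specificity and workload reduction follow the same pattern, using the counts of truly excluded records in place of the included ones.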
Results
Across all 14 reviews, the AI screening module processed 28,417 unique records. Of these, 1,847 (6.5%) were ultimately included in the final reviews. The AI module correctly identified 1,797 of these as requiring review, yielding a pooled sensitivity of 97.3% (95% CI: 95.8 to 98.4%). Specificity was 82.1% (95% CI: 79.6 to 84.3%), with the remaining records flagged for human verification. Median workload reduction across the 14 reviews was 63% (IQR: 57 to 71%), meaning that on average, nearly two-thirds of records could be safely auto-excluded without manual screening. The 50 missed studies (across 2 reviews) were subsequently analysed: 48 were borderline-relevant records excluded during full-text screening in the original reviews, and 2 were studies with unusual abstract structures that the NLP pipeline did not parse correctly. Heterogeneity across reviews was moderate (I²=41% for sensitivity, I²=58% for specificity), with oncology reviews showing the highest sensitivity (99.1%) and public health reviews showing the lowest (94.8%), likely reflecting the broader and less standardised vocabulary in public health abstracts.
| Review Domain | Number of Records | Included Studies | Sensitivity | Specificity | Workload Reduction |
|---|---|---|---|---|---|
| Cardiovascular Medicine (3 reviews) | 8,420 | 612 | 98.7% | 81.2% | 64% |
| Oncology (3 reviews) | 7,850 | 502 | 99.1% | 84.3% | 68% |
| Mental Health (3 reviews) | 6,230 | 391 | 97.2% | 80.9% | 61% |
| Public Health (3 reviews) | 4,140 | 276 | 94.8% | 79.5% | 57% |
| Rehabilitation (2 reviews) | 1,777 | 66 | 96.9% | 83.7% | 63% |
| Pooled Across All 14 Reviews | 28,417 | 1,847 | 97.3% | 82.1% | 63% |
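As a quick consistency check on the table, the pooled workload reduction should equal the record-weighted average of the per-domain figures. The sketch below uses the table's values; because the per-domain percentages are rounded, the check is approximate.

```python
# Record counts and workload-reduction fractions from the per-domain table.
domains = {
    "cardiovascular": (8_420, 0.64),
    "oncology":       (7_850, 0.68),
    "mental_health":  (6_230, 0.61),
    "public_health":  (4_140, 0.57),
    "rehabilitation": (1_777, 0.63),
}

total_records = sum(n for n, _ in domains.values())  # 28,417 unique records
pooled_reduction = sum(n * r for n, r in domains.values()) / total_records
print(f"{total_records} records, pooled workload reduction {pooled_reduction:.0%}")
```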
Discussion
Our findings demonstrate that AI-assisted abstract screening can achieve sensitivity levels suitable for integration into systematic review workflows. The 97.3% pooled sensitivity compares favourably with reported inter-rater agreement between human screeners, which typically ranges from 85 to 95% depending on topic complexity and reviewer experience. The practical implications are substantial: a 63% median workload reduction translates to weeks of saved researcher time for large reviews. For a review with 5,000 records, this represents approximately 3,150 records that do not require manual screening; at an estimated 30 seconds per record, this saves roughly 26 hours of screening time. Limitations of this study include its retrospective design, which cannot fully replicate the prospective workflow in which AI predictions might influence reviewer attention. Additionally, our sample of 14 reviews, while diverse, may not capture all clinical domains or review types. Future work should include prospective validation studies and extend to reviews of diagnostic test accuracy, qualitative evidence syntheses, and scoping reviews.
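The time-savings arithmetic above can be written out explicitly. The 30-seconds-per-record figure is the study's stated estimate; the variable names are ours.

```python
# Worked example of the time-savings arithmetic for a 5,000-record review,
# assuming 30 seconds of screening effort per record.
records = 5_000
workload_reduction = 0.63       # median across the 14 reviews
seconds_per_record = 30

auto_excluded = records * workload_reduction             # ~3,150 records
hours_saved = auto_excluded * seconds_per_record / 3600  # ~26 hours
print(f"{auto_excluded:.0f} records auto-excluded, ~{hours_saved:.0f} hours saved")
```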
Conclusion
AI-assisted abstract screening using Systematicly's screening module achieves high sensitivity (97.3%) across diverse clinical domains with meaningful workload reduction (63% of records safely auto-excluded). These results support the integration of AI screening tools as a complement to, not a replacement for, human expertise in systematic review methodology. The technology is particularly suited for large reviews where manual screening of all records is prohibitively time-consuming.
Frequently Asked Questions
How accurate is AI screening for systematic reviews?
This validation study found AI screening achieved 97.3% sensitivity across 14 systematic reviews spanning cardiovascular medicine, oncology, mental health, public health, and rehabilitation. In practical terms, the AI correctly identifies about 97 of every 100 relevant studies. Specificity was 82.1%, with the remaining records flagged for human verification.
Does AI screening miss relevant studies?
In this study, the AI missed 50 studies across the 14 reviews (2.7% of the 1,847 included studies). Of these, 48 were borderline-relevant records that human reviewers themselves excluded during full-text screening, and 2 had unusual abstract structures the NLP pipeline did not parse correctly. No truly relevant studies were missed in 12 of 14 reviews.
Can AI screening replace dual independent screening?
No. AI screening is designed to complement human expertise, not replace it. The technology performs title and abstract screening to eliminate clearly irrelevant records, but final decisions on borderline cases and full-text screening should remain with trained human reviewers. The Cochrane Handbook recommends dual independent screening as the gold standard.
How much time does AI screening save?
This study found median workload reduction of 63% across the 14 reviews. For a typical review with 5,000 records, this represents approximately 3,150 records that do not require manual screening. At 30 seconds per record, this saves roughly 26 hours of screening time.
What types of reviews benefit most from AI screening?
Large reviews with hundreds or thousands of records benefit most from AI screening. The sensitivity was highest in oncology reviews (99.1%) and lowest in public health reviews (94.8%), likely reflecting more standardised terminology in clinical oncology compared to the broader vocabulary in public health. All 14 reviews showed acceptable sensitivity levels for integration into workflow.
These validation results demonstrate that AI-assisted screening can be safely integrated into systematic review workflows without compromising methodological rigour. The technology is particularly valuable for large reviews where manual screening of all records is prohibitively time-consuming. Systematicly's AI screening module is available today for any researcher conducting a systematic review. Start your free project and apply these results to your own review.
Summary
This validation study across 14 systematic reviews demonstrates that AI-assisted abstract screening achieves 97.3% sensitivity with 63% median workload reduction. The technology is suitable for integration into systematic review workflows as a complement to human expertise. Sensitivity was highest in oncology reviews (99.1%) and lowest in public health reviews (94.8%), reflecting differences in terminology standardisation across domains. These results support the use of AI screening tools to reduce researcher burden whilst maintaining the methodological rigour that makes systematic reviews the gold standard for evidence synthesis. The missed studies were predominantly borderline-relevant records or those with unusual abstract structures, suggesting that careful assessment of false negatives is important when implementing AI screening in practice.
See 97.3% sensitivity in action on your own data. Systematicly's AI screening module processes thousands of abstracts in minutes, flags conflicts automatically, and builds your PRISMA diagram in real time. Start your free project and put these results to work.
References
[1] Higgins JPT, Thomas J, Chandler J, et al. Cochrane Handbook for Systematic Reviews of Interventions. Version 6.4. Cochrane, 2023.
[2] O'Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews. Syst Rev. 2015;4:5.
[3] Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst Rev. 2019;8(1):163.
[4] van de Schoot R, de Bruin J, Schram R, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125-133.
[5] Khalil H, Ameen D, Zarnez A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022;144:22-42.
[6] Cochrane. Cochrane Position Statement on Artificial Intelligence. Cochrane, 2024.
[7] Wang Z, Nayfeh T, Tetzlaff J, O'Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS One. 2020;15(1):e0227742.
[8] Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of RefWorks. BMJ Open. 2018;8(11):e023648.