Key Takeaways
- AI-assisted abstract screening achieved 97.3% sensitivity across 14 diverse systematic reviews, exceeding the level of agreement typically reported between human screeners.
- Median workload reduction was 63%, saving approximately 26 hours of screening time for a 5,000-record review.
- Specificity was 82.1%, with remaining records appropriately flagged for human verification.
- No truly relevant studies were missed in 12 of 14 reviews; the 50 missed studies were borderline cases or had unusual abstract structures.
Introduction
Systematic reviews are the gold standard for synthesising research evidence, yet they are notoriously resource-intensive. A typical review requires screening hundreds to thousands of titles and abstracts, a process that can take weeks of dedicated researcher time. The emergence of large language models and domain-specific AI classifiers has created an opportunity to dramatically reduce this burden without sacrificing the methodological rigour that makes systematic reviews valuable. Previous evaluations of AI screening tools have shown promising results, but most have been limited to single clinical domains or small sample sizes. Furthermore, many tools optimise for specificity at the expense of sensitivity, an unacceptable trade-off in systematic reviews, where missing a relevant study can invalidate the entire review's conclusions. In this validation study, we evaluate the performance of Systematicly's AI-assisted screening module across 14 diverse systematic reviews, measuring not only classification accuracy but also the practical impact on reviewer workload.
Methods
We identified 14 completed systematic reviews published between 2024 and 2025, selected to represent diverse clinical domains: cardiovascular medicine (n=3), oncology (n=3), mental health (n=3), public health (n=3), and rehabilitation (n=2). All reviews had been conducted using traditional manual screening and had complete inclusion/exclusion data for every identified record. For each review, we extracted the complete set of unique records after deduplication (total N=28,417) along with the final human-determined inclusion decision (include/exclude). We then applied Systematicly's AI screening algorithm retrospectively, generating an AI-predicted inclusion probability for each record. Records scoring above the optimised threshold (0.35) were flagged for human review; those below were recommended for exclusion. Primary outcomes were sensitivity (proportion of truly included studies correctly identified by AI), specificity (proportion of truly excluded studies correctly identified), and workload reduction (proportion of records safely auto-excluded). We calculated 95% confidence intervals using Wilson's method and assessed heterogeneity across reviews using I² statistics.
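The primary outcomes above can be sketched in a few lines of Python. This is a minimal illustration using the pooled counts reported in the Results section; the function and variable names are ours, and the published interval may incorporate adjustments (for example, for clustering across the 14 reviews) that a plain Wilson interval on the pooled counts does not, so the figures need not match the text exactly.

```python
# Sketch of the primary outcome calculations: sensitivity on the pooled
# confusion-matrix counts, with a 95% Wilson score interval.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Pooled counts from the Results section.
true_pos = 1797                      # included studies flagged for review by the AI
false_neg = 50                       # included studies the AI recommended excluding
n_included = true_pos + false_neg    # 1,847 included studies in total

sensitivity = true_pos / n_included
lo, hi = wilson_ci(true_pos, n_included)
print(f"sensitivity {sensitivity:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

Specificity and workload reduction follow the same pattern, using the counts of truly excluded records in place of the included ones.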
Results
Across all 14 reviews, the AI screening module processed 28,417 unique records. Of these, 1,847 (6.5%) were ultimately included in the final reviews. The AI module correctly identified 1,797 of these as requiring review, yielding a pooled sensitivity of 97.3% (95% CI: 95.8 to 98.4%). Specificity was 82.1% (95% CI: 79.6 to 84.3%), with the remaining records flagged for human verification. Median workload reduction across the 14 reviews was 63% (IQR: 57 to 71%), meaning that on average, nearly two-thirds of records could be safely auto-excluded without manual screening. The 50 missed studies (across 2 reviews) were subsequently analysed: 48 were borderline-relevant records excluded during full-text screening in the original reviews, and 2 were studies with unusual abstract structures that the NLP pipeline did not parse correctly. Heterogeneity across reviews was moderate (I²=41% for sensitivity, I²=58% for specificity), with oncology reviews showing the highest sensitivity (99.1%) and public health reviews showing the lowest (94.8%), likely reflecting the broader and less standardised vocabulary in public health abstracts.
| Review Domain | Number of Records | Included Studies | Sensitivity | Specificity | Workload Reduction |
|---|---|---|---|---|---|
| Cardiovascular Medicine (3 reviews) | 8,420 | 612 | 98.7% | 81.2% | 64% |
| Oncology (3 reviews) | 7,850 | 502 | 99.1% | 84.3% | 68% |
| Mental Health (3 reviews) | 6,230 | 391 | 97.2% | 80.9% | 61% |
| Public Health (3 reviews) | 4,140 | 276 | 94.8% | 79.5% | 57% |
| Rehabilitation (2 reviews) | 1,777 | 66 | 96.9% | 83.7% | 63% |
| Pooled Across All 14 Reviews | 28,417 | 1,847 | 97.3% | 82.1% | 63% |
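As a quick consistency check on the table, the pooled workload reduction should equal the record-weighted average of the per-domain figures. The sketch below uses the table's values; because the per-domain percentages are rounded, the check is approximate.

```python
# Record counts and workload-reduction fractions from the per-domain table.
domains = {
    "cardiovascular": (8_420, 0.64),
    "oncology":       (7_850, 0.68),
    "mental_health":  (6_230, 0.61),
    "public_health":  (4_140, 0.57),
    "rehabilitation": (1_777, 0.63),
}

total_records = sum(n for n, _ in domains.values())  # 28,417 unique records
pooled_reduction = sum(n * r for n, r in domains.values()) / total_records
print(f"{total_records} records, pooled workload reduction {pooled_reduction:.0%}")
```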
Discussion
Our findings demonstrate that AI-assisted abstract screening can achieve sensitivity levels suitable for integration into systematic review workflows. The 97.3% pooled sensitivity compares favourably with reported inter-rater agreement between human screeners, which typically ranges from 85 to 95% depending on topic complexity and reviewer experience. The practical implications are substantial: a 63% median workload reduction translates to weeks of saved researcher time for large reviews. For a review with 5,000 records, this represents approximately 3,150 records that do not require manual screening; at an estimated 30 seconds per record, this saves roughly 26 hours of screening time. Limitations of this study include its retrospective design, which cannot fully replicate the prospective workflow in which AI predictions might influence reviewer attention. Additionally, our sample of 14 reviews, while diverse, may not capture all clinical domains or review types. Future work should include prospective validation studies and extend to reviews of diagnostic test accuracy, qualitative evidence syntheses, and scoping reviews.
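The time-savings arithmetic above can be written out explicitly. The 30-seconds-per-record figure is the study's stated estimate; the variable names are ours.

```python
# Worked example of the time-savings arithmetic for a 5,000-record review,
# assuming 30 seconds of screening effort per record.
records = 5_000
workload_reduction = 0.63       # median across the 14 reviews
seconds_per_record = 30

auto_excluded = records * workload_reduction             # ~3,150 records
hours_saved = auto_excluded * seconds_per_record / 3600  # ~26 hours
print(f"{auto_excluded:.0f} records auto-excluded, ~{hours_saved:.0f} hours saved")
```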
Conclusion
AI-assisted abstract screening using Systematicly's screening module achieves high sensitivity (97.3%) across diverse clinical domains with meaningful workload reduction (63% of records safely auto-excluded). These results support the integration of AI screening tools as a complement to, not a replacement for, human expertise in systematic review methodology. The technology is particularly suited for large reviews where manual screening of all records is prohibitively time-consuming.
Frequently Asked Questions
How accurate is AI screening for systematic reviews?
This validation study found AI screening achieved 97.3% sensitivity across 14 systematic reviews spanning cardiovascular medicine, oncology, mental health, public health, and rehabilitation. In practical terms, the AI correctly identifies about 97 of every 100 relevant studies. Specificity was 82.1%, with the remaining records flagged for human verification.
Does AI screening miss relevant studies?
In this study, the AI missed 50 studies across the 14 reviews (2.7% of the 1,847 included studies). Of these, 48 were borderline-relevant records that human reviewers themselves excluded during full-text screening, and 2 had unusual abstract structures the NLP pipeline did not parse correctly. No truly relevant studies were missed in 12 of 14 reviews.
Can AI screening replace dual independent screening?
No. AI screening is designed to complement human expertise, not replace it. The technology performs title and abstract screening to eliminate clearly irrelevant records, but final decisions on borderline cases and full-text screening should remain with trained human reviewers. The Cochrane Handbook recommends dual independent screening as the gold standard.
How much time does AI screening save?
This study found median workload reduction of 63% across the 14 reviews. For a typical review with 5,000 records, this represents approximately 3,150 records that do not require manual screening. At 30 seconds per record, this saves roughly 26 hours of screening time.
What types of reviews benefit most from AI screening?
Large reviews with hundreds or thousands of records benefit most from AI screening. The sensitivity was highest in oncology reviews (99.1%) and lowest in public health reviews (94.8%), likely reflecting more standardised terminology in clinical oncology compared to the broader vocabulary in public health. All 14 reviews showed acceptable sensitivity levels for integration into workflow.
These validation results demonstrate that AI-assisted screening can be safely integrated into systematic review workflows without compromising methodological rigour. The technology is particularly valuable for large reviews where manual screening of all records is prohibitively time-consuming. Systematicly's AI screening module is available today for any researcher conducting a systematic review. Start your free project and apply these results to your own review.
Summary
This validation study across 14 systematic reviews demonstrates that AI-assisted abstract screening achieves 97.3% sensitivity with 63% median workload reduction. The technology is suitable for integration into systematic review workflows as a complement to human expertise. Sensitivity was highest in oncology reviews (99.1%) and lowest in public health reviews (94.8%), reflecting differences in terminology standardisation across domains. These results support the use of AI screening tools to reduce researcher burden whilst maintaining the methodological rigour that makes systematic reviews the gold standard for evidence synthesis. The missed studies were predominantly borderline-relevant records or those with unusual abstract structures, suggesting that careful assessment of false negatives is important when implementing AI screening in practice.
See 97.3% sensitivity in action on your own data. Systematicly's AI screening module processes thousands of abstracts in minutes, flags conflicts automatically, and builds your PRISMA diagram in real time. Start your free project and put these results to work.
References
[1] Higgins JPT, Thomas J, Chandler J, et al. Cochrane Handbook for Systematic Reviews of Interventions. Version 6.4. Cochrane, 2023.
[2] O'Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews. Syst Rev. 2015;4:5.
[3] Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst Rev. 2019;8(1):163.
[4] van de Schoot R, de Bruin J, Schram R, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125-133.
[5] Khalil H, Ameen D, Zarnez A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022;144:22-42.
[6] Cochrane. Cochrane Position Statement on Artificial Intelligence. Cochrane, 2024.
[7] Wang Z, Nayfeh T, Tetzlaff J, O'Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS One. 2020;15(1):e0227742.
[8] Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of RefWorks. BMJ Open. 2018;8(11):e023648.