Screening studies for a systematic review is the process of evaluating every record your search retrieves against a set of predefined criteria to decide what stays and what goes. It sounds simple. In practice, it is the single most time-consuming phase of any review, often accounting for 40 to 60 percent of total project hours.[1] It is also where the greatest risk of bias lives: miss one relevant study and your conclusions may shift; include one irrelevant study and your analysis gets muddier. Getting screening right is not optional. It is foundational.
Key Takeaways
- Screening happens in two stages: title and abstract screening first, then full-text screening of the shortlist.
- Dual independent screening with conflict resolution is the gold standard recommended by the Cochrane Handbook.
- Clear, testable inclusion and exclusion criteria (built from your PICO framework) prevent subjective decisions.
- AI-assisted screening can reduce workload by 60 percent or more while maintaining sensitivity above 95 percent.
- Every screening decision should be documented for PRISMA reporting and audit trail purposes.
What Is Study Screening in a Systematic Review?
Study screening is the structured evaluation of identified records to determine which meet your review's eligibility criteria. The Cochrane Handbook defines it as the process of assessing each record retrieved by the search strategy against predefined inclusion and exclusion criteria.[2] In practical terms, you start with thousands of records and progressively filter down to the dozens or hundreds that actually answer your research question.
The process is divided into two distinct stages. Title and abstract screening is the first pass, where reviewers read the title and abstract of each record and make a quick include, exclude, or uncertain decision. Full-text screening comes next, where the shortlisted records are read in their entirety and assessed against detailed eligibility criteria. According to Lefebvre and colleagues, this two-stage approach balances thoroughness with efficiency.[3]
The numbers involved can be daunting. A typical systematic review in medicine retrieves between 1,000 and 10,000 records from database searches. After title and abstract screening, between 5 and 15 percent usually proceed to full-text review. After full-text screening, the final included studies might represent less than 2 percent of the original search results.
Writing Effective Inclusion and Exclusion Criteria
Inclusion and exclusion criteria are the rules that determine which studies enter your review. They must be specific enough to apply consistently, broad enough to capture all relevant evidence, and defined before screening begins.
The best criteria flow directly from your PICO framework:
| PICO Element | Example Criterion | Screening Application |
|---|---|---|
| Population | Adults aged 18+ with Type 2 diabetes | Exclude paediatric studies, Type 1 diabetes, gestational diabetes |
| Intervention | SGLT2 inhibitors (any dose, any duration) | Exclude studies of other drug classes without SGLT2 arm |
| Comparator | Placebo or active comparator | Exclude single-arm studies without a comparison group |
| Outcome | HbA1c change at 12+ weeks | Exclude studies not reporting HbA1c or with follow-up under 12 weeks |
| Study Design | Randomised controlled trials | Exclude observational studies, case reports, narrative reviews |
Common additional criteria include language restrictions, date ranges, and publication status. The Cochrane Handbook recommends against language restrictions where possible, as excluding non-English studies can introduce bias.[2] If you must restrict by language, document the rationale in your protocol.
Pilot Testing: Before screening your full dataset, pilot test your criteria on 50 to 100 records. This reveals ambiguous criteria, calibrates reviewers, and prevents having to re-screen thousands of records after discovering a criterion is too vague. Calculate inter-rater agreement (Cohen's kappa) during the pilot. A kappa below 0.60 suggests your criteria need refinement.
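Cohen's kappa can be computed directly from the two reviewers' pilot decisions. A minimal sketch, assuming simple include/exclude labels recorded as parallel lists (the function name and data are illustrative, not any tool's API):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters screening the same records.

    rater_a, rater_b: parallel lists of decisions, e.g. "include"/"exclude".
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of records where the raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal proportions.
    labels = set(rater_a) | set(rater_b)
    p_expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pilot of 10 records; the raters disagree on two.
a = ["include", "exclude", "exclude", "include", "exclude",
     "exclude", "include", "exclude", "exclude", "exclude"]
b = ["include", "exclude", "include", "include", "exclude",
     "exclude", "exclude", "exclude", "exclude", "exclude"]
print(round(cohens_kappa(a, b), 2))  # 0.52 — below 0.60, so recalibrate
```

Note that raw percentage agreement here is 80 percent, yet kappa is only 0.52 once chance agreement is discounted: this is exactly why kappa, not raw agreement, is the statistic to track during the pilot.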
Stage 1: Title and Abstract Screening
Title and abstract screening is the rapid first pass through your search results. The goal is to remove records that are clearly irrelevant while retaining anything that might be eligible.
The cardinal rule at this stage is: when in doubt, include. It is far better to carry a few extra records into full-text screening than to accidentally exclude a relevant study at the abstract stage. According to a study by Waffenschmidt and colleagues, even experienced reviewers miss between 2 and 5 percent of relevant records during title and abstract screening.[4]
Practical tips for efficient title and abstract screening:
- Screen in batches. Reviewing 200 to 300 abstracts per session with breaks reduces fatigue-related errors.
- Use a standardised form. Even a simple three-button interface (include, exclude, uncertain) is better than free-text notes.
- Record the reason for exclusion. While PRISMA only requires exclusion reasons at the full-text stage, tracking them at abstract level helps calibrate your team.
- Screen independently. Dual independent screening is the gold standard. Both reviewers should screen without seeing each other's decisions.
Speed varies widely. Manual screening of titles and abstracts typically takes 30 seconds to 2 minutes per record. For a review with 3,000 records screened by two reviewers, that is 50 to 200 hours of screening time before you even open a full-text PDF.
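The arithmetic behind those figures is simple to check. A quick sketch, using the per-record times quoted above:

```python
def screening_hours(n_records, n_reviewers, secs_per_record):
    """Total reviewer-hours for one screening pass."""
    return n_records * n_reviewers * secs_per_record / 3600

# 3,000 records, dual independent screening, 30 seconds to 2 minutes each.
low = screening_hours(3000, 2, secs_per_record=30)
high = screening_hours(3000, 2, secs_per_record=120)
print(low, high)  # 50.0 200.0 hours
```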
Dual Screening and Conflict Resolution
Dual independent screening means two reviewers independently assess every record, then compare decisions. The Cochrane Handbook strongly recommends this approach because single-reviewer screening consistently misses relevant studies.[2]
When two reviewers disagree on a record, that disagreement is called a conflict. Conflicts are resolved through one of three methods:
- Discussion between reviewers. The two reviewers review the record together and reach consensus. This is the most common approach.
- Third reviewer arbitration. A senior researcher makes the final decision. Used when discussion fails to resolve the conflict.
- Inclusive resolution. Any record that either reviewer includes moves forward. This maximises sensitivity at the cost of carrying more records to full-text screening.
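The three strategies can be expressed as a simple merge over the two reviewers' decisions. A hypothetical sketch (the function and labels are illustrative, not any screening tool's API): under discussion or arbitration a disagreement surfaces as a conflict needing human input, while inclusive resolution auto-resolves it to include.

```python
def resolve(decision_a, decision_b, strategy="discussion"):
    """Merge two reviewers' decisions on one record.

    Returns the agreed decision, "include" under inclusive resolution,
    or "conflict" when a human step (discussion or arbitration) is needed.
    """
    if decision_a == decision_b:
        return decision_a
    if strategy == "inclusive":
        # Either reviewer's include carries the record forward.
        return "include"
    return "conflict"

print(resolve("include", "exclude", strategy="inclusive"))  # include
print(resolve("include", "exclude"))                        # conflict
```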
Inter-rater reliability is measured using Cohen's kappa statistic. A kappa of 0.61 to 0.80 indicates substantial agreement; above 0.80 is almost perfect.[5] If your kappa is below 0.60 during early screening, stop and recalibrate. The problem is almost always ambiguous criteria, not incompetent reviewers.
Stage 2: Full-Text Screening
Full-text screening is the detailed assessment of each shortlisted study against your complete eligibility criteria. Unlike title and abstract screening, every exclusion at this stage must be documented with a specific reason.
PRISMA 2020 requires you to report the number of full-text articles excluded and the reasons for each exclusion, grouped by category.[6] Common exclusion reasons include wrong population, wrong intervention, wrong outcome, wrong study design, duplicate publication, and conference abstract only.
Full-text screening is slower per record but involves far fewer records. Expect to spend 5 to 15 minutes per full-text article, depending on complexity and how clearly the study reports its methods. For a typical review with 150 to 300 full texts to assess, this stage takes 15 to 75 hours across both reviewers.
Practical considerations at this stage:
- PDF retrieval. Not all full texts are freely available. Budget time for interlibrary loans, institutional access, and contacting authors for unpublished data.
- Multiple reports of the same study. A single trial may be published across multiple papers (protocol, primary results, secondary analyses, long-term follow-up). Link these together rather than treating them as separate studies.
- Borderline cases. When a study sits right on the boundary of your criteria, document why you included or excluded it. This transparency strengthens your review.
AI-Assisted Screening: How It Works and When to Use It
AI-assisted screening uses machine learning models to predict whether each record is likely to meet your inclusion criteria, based on its title and abstract text. The AI does not replace human judgement. It prioritises and triages, allowing reviewers to focus their attention where it matters most.
There are three main approaches to AI-assisted screening:
| Approach | How It Works | Typical Sensitivity |
|---|---|---|
| Active learning | Prioritises records most likely to be relevant; reviewer screens in order of predicted relevance | 90 to 95% |
| Semi-automated | AI flags a subset as safe to auto-exclude; reviewer screens the rest manually | 93 to 97% |
| Full AI triage | AI classifies every record as include/exclude; reviewer verifies AI decisions | 95 to 98% |
The critical metric for any AI screening tool is sensitivity (recall), not accuracy. In systematic review screening, a false negative (missing a relevant study) is far more damaging than a false positive (including an irrelevant study that gets caught at full-text stage). According to a validation study across 14 systematic reviews, AI-assisted screening achieved a pooled sensitivity of 97.3% with a median workload reduction of 63%.[7]
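To see why accuracy is the wrong yardstick, consider a corpus where only a small fraction of records are relevant. A short sketch with hypothetical confusion-matrix counts shows a tool that looks excellent on accuracy while missing a quarter of the relevant studies:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy and sensitivity from a screening confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall on truly relevant records
    }

# 2,000 records, 40 truly relevant; the tool finds 30 and misses 10.
m = metrics(tp=30, fp=60, fn=10, tn=1900)
print(m)  # accuracy 0.965, sensitivity 0.75
```

An accuracy of 96.5 percent sounds impressive, but a sensitivity of 0.75 means ten relevant studies never reach full-text review, which is the failure mode a systematic review cannot afford.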
When is AI screening appropriate? AI screening works best for reviews with 500 or more records, clear inclusion criteria, and a standard biomedical vocabulary. For very small reviews (under 200 records) or highly specialised topics with unusual terminology, manual screening may be just as fast. For living reviews or review updates, AI screening is particularly valuable because the model can learn from your previous screening decisions.
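A minimal sketch of the active-learning idea: rank unscreened records by similarity to those already marked include, screen the top of the queue, and re-rank as decisions accumulate. Token overlap stands in for a trained model here purely for illustration; real tools use proper classifiers.

```python
from collections import Counter

def tokenize(text):
    return set(text.lower().split())

def rank_by_relevance(unscreened, included):
    """Order unscreened records so the most promising come first,
    scored by word overlap with already-included records."""
    vocab = Counter()
    for rec in included:
        vocab.update(tokenize(rec))
    return sorted(
        unscreened,
        key=lambda rec: sum(vocab[w] for w in tokenize(rec)),
        reverse=True,
    )

# Hypothetical records, matching the PICO example earlier in this article.
included = ["SGLT2 inhibitors lower HbA1c in type 2 diabetes RCT"]
unscreened = [
    "statin therapy in cardiovascular disease",
    "empagliflozin versus placebo HbA1c type 2 diabetes trial",
]
print(rank_by_relevance(unscreened, included)[0])
```

After each screened batch, the newly included records are appended to `included` and the queue is re-ranked, so relevant studies keep surfacing earlier and the tail of the queue becomes increasingly safe to deprioritise.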
Documenting Screening for PRISMA Compliance
The PRISMA 2020 flow diagram is the standard way to report study screening results. It tracks the flow of records through three phases: identification (records found and deduplicated), screening (records assessed at title/abstract level, then full texts assessed for eligibility), and inclusion (studies in the review).[6]
Your flow diagram must report:
- Total records identified from each database (before and after deduplication)
- Records excluded at title and abstract screening
- Full-text articles assessed for eligibility
- Full-text articles excluded, with reasons grouped by category
- Studies included in the review and, if applicable, in the quantitative synthesis (meta-analysis)
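These counts must be internally consistent: each phase is the previous phase minus its exclusions, and reviewers (and peer reviewers) will check the subtraction. A small sketch with hypothetical numbers:

```python
def prisma_flow(identified, duplicates, excluded_title_abstract,
                excluded_full_text):
    """Derive the downstream PRISMA flow counts from the exclusions."""
    screened = identified - duplicates
    full_text = screened - excluded_title_abstract
    included = full_text - excluded_full_text
    return {"screened": screened, "full_text": full_text,
            "included": included}

flow = prisma_flow(identified=3450, duplicates=450,
                   excluded_title_abstract=2760, excluded_full_text=190)
print(flow)  # {'screened': 3000, 'full_text': 240, 'included': 50}
```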
If you used automation tools during screening, PRISMA 2020 recommends reporting this in the methods section, including which tool was used, what role it played, and how human oversight was maintained. For a detailed walkthrough of the flow diagram, see our PRISMA flow diagram guide.
Common Screening Mistakes and How to Avoid Them
After working with hundreds of systematic review teams, certain screening errors appear repeatedly:
- Vague criteria. "Studies about diabetes" is not a criterion. "Adults aged 18+ diagnosed with Type 2 diabetes mellitus" is. Specificity prevents disagreements.
- Screening fatigue. Error rates increase significantly after 90 minutes of continuous screening. Take breaks. Screen in batches.
- Not pilot testing. Jumping straight into full screening without a pilot round almost always leads to re-screening later.
- Single-reviewer screening. It is tempting when pressed for time, but single-reviewer screening consistently misses 5 to 10 percent of relevant studies compared to dual screening.
- Inconsistent exclusion reasons. Use a standardised set of exclusion codes, not free text. "Wrong population" applied consistently is better than twenty variations of the same idea.
For a broader overview of the systematic review process that screening fits into, see our systematic review methods guide. For understanding how your screening results feed into the analysis phase, our systematic review vs meta-analysis explainer covers the downstream steps.
Frequently Asked Questions
How many reviewers should screen studies for a systematic review?
At least two independent reviewers should screen every record at both the title/abstract and full-text stages. The Cochrane Handbook recommends dual independent screening as the gold standard because single-reviewer screening misses 5 to 10 percent of relevant studies. A third reviewer should be available to resolve conflicts when the two primary reviewers disagree.
What is an acceptable inter-rater agreement for screening?
A Cohen's kappa of 0.61 or higher indicates substantial agreement and is generally considered acceptable for systematic review screening. Kappa values above 0.80 indicate almost perfect agreement. If your kappa falls below 0.60 during pilot screening, recalibrate your criteria and re-train reviewers before proceeding with the full dataset.
Can AI replace human reviewers in systematic review screening?
No. Current AI screening tools assist human reviewers by prioritising records, flagging likely includes and excludes, and reducing workload by 50 to 70 percent. However, human oversight remains essential. AI tools achieve high sensitivity (95 to 98 percent) but not perfection, and systematic reviews require a level of methodological transparency that demands human accountability for every inclusion decision.
How long does screening take for a typical systematic review?
Screening duration depends on the number of records and the number of reviewers. A review with 3,000 records typically requires 50 to 200 hours for title and abstract screening (across two reviewers) and 15 to 75 hours for full-text screening. AI-assisted screening can reduce the title and abstract phase to 10 to 40 hours by automating triage of clearly irrelevant records.
What exclusion reasons should I record during full-text screening?
PRISMA 2020 requires exclusion reasons at the full-text stage, grouped by category. Common categories include wrong population, wrong intervention, wrong comparator, wrong outcome, wrong study design, duplicate publication, full text unavailable, and conference abstract only. Use standardised codes rather than free text to ensure consistency across reviewers.
If study screening is the bottleneck in your systematic review workflow, AI-assisted tools can recover hundreds of hours without sacrificing the sensitivity your review depends on. At Systematicly, screening is built into a platform that carries your data from search through to publication-ready results. Start a free project at research.systematicly.com to try it with your own data.
Summary
Screening studies for a systematic review involves two stages: a rapid title and abstract pass to remove clearly irrelevant records, followed by detailed full-text assessment against your eligibility criteria. Dual independent screening with conflict resolution remains the gold standard. Clear, PICO-derived criteria, pilot testing, and consistent documentation are the foundations of reliable screening. AI-assisted tools like Systematicly can reduce screening workload by over 60 percent while maintaining sensitivity above 97 percent, and automatically generate your PRISMA flow diagram as you work.
Cut your screening time without cutting corners. Systematicly's AI screens your records with 97.3% sensitivity, flags conflicts automatically, and builds your PRISMA diagram in real time. Start your free project and see how many hours you save on your next review.
References
- Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545.
- Higgins JPT, Thomas J, Chandler J, et al., editors. Cochrane Handbook for Systematic Reviews of Interventions, Version 6.4. Cochrane. 2023.
- Lefebvre C, Glanville J, Briscoe S, et al. Searching for and selecting studies. In: Higgins JPT, Thomas J, editors. Cochrane Handbook for Systematic Reviews of Interventions. 2nd ed. Wiley; 2019:67-107.
- Waffenschmidt S, Knelangen M, Sieben W, Bühn S, Pieper D. Single screening versus conventional double screening for study selection in systematic reviews. BMC Med Res Methodol. 2019;19:132.
- McHugh ML. Interrater reliability: The kappa statistic. Biochem Med. 2012;22(3):276-282.
- Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.
- Bishop M, Chen S. AI-assisted abstract screening achieves 97.3% sensitivity across 14 systematic reviews: A validation study. Systematicly Research Lab. 2026.