AI-assisted vs manual systematic review screening: evidence, workflows, and limits

Compare manual dual screening with AI-assisted triage using PRISMA 2020 and Cochrane study-selection guidance, plus honest notes on how Study Screener implements each path.

George Burchell
November 10, 2025

Scope. This article compares title and abstract screening—not full-text PDF review, data extraction, or risk-of-bias assessment. Timing examples below are illustrative unless you measure them in your own pilot.

Disclosure. Study Screener offers manual collaborative screening and a separate AI-assisted batch classifier. We describe both honestly, including where our AI path differs from tools that re-rank records while you screen.

Three ideas that get conflated

Researchers often conflate three ideas:

  1. Manual dual screening — two people independently code include/exclude (or maybe) against a protocol; disagreements are resolved. This remains the norm in Cochrane-style reviews and PRISMA reporting (Cochrane Handbook, Ch. 4; PRISMA 2020).
  2. Prioritization while you screen — software surfaces likely relevant records earlier in the queue; efficiency depends on your stopping rule and how much of the set you still review.
  3. LLM batch classification — each record is scored against written criteria in one pass (as in Study Screener’s AI screening jobs). Throughput can be high, but recall depends on criteria quality, model behavior, and your validation plan—not on efficiency metrics from a different type of tool.

Confusing (2) and (3) is how unsupported “95% sensitivity” claims appear in marketing. The comparison below treats manual dual review and batch AI triage as separate workflows you document under PRISMA 2020 study-selection reporting.

Manual screening: what “good” looks like

Typical workflow (PRISMA-aligned)

  1. Register protocol and eligibility criteria.
  2. Search databases; merge and deduplicate outside the screening tool (Study Screener does not remove duplicates on import today; the feature is planned).
  3. Import a single cleaned RIS (or PubMed .txt) file.
  4. Two reviewers screen independently; use blinding so neither sees the other’s vote until both decide (Cochrane Handbook, Ch. 4).
  5. Resolve conflicts (discussion or third reviewer).
  6. Export decision logs and included RIS for full-text retrieval and PRISMA counts.
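
Because deduplication happens before import (step 2), here is a minimal sketch of one way to do it on a merged RIS file. It is not Study Screener's method. It assumes standard RIS tags (TI/T1 for title, DO for DOI, PY/Y1 for year, ER as record terminator), matches on DOI first and on normalized title plus year otherwise, and ignores continuation lines. The function names and the sample records are hypothetical. Reference managers do fuzzier matching, so treat this as a first pass and log the removed count for your PRISMA diagram.

```python
import re

def parse_ris(text):
    """Split RIS text into records; each record maps a tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z][A-Z0-9])  - ?(.*)$", line)
        if not match:
            continue  # this sketch skips blank lines and continuation text
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    if current:
        records.append(current)  # tolerate a missing final ER line
    return records

def dedupe_key(record):
    """Prefer a normalized DOI; fall back to normalized title plus year."""
    doi = (record.get("DO") or [""])[0].lower()
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi).strip()
    if doi:
        return ("doi", doi)
    title = (record.get("TI") or record.get("T1") or [""])[0].lower()
    title = re.sub(r"[^a-z0-9]+", " ", title).strip()
    if not title:
        return ("unkeyed", id(record))  # never merge records we cannot identify
    year = (record.get("PY") or record.get("Y1") or [""])[0][:4]
    return ("title", title, year)

def dedupe(records):
    """Keep the first record per key; return (unique_records, removed_count)."""
    seen, unique = set(), []
    for record in records:
        key = dedupe_key(record)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique, len(records) - len(unique)

# Fictional records: the first two share a DOI written in two different styles.
SAMPLE = """TY  - JOUR
TI  - Aspirin for primary prevention: a trial
DO  - 10.1000/abc123
PY  - 2020
ER  -
TY  - JOUR
TI  - Aspirin for Primary Prevention: A Trial.
DO  - https://doi.org/10.1000/ABC123
PY  - 2020
ER  -
TY  - JOUR
TI  - Statins in older adults
PY  - 2019
ER  -
"""

unique_records, removed_count = dedupe(parse_ris(SAMPLE))
print(len(unique_records), "unique;", removed_count, "duplicate(s) removed")
```

Keeping the first record per key is an arbitrary choice; if one database gives richer abstracts, sort the merged file so that source comes first.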

Figure: Manual screening in Study Screener (demo). Keyboard shortcuts: I / M / E.

Time and agreement (realistic ranges, not guarantees)

The Cochrane Handbook does not give a single “hours per 10,000 records” figure because speed depends on topic difficulty, abstract quality, and team experience. Useful anchors:

| Source | What it implies |
| --- | --- |
| Practitioner guides (e.g. our beginner screening guide) | Roughly 100–300 title/abstract decisions per hour per experienced reviewer after calibration, not on day one. |
| Dual screening multiplier | Two blinded reviewers ≈ 2× the single-reviewer person-time before conflict resolution. |
| Inter-rater agreement | Report kappa or percent agreement on a pilot sample; Cochrane expects documented resolution of disagreements. Treat any κ value as study-specific, not universal. |
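
For the agreement row, a minimal sketch of percent agreement and Cohen's kappa on a pilot sample may help. It assumes you have each reviewer's decisions as two equal-length lists in the same record order; the function names and the ten pilot decisions are made up for illustration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of records on which both reviewers gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty decision lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each reviewer's marginal rate, summed over labels.
    p_expected = sum(count_a[label] * count_b[label]
                     for label in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Fictional pilot of 10 records.
reviewer_a = ["include", "exclude", "exclude", "maybe", "exclude",
              "include", "exclude", "exclude", "maybe", "exclude"]
reviewer_b = ["include", "exclude", "include", "maybe", "exclude",
              "include", "exclude", "exclude", "exclude", "exclude"]

print(round(percent_agreement(reviewer_a, reviewer_b), 3),
      round(cohens_kappa(reviewer_a, reviewer_b), 3))  # 0.8 0.643
```

The gap between 0.8 and 0.643 is why percent agreement alone can flatter a pilot: when most records are excludes, two reviewers agree often by chance.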

Illustrative arithmetic (not a handbook figure): 10,000 records × ~20 seconds each ≈ 56 hours per reviewer; dual screening ≈ 111 person-hours before conflict resolution. Substitute the median seconds per decision from your own pilot.
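
The same arithmetic as a small sketch, so you can plug in your own numbers. The function name and the pilot timings are hypothetical, and the estimate excludes conflict resolution, breaks, and calibration time.

```python
from statistics import median

def screening_hours(n_records, seconds_per_decision, reviewers=2):
    """Return (hours per reviewer, total person-hours) for title/abstract screening."""
    per_reviewer = n_records * seconds_per_decision / 3600
    return per_reviewer, per_reviewer * reviewers

# The illustrative figures from the text: 10,000 records at ~20 seconds each.
per_reviewer, total = screening_hours(10_000, 20)
print(round(per_reviewer, 1), round(total, 1))  # 55.6 111.1

# With timings from a pilot (fictional values, in seconds per decision).
pilot_seconds = [12, 18, 25, 40, 15]
pilot_per_reviewer, pilot_total = screening_hours(10_000, median(pilot_seconds))
print(round(pilot_per_reviewer, 1), round(pilot_total, 1))  # 50.0 100.0
```

The median is used rather than the mean because a few slow, difficult abstracts can pull the mean up sharply.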

Strengths and failure modes

| Strength | Failure mode |
| --- | --- |
| Transparent to auditors and journals | Fatigue and drift on large sets |
| Handles nuanced criteria when trained | Vague criteria → endless “maybe” |
| No model version to document | Slow at scale without prioritization |

AI-assisted screening: concepts without over-claiming

Some tools help you screen fewer records early by ranking likely includes. Others, including Study Screener’s AI screening, run a batch pass over the full set:

  • You upload RIS/TXT and enter inclusion, exclusion, and research question text.
  • The backend classifies each record (include / exclude / maybe) with confidence and rationale per row.
  • Results export to CSV/RIS; the PRISMA modal can pull job aggregates (duplicate removal in that export is still simplified, so enter dedupe counts manually when needed).

Figure: AI screening workspace. Treat outputs as triage suggestions until you validate against your protocol.

We do not publish independent sensitivity benchmarks for this LLM classifier yet (see our docs performance note). Plan a validation sample (e.g. 200–400 records double-screened manually) before trusting auto-excludes on a final review. PRISMA 2020 expects transparent reporting of how studies were identified and selected—document any AI step, criteria, and human checks in your methods.
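
One way to summarize such a validation sample is sensitivity with a confidence interval. The sketch below treats the human consensus label as the reference and counts an AI "exclude" on a human "include" as a miss. Those choices, the function names, and the 300-record sample are assumptions for illustration, not Study Screener output.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("no observations")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def sensitivity_counts(pairs):
    """pairs: (human_label, ai_label). Returns (human includes the AI kept, all human includes)."""
    ai_labels_for_includes = [ai for human, ai in pairs if human == "include"]
    caught = sum(1 for ai in ai_labels_for_includes if ai != "exclude")
    return caught, len(ai_labels_for_includes)

# Fictional validation sample of 300 double-screened records.
validation = (
    [("include", "include")] * 30
    + [("include", "maybe")] * 6
    + [("include", "exclude")] * 4   # the misses that matter for recall
    + [("exclude", "exclude")] * 250
    + [("exclude", "include")] * 10
)

caught, n_includes = sensitivity_counts(validation)
low, high = wilson_interval(caught, n_includes)
print(f"sensitivity {caught}/{n_includes} = {caught / n_includes:.2f}, "
      f"95% CI {low:.2f} to {high:.2f}")
```

Note that the interval depends on the number of human includes in the sample (40 here), not on the 300 records screened, so a sample with few true includes gives a wide interval however many excludes it contains.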

Side-by-side: manual vs AI-assisted (conceptual)

| Dimension | Manual dual screening (Study Screener) | AI-assisted batch (Study Screener) |
| --- | --- | --- |
| Primary output | Per-reviewer decisions + audit trail | Per-record AI label + confidence |
| Efficiency evidence | Team throughput (pilot timing) | Your validation study; job runtime |
| Recall risk | Reviewer fatigue, criteria drift | Model + prompt + criteria ambiguity |
| PRISMA reporting | Standard dual-review narrative (PRISMA 2020) | Document tool, criteria, validation, overrides |
| In one combined project? | No: manual and AI jobs are separate pipelines today; bridge via exports if needed | Same |

Worked example A — manual dual screening (small RCT review)

Setup (fictional but realistic): 3,200 records after deduplication; two reviewers; blinded screening in Study Screener.

| Step | Action | Note |
| --- | --- | --- |
| 1 | Pilot n = 80 independently | Calculate agreement; rewrite 2–3 criteria bullets |
| 2 | Screen blinded; use Maybe for borderline records | PRISMA “other reasons” are minimized if you resolve maybes before full text |
| 3 | Owner unblinds, or the project auto-unblinds when both reviewers finish | Conflicts visible in library / team views |
| 4 | Third reviewer resolves disagreement cases | Log resolution in decision notes |
| 5 | Export included RIS + decision CSV | Full-text stage happens outside the app |
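
If you want to audit the conflict list outside the app, the comparison is simple to reproduce. This sketch assumes you have loaded each reviewer's decisions into a mapping from record ID to label; the record IDs, labels, and function name are hypothetical, and the layout of the exported CSV is not assumed here.

```python
def find_conflicts(reviewer_a, reviewer_b):
    """Return (conflicts, unmatched): label disagreements, and records only one reviewer coded."""
    shared = reviewer_a.keys() & reviewer_b.keys()
    conflicts = {
        record_id: (reviewer_a[record_id], reviewer_b[record_id])
        for record_id in sorted(shared)
        if reviewer_a[record_id] != reviewer_b[record_id]
    }
    unmatched = sorted(reviewer_a.keys() ^ reviewer_b.keys())
    return conflicts, unmatched

# Fictional decisions for four records.
decisions_a = {"rec-001": "include", "rec-002": "exclude", "rec-003": "maybe", "rec-004": "exclude"}
decisions_b = {"rec-001": "include", "rec-002": "include", "rec-003": "exclude"}

conflicts, unmatched = find_conflicts(decisions_a, decisions_b)
print(conflicts)   # {'rec-002': ('exclude', 'include'), 'rec-003': ('maybe', 'exclude')}
print(unmatched)   # ['rec-004']
```

Records coded by only one reviewer are reported separately because they are incomplete dual screening, not disagreements.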

Reporting sentence (methods): “Two reviewers independently screened titles and abstracts in Study Screener with blinding; conflicts were resolved by [consensus / third reviewer], consistent with Cochrane Handbook, Ch. 4 and PRISMA 2020 study-selection reporting.”

Worked example B — AI triage + manual validation (large scoping review)

Setup: 18,000 records; tight deadline; AI job for first-pass triage only.

| Step | Action | Why |
| --- | --- | --- |
| 1 | Run AI on written PICO-style criteria | Forces explicit rules the model can apply |
| 2 | Export low-confidence and maybe sets | Highest risk of false negatives |
| 3 | Manually dual-screen a random 5–10% of AI excludes | Estimate missed includes; adjust thresholds |
| 4 | Manually screen all AI includes (or dual-screen includes + borderline) | Protects precision for downstream full text |
| 5 | PRISMA diagram | Use diagram builder or AI job PRISMA export; manually enter dedupe counts |
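
Step 3 can be sketched as a seeded random draw plus a rough projection. The function names, record IDs, seed, and counts below are hypothetical. The projection is a point estimate only, so report it with an interval or a stated caveat.

```python
import random

def sample_excludes(exclude_ids, fraction, seed):
    """Draw a reproducible random sample of AI-excluded record IDs."""
    rng = random.Random(seed)  # a fixed seed lets you document and rerun the draw
    k = max(1, round(len(exclude_ids) * fraction))
    return rng.sample(sorted(exclude_ids), k)

def project_missed(missed_in_sample, sample_size, total_excludes):
    """Return (missed-include rate in the sample, projected missed includes overall)."""
    rate = missed_in_sample / sample_size
    return rate, rate * total_excludes

# Fictional job: 15,000 AI excludes, 5% sampled for manual dual screening.
ai_excludes = [f"rec-{i:05d}" for i in range(15_000)]
to_check = sample_excludes(ai_excludes, fraction=0.05, seed=2025)
print(len(to_check))  # 750

# Suppose reviewers find 3 true includes among the sampled excludes.
rate, projected = project_missed(3, len(to_check), len(ai_excludes))
print(f"{rate:.3%} of sampled excludes were includes; about {projected:.0f} projected overall")
```

Sorting the IDs before sampling makes the draw independent of export order, so the same seed gives the same sample on a rerun.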

Figure: Standalone PRISMA 2020 builder—use when AI auto-fill does not match your dedupe log.
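
Before finalizing the diagram, a quick consistency check on the stage counts catches most mismatches between the tool and your dedupe log. The sketch below follows the PRISMA 2020 flow from records identified to reports sought for retrieval and assumes one report per record. The function name, parameter names, and counts are hypothetical.

```python
def check_prisma_counts(identified, duplicates_removed, screened,
                        excluded_at_screening, sought_for_retrieval,
                        removed_other=0):
    """Return a list of inconsistencies between PRISMA 2020 stage counts (empty if none)."""
    problems = []
    expected_screened = identified - duplicates_removed - removed_other
    if expected_screened != screened:
        problems.append(
            f"screened is {screened}, but identified minus removals is {expected_screened}")
    expected_sought = screened - excluded_at_screening
    if expected_sought != sought_for_retrieval:
        problems.append(
            f"sought for retrieval is {sought_for_retrieval}, "
            f"but screened minus excluded is {expected_sought}")
    return problems

# Fictional counts that agree with the dedupe log.
print(check_prisma_counts(21_400, 3_400, 18_000, 17_100, 900))  # []

# The same review with the dedupe count left at a tool default of 0.
for problem in check_prisma_counts(21_400, 0, 18_000, 17_100, 900):
    print(problem)
```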

Reporting sentence (methods): “We used LLM-assisted title/abstract triage (Study Screener, model version [X], date), then dual human review on [includes + sample of excludes], per PRISMA 2020 study-selection guidance. We report [validation sample size] and any protocol deviations.”

Governance checklist before you rely on AI excludes

  • Protocol pre-specifies whether AI is used and for which stage
  • Criteria are binary/testable where possible (see our misconception post)
  • Validation sample size justified (pilot κ + false-negative spot checks)
  • Low-confidence policy written (always human-review?)
  • Exports archived (CSV decision log, model/job IDs if available)
  • PRISMA counts match dedupe log (not only tool defaults)
  • Journal/regulator accepts AI-assisted selection for your field (check target journal)
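
A written low-confidence policy is easiest to audit when it is a rule you can state in a few lines. The sketch below is one possible policy, not Study Screener behavior. It assumes confidence is a number between 0 and 1, and the threshold of 0.9 and the queue names are placeholders to set in your protocol.

```python
def route(ai_label, confidence, threshold=0.9):
    """Decide which human check a record gets after AI triage."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is assumed to lie between 0 and 1")
    if ai_label == "exclude" and confidence >= threshold:
        # Confident excludes are not read one by one; a random sample is dual-screened.
        return "sample_check"
    # Includes, maybes, and low-confidence excludes always go to a human.
    return "human_review"

# Fictional AI outputs: (record ID, label, confidence).
for record_id, label, confidence in [
    ("rec-001", "exclude", 0.97),
    ("rec-002", "exclude", 0.62),
    ("rec-003", "maybe", 0.80),
    ("rec-004", "include", 0.99),
]:
    print(record_id, route(label, confidence))
```

Under this rule no record is excluded on the AI label alone without either a human decision or coverage by the documented spot-check sample.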

When manual-only is enough

  • Small corpora (< ~500 records) where setup time dominates gains.
  • Criteria require full-text judgment at title stage (rare but protocol-specific).
  • You cannot document AI validation to journal standards.

When AI-assisted triage is worth testing

  • > ~2,000 records with stable criteria.
  • You can budget human time to validate excludes.
  • You separate triage (AI) from final inclusion (humans).

Study Screener paths (verified product behavior)

| Path | URL | Notes |
| --- | --- | --- |
| Manual screening | /screening, demo | Blinding on by default; exports RIS/CSV per reviewer decisions |
| AI screening | /ai-screening | Credit-based; separate from manual project records |
| PRISMA | /prisma-diagram | Manual counts; AI modal uses job stats |

Plan limits (current code): the free tier allows 1 owned manual project and the AI plan allows 5 owned projects. Check pricing before promising capacity to a consortium.

References

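  1. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. https://doi.org/10.1136/bmj.n71
  2. Cochrane Handbook for Systematic Reviews of Interventions, Chapter 4: Searching for and selecting studies. https://training.cochrane.org/handbook/current/chapter-04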

Educational note: Always report what you validated on your dataset. PRISMA 2020 and the Cochrane Handbook are the authoritative sources for study-selection methods—not vendor efficiency claims.

