Manuscript screening has become more complex than simple text comparison. Research papers, academic submissions, and editorial documents often include not only paragraphs, but also figures, charts, tables, captions, diagrams, equations, screenshots, and supplementary materials.
A text-only plagiarism checker can find copied phrases or paraphrased passages, but it may miss reused visual evidence or copied data structures. A manuscript can look original in the body text while still reusing the same figure, table, chart, dataset, or experimental image from another source.
Multimodal AI helps solve this problem by analyzing different parts of a manuscript together. It can compare text, figures, tables, captions, labels, layout, and metadata. Instead of asking only whether the words are original, multimodal screening asks whether the full research evidence has been reused in a way that needs human review.
Why Text-Only Screening Is No Longer Enough
Traditional plagiarism detection focuses mainly on written text. This is still important, but many academic and scientific claims appear outside the main paragraphs. A figure may show the central result of an experiment. A table may contain the main dataset. A chart may present the strongest evidence for a conclusion.
When a screening system ignores figures and tables, it sees only part of the manuscript. This creates blind spots. A paper may pass a text similarity check while still containing reused visual material or copied numerical data.
Multimodal AI reduces this risk by treating the manuscript as a structured document. It separates and analyzes each modality, then connects the evidence into one reviewable report.
What Counts as Reuse in a Multimodal Manuscript?
Reuse can appear in many forms. Some cases are obvious, such as direct text copying or duplicated images. Other cases are harder to detect because the reused material has been transformed, reformatted, cropped, translated, relabeled, or partially rewritten.
| Modality | Common Reuse Type | Why It Matters |
| Text | Copied passages, paraphrasing, patchwriting, translated reuse | Text reuse can misrepresent originality, authorship, or source attribution. |
| Figures | Duplicated panels, cropped images, relabeled diagrams, reused charts | Figures often carry core research evidence, and visual duplication is easy to overlook. |
| Tables | Copied values, reused structure, renamed headers, reordered rows | Tables can reveal reused data even when surrounding text is rewritten. |
| Captions | Copied figure descriptions, paraphrased table notes, repeated labels | Captions connect visual evidence to the manuscript’s claims. |
| Layout | Repeated figure structure, similar table placement, reused document patterns | Layout can help connect related evidence across different document formats. |
Text Reuse
Text reuse is the most familiar target of plagiarism detection. It includes exact copying, near-copying, paraphrasing, translated reuse, and repeated sections from earlier work.
In academic writing, some repeated language may be acceptable. Methods sections, standard definitions, citations, and technical terminology can naturally overlap. This is why a strong screening system should not only find matches, but also help reviewers understand context.
Common Forms of Text Reuse
- Direct copying without attribution.
- Patchwriting from multiple sources.
- Paraphrased passages with the same structure.
- Translated reuse from another language.
- Repeated methods text without clear disclosure.
- Self-plagiarism from earlier publications.
- Citation-based concealment, where a source is cited but far more of its wording is reused than the citation suggests.
Figure Reuse
Figure reuse can be harder to detect than text reuse. Authors may crop an image, rotate it, adjust contrast, change labels, rearrange panels, or insert the same visual material into a new context.
In scientific manuscripts, reused figures can be especially serious because figures often represent experimental evidence. A duplicated microscopy image, blot panel, chart, or diagram may affect how readers interpret the study.
Examples of Figure Reuse
- Duplicated image panels in different papers.
- Cropped or resized scientific images.
- Rotated, flipped, or contrast-adjusted visuals.
- Relabeled plots with similar data patterns.
- Reused diagrams with changed terms.
- Repeated screenshots or interface images.
- Composite figures with rearranged panels.
| Figure Type | Reuse Risk | Detection Challenge |
| Microscopy image | Duplicated visual evidence | Cropping, resizing, and contrast changes can hide reuse. |
| Western blot | Repeated or relabeled panels | Partial reuse may appear only in small regions. |
| Chart or plot | Same data redrawn in a new format | The visual style may change while values remain similar. |
| Diagram | Conceptual or structural reuse | Labels may be rewritten while the logic stays the same. |
| Screenshot | Interface or dataset reuse | Compression and resizing can affect image comparison. |
| Flowchart | Repeated process structure | Layout may be modified while the sequence remains similar. |
Table Reuse
Tables are another important source of reuse evidence. A table can be copied even when the surrounding explanation is rewritten. Authors may rename columns, reorder rows, round values, change units, or reformat the table.
Table reuse detection must therefore look beyond visible text. It should compare structure, values, headers, relationships, and data patterns.
What Table Reuse Can Look Like
- The same rows and columns appear with minor wording changes.
- Headers are renamed but the table structure stays the same.
- Values are rounded or converted into percentages.
- Rows are reordered to make copying less obvious.
- The same dataset appears as a chart in another paper.
- Statistical patterns match across different documents.
| Table Signal | What It Shows | Reviewer Value |
| Cell value match | Exact or near-exact numeric overlap | Helps detect copied data. |
| Header similarity | Similar variable names or categories | Helps connect related tables. |
| Row and column structure | Repeated organization of data | Shows reuse even after visual reformatting. |
| Numeric pattern | Similar distribution, ranking, or ratio | Helps find transformed data reuse. |
| Caption link | Connection between table title and content | Improves context for human review. |
Why Multimodal AI Is Needed
Manuscripts are mixed-format documents. A single PDF may contain body text, tables, captions, figures, references, equations, footnotes, metadata, and supplementary links. Each part can carry meaning.
A text-only system may miss reuse because one modality can hide another. The text may be rewritten while the table stays the same. A figure may be reused while the caption is changed. A chart may be redrawn from the same dataset. A caption may be copied while the image is slightly edited.
Multimodal AI is designed to connect these signals. It can compare text with text, images with images, tables with tables, captions with captions, and even tables with charts.
Core Architecture of a Multimodal Reuse Detection System
A multimodal screening system should first parse the manuscript into meaningful components. Then each component should go through a specialized pipeline. After that, the system should combine the signals into a report that reviewers can inspect.
Suggested Architecture Flow
Manuscript
↓
Document Parser
├── Text Pipeline
├── Figure Pipeline
├── Table Pipeline
├── Caption Pipeline
└── Metadata Pipeline
↓
Multimodal Similarity Fusion
↓
Reviewer Report
This architecture keeps each detection task focused. Text comparison needs different methods than figure comparison. Table analysis needs different logic than caption analysis. The fusion layer connects all signals and helps reviewers see the full picture.
| Pipeline | Main Task | Output for Review |
| Text pipeline | Find copied, paraphrased, or translated passages | Highlighted text matches and source links |
| Figure pipeline | Detect visual similarity and duplicated panels | Side-by-side figure comparison |
| Table pipeline | Compare structure, headers, and numeric values | Aligned rows, columns, and value matches |
| Caption pipeline | Analyze figure and table descriptions | Caption similarity and context notes |
| Metadata pipeline | Track file, author, source, and submission data | Audit-ready context for reviewers |
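As a rough illustration of the flow above, the sketch below routes parsed components to per-modality pipelines and groups the findings by modality. The `Component` and `Finding` types, the `PIPELINES` registry, and the two stand-in pipeline functions are hypothetical names introduced for this example. A real system would plug actual detectors into the same shape.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    kind: str     # "text", "figure", "table", "caption", or "metadata"
    content: str  # placeholder payload; a real parser would carry richer objects


@dataclass
class Finding:
    kind: str
    note: str


# Hypothetical stand-in pipelines; each would wrap a real detector.
def text_pipeline(c: Component) -> Finding:
    return Finding("text", f"checked {len(c.content.split())} words")


def table_pipeline(c: Component) -> Finding:
    return Finding("table", f"checked {len(c.content.splitlines())} table rows")


PIPELINES: dict[str, Callable[[Component], Finding]] = {
    "text": text_pipeline,
    "table": table_pipeline,
}


def screen(components: list[Component]) -> dict[str, list[Finding]]:
    """Route each parsed component to its pipeline and group findings by modality."""
    report: dict[str, list[Finding]] = {}
    for c in components:
        pipeline = PIPELINES.get(c.kind)
        if pipeline is None:
            continue  # no pipeline registered for this modality in the sketch
        report.setdefault(c.kind, []).append(pipeline(c))
    return report


parsed = [
    Component("text", "We measured growth over ten days."),
    Component("table", "day,growth\n1,0.2\n2,0.4"),
    Component("figure", "<image bytes>"),
]
report = screen(parsed)
```

Keeping the registry separate from the routing function means a figure or caption pipeline can be added later without changing `screen`.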
Document Parsing and Layout Analysis
Before multimodal AI can compare anything, it must understand the document structure. A manuscript is not just a stream of words. It has sections, captions, tables, figures, references, and visual regions.
Layout analysis helps the system identify where each element starts and ends. It can separate body text from captions, tables from paragraphs, and figure panels from surrounding labels.
Elements to Extract
- Title and abstract.
- Section headings.
- Paragraphs and passages.
- Tables and table captions.
- Figures and figure captions.
- Composite figure panels.
- References and citations.
- Footnotes and endnotes.
- Supplementary materials.
- Embedded labels and annotations.
Strong layout analysis improves every later step. If the system mistakes a caption for body text or fails to split a composite figure into panels, detection quality can drop.
Text Reuse Detection Layer
The text layer checks written content against existing sources. It can use exact matching, fuzzy matching, semantic similarity, and paraphrase detection.
Exact matching helps detect direct copying. Semantic similarity helps detect rewritten passages that preserve the same meaning. Passage alignment helps reviewers see which parts of the manuscript match which parts of the source.
Core Text Detection Methods
- Fingerprinting for fast exact and near-exact matching.
- N-gram comparison for phrase-level overlap.
- Fuzzy matching for lightly edited passages.
- Sentence embeddings for semantic similarity.
- Transformer-based models for paraphrase detection.
- Source retrieval for finding possible original documents.
- Passage alignment for reviewer-friendly reports.
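The n-gram comparison in the list above can be sketched with word shingles and Jaccard overlap. This is a minimal example under simple assumptions (lowercased alphanumeric tokens, three-word shingles). The function names and sample sentences are invented for illustration.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of n-word shingles that two passages have in common."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


source = "The samples were incubated at 37 degrees for two hours before imaging."
copied = "The samples were incubated at 37 degrees for two hours prior to imaging."
unrelated = "We surveyed four hundred teachers about classroom technology use."

print(jaccard(source, copied))     # high overlap: only the ending was edited
print(jaccard(source, unrelated))  # no shared three-word sequences
```

Shingle overlap only catches exact and lightly edited copying. A paraphrase that keeps the meaning but changes most of the wording would score low here, which is why the list also includes embedding-based methods.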
Figure Reuse Detection Layer
The figure layer compares visual content. It must be robust to editing, because images are often altered before they are reused. Pixel-by-pixel matching is not enough.
A strong figure pipeline can compare entire images, figure panels, local regions, and visual embeddings. It can also detect common transformations such as cropping, resizing, rotation, flipping, contrast adjustment, or compression.
Figure Detection Methods
- Perceptual hashing for fast visual similarity checks.
- Feature matching for local image regions.
- Image embeddings for semantic visual comparison.
- Siamese-style comparison for duplicate detection.
- Nearest-neighbor search for large visual indexes.
- Panel segmentation for composite figures.
- OCR for labels, axis text, legends, and annotations.
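To make the perceptual hashing idea concrete, here is a minimal average-hash sketch on a toy grayscale grid. It assumes the image has already been downscaled to a small grid (a real pipeline would do that with an image library), and the function names are chosen for this example only.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, set when brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Toy 4x4 "image": a bright square in the top-left corner.
original = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
brighter = [[min(255, p + 30) for p in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

print(hamming(average_hash(original), average_hash(brighter)))  # 0
print(hamming(average_hash(original), average_hash(mirrored)))  # 8
```

The brightness shift leaves the hash unchanged, while the mirrored copy looks completely different to this hash. That is one reason a figure pipeline would also hash flipped and rotated variants, or fall back to local feature matching.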
Figure Panel Segmentation
Many scientific figures are composite figures. They contain several panels, often labeled A, B, C, or D. Comparing the figure as a whole may miss duplication if only one panel was reused.
Panel segmentation solves this problem by splitting a composite figure into smaller parts. Each panel can then be compared separately against source images.
Why Panel-Level Detection Matters
- One reused panel may be hidden inside a larger figure.
- Panel labels may be changed or removed.
- Panels may be rearranged in a new order.
- Only a cropped region may be duplicated.
- Reviewer reports become more precise.
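One simple way to picture panel segmentation is to split a figure wherever a full column of background pixels separates two panels. The sketch below does this on a tiny grayscale grid. It assumes a clean white gutter and side-by-side panels, which real figures often violate, and `split_panels` is a name invented here.

```python
def split_panels(image: list[list[int]], background: int = 255) -> list[tuple[int, int]]:
    """Return (start_col, end_col) ranges, end exclusive, for panels separated
    by columns that contain only background pixels."""
    width = len(image[0])
    is_gap = [all(row[x] == background for row in image) for x in range(width)]
    panels: list[tuple[int, int]] = []
    start = None
    for x, gap in enumerate(is_gap):
        if not gap and start is None:
            start = x                  # a panel begins at the first non-gap column
        elif gap and start is not None:
            panels.append((start, x))  # the panel ends where the gutter begins
            start = None
    if start is not None:
        panels.append((start, width))
    return panels


# Two panels separated by a two-column white gutter.
figure = [
    [0, 0, 255, 255, 50, 50, 50],
    [0, 0, 255, 255, 50, 50, 50],
]
ranges = split_panels(figure)
crops = [[row[a:b] for row in figure] for a, b in ranges]
print(ranges)  # [(0, 2), (4, 7)]
```

Each crop can then be hashed or embedded on its own, so a single reused panel is compared directly instead of being diluted by the rest of the composite.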
Table Reuse Detection Layer
The table layer extracts and compares tabular data. This can be difficult in PDFs because table lines, merged cells, footnotes, and formatting styles vary.
A table reuse detector should understand both structure and values. Two tables may look different but still contain the same data. Another table may preserve the same logic while changing headers or row order.
Important Table Analysis Tasks
- Detect table boundaries.
- Identify rows and columns.
- Recognize headers and subheaders.
- Handle merged cells.
- Extract numeric values.
- Normalize units and percentages.
- Compare row and column relationships.
- Connect tables with captions and text mentions.
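The normalization and comparison tasks above can be sketched as follows. The example converts percentages to fractions, rounds values, and compares rows as an unordered set, so renamed headers and reordered rows do not hide the overlap. The function names, the rounding precision, and the sample tables are assumptions made for illustration.

```python
def parse_cell(cell: str) -> float:
    """Convert a cell such as '12.8%' or '0.13' to a plain fraction."""
    cell = cell.strip()
    if cell.endswith("%"):
        return float(cell[:-1]) / 100
    return float(cell)


def row_key(row: list[str], places: int = 2) -> tuple[float, ...]:
    """Normalize and round a row so small formatting differences disappear."""
    return tuple(round(parse_cell(c), places) for c in row)


def row_overlap(table_a: list[list[str]], table_b: list[list[str]], places: int = 2) -> float:
    """Fraction of rows in table_a whose rounded values also occur in table_b,
    ignoring row order. Headers are assumed to be stripped already."""
    if not table_a:
        return 0.0
    keys_b = {row_key(r, places) for r in table_b}
    hits = sum(row_key(r, places) in keys_b for r in table_a)
    return hits / len(table_a)


# Same data: reordered rows, percentages instead of fractions, extra precision.
submitted = [["0.13", "0.40"], ["0.33", "0.91"], ["0.50", "0.77"]]
source = [["50%", "77.2%"], ["12.8%", "40.1%"], ["33.3%", "90.8%"]]
unrelated = [["0.90", "0.10"]]

print(row_overlap(submitted, source))  # 1.0
print(row_overlap(unrelated, source))  # 0.0
```

Rounding to a fixed number of places can split two nearly equal values that fall on either side of a rounding boundary, so a production comparison would more likely use a tolerance. Column order is also assumed to match in this sketch.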
Caption and Label Analysis
Captions are essential because they explain what a figure or table means. Even if visual material is modified, the caption may preserve similar wording, terminology, or experimental context.
Labels also matter. Axis labels, legends, annotations, panel names, and table headers can reveal reuse that image comparison alone may not catch.
Caption and Label Signals
| Signal | Example | Why It Helps |
| Copied caption | Same figure description appears in another manuscript | Connects visual reuse to source context. |
| Paraphrased caption | Same experiment described with changed wording | Helps detect concealed reuse. |
| Axis label match | Same chart variables and measurement units | Supports chart or dataset comparison. |
| Panel label match | Same subfigure labels or descriptions | Helps align composite figures. |
| Table header match | Similar column names or statistical categories | Supports table structure comparison. |
OCR and Layout-Aware AI
OCR helps extract text from images, scanned pages, chart labels, table cells, and figure annotations. However, OCR alone is not enough.
A manuscript screening system also needs layout awareness. It must understand where text appears, what role it plays, and how it connects to nearby visual elements.
For example, the same word may appear in a paragraph, a caption, a table header, or an axis label. Each location has a different meaning. Layout-aware AI helps preserve that context.
How OCR and Layout AI Work Together
| Component | Role | Example Output |
| OCR | Extract readable text from visual regions | Axis labels, table cells, panel annotations |
| Layout analysis | Detect document regions and structure | Caption, table, figure, paragraph, reference |
| Visual embeddings | Represent image content for comparison | Similarity between figure panels |
| Text embeddings | Represent meaning of captions and passages | Semantic caption or paragraph match |
| Fusion layer | Combine evidence across modalities | Unified reviewer report |
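A small example of layout awareness is linking each caption to the figure or table it describes. The sketch below pairs captions with the nearest region by vertical gap. It assumes a single-column page with y coordinates increasing downward, and the `Region` type and sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str      # "figure", "table", or "caption"
    top: float     # page coordinates, y increasing downward (assumption)
    bottom: float
    text: str = ""


def link_captions(regions: list[Region]) -> list[tuple[str, str]]:
    """Pair each caption with the kind of the closest figure or table."""
    targets = [r for r in regions if r.kind in ("figure", "table")]
    links: list[tuple[str, str]] = []
    if not targets:
        return links
    for cap in (r for r in regions if r.kind == "caption"):
        def gap(t: Region) -> float:
            # Smallest vertical distance between caption edge and target edge.
            return min(abs(cap.top - t.bottom), abs(t.top - cap.bottom))
        links.append((cap.text, min(targets, key=gap).kind))
    return links


page = [
    Region("figure", 100, 300),
    Region("caption", 305, 320, "Figure 1. Growth curves"),
    Region("caption", 400, 415, "Table 1. Sample sizes"),
    Region("table", 420, 600),
]
links = link_captions(page)
print(links)
```

Here the first caption sits just below the figure and the second just above the table, so each is linked to its own neighbor. Multi-column layouts would need horizontal position as well.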
Cross-Modal Reuse Scenarios
Some reuse patterns cross the boundary between modalities. This is where multimodal AI becomes especially useful.
A table in one manuscript may become a chart in another. A figure may be reused while the caption is rewritten. A text paragraph may describe the same dataset that appears visually elsewhere. A supplementary file may contain reused evidence that does not appear in the main manuscript.
Examples of Cross-Modal Reuse
| Scenario | What Changes | What AI Should Compare |
| Table-to-chart reuse | Data values are redrawn as a graph | Numeric values, chart labels, and data patterns |
| Figure-to-caption mismatch | Image is reused but caption is rewritten | Visual similarity and caption meaning |
| Text rewritten, evidence reused | Paragraphs change but table or figure stays similar | Visual and numeric evidence, not only text |
| Supplementary reuse | Reused material appears outside the main paper | Main manuscript and supplementary files together |
| Translated labels | Figure labels change language | OCR, translation-aware text similarity, and visual match |
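For the table-to-chart scenario, one hedged way to compare a table column with values read off a chart is a scale-invariant measure such as Pearson correlation, since bar heights in pixels will not equal the original numbers. The sketch below assumes the chart values have already been extracted. The sample numbers are invented.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; unaffected by rescaling or shifting either series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no pattern to compare
    return cov / sqrt(vx * vy)


table_values = [12.0, 18.5, 25.0, 31.0]        # column from the source table
chart_heights = [48.0, 74.0, 100.0, 124.0]     # bar heights in pixels, same data redrawn
other_heights = [124.0, 48.0, 100.0, 74.0]     # a chart with a different pattern

print(pearson(table_values, chart_heights))
print(pearson(table_values, other_heights))
```

A correlation near 1 on a handful of points is weak evidence on its own, because many unrelated series simply increase together. It is best treated as a prompt to extract and compare the actual values.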
Similarity Fusion: Combining Signals
Multimodal screening should not collapse all evidence into a single opaque score. A single score may be useful as a summary, but reviewers need to see the evidence behind it.
The system can calculate separate scores for text, figures, tables, captions, and metadata. Then it can combine them into a risk signal or review priority.
How Combined Signals Can Be Interpreted
| Signal Combination | Possible Meaning | Reviewer Action |
| Text match + figure match | Strong reuse evidence across written and visual content | Compare source and manuscript together. |
| Figure match only | Possible reused visual evidence | Inspect panels, labels, and source context. |
| Table values match | Possible data reuse | Check dataset attribution and permission. |
| Caption + image match | Likely reused figure context | Verify source, citation, and reuse policy. |
| Text rewritten + same table | Possible concealed reuse | Review data origin and related publications. |
| Same chart structure | Possible redrawn data reuse | Extract and compare values. |
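The combinations above can be sketched as a fusion step that keeps per-modality scores visible next to the summary. The weights, the 0.8 flag threshold, and the priority rules below are illustrative assumptions, not recommended values.

```python
# Illustrative weights only; a real system would tune or learn these.
WEIGHTS = {"text": 0.3, "figure": 0.3, "table": 0.25, "caption": 0.15}
FLAG_THRESHOLD = 0.8  # assumed cut-off for a strong single-modality match


def fuse(scores: dict[str, float]) -> dict:
    """Combine per-modality similarity scores (0 to 1) into a review priority
    while keeping each modality's score in the output."""
    present = {m: s for m, s in scores.items() if m in WEIGHTS}
    total_weight = sum(WEIGHTS[m] for m in present)
    combined = (
        sum(WEIGHTS[m] * s for m, s in present.items()) / total_weight
        if total_weight else 0.0
    )
    flagged = sorted(m for m, s in present.items() if s >= FLAG_THRESHOLD)
    # Priority follows the strongest evidence, not the average.
    priority = "high" if len(flagged) >= 2 else "medium" if flagged else "low"
    return {
        "combined": round(combined, 3),
        "per_modality": present,
        "flagged": flagged,
        "priority": priority,
    }


rewritten_text_same_table = fuse({"text": 0.2, "table": 0.95})
text_and_figure = fuse({"text": 0.9, "figure": 0.85})
print(rewritten_text_same_table["priority"], rewritten_text_same_table["flagged"])
print(text_and_figure["priority"], text_and_figure["flagged"])
```

Basing priority on flagged modalities prevents the "text rewritten, same table" case from being averaged away by a low text score.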
Reviewer Workflow and Explainability
Multimodal AI should support human review, not replace it. The system should make evidence easier to inspect. It should show what matched, where it matched, how strong the signal is, and which source should be checked.
A good reviewer report should group related findings. If the same source appears in text, figures, and captions, the report should connect those signals instead of showing them as separate random alerts.
Useful Report Elements
- Document-level summary of reuse signals.
- Separate text, figure, table, and caption findings.
- Side-by-side comparison with matched source material.
- Panel-level figure comparison.
- Aligned table rows, columns, and values.
- Caption and label similarity explanation.
- Severity labels and confidence indicators.
- Filters for weak or common matches.
- Reviewer notes and decision fields.
- Exportable report for audit or editorial review.
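Grouping related findings by source, as described above, might look like the following sketch. The finding fields and the source identifiers are placeholders made up for this example.

```python
from collections import defaultdict

# Hypothetical findings as a detector might emit them.
findings = [
    {"source": "source-A", "modality": "text", "location": "Section 3, paragraph 2"},
    {"source": "source-B", "modality": "table", "location": "Table 2"},
    {"source": "source-A", "modality": "figure", "location": "Figure 4, panel B"},
    {"source": "source-A", "modality": "caption", "location": "Figure 4 caption"},
]


def group_by_source(items: list[dict]) -> list[tuple[str, list[dict]]]:
    """Group findings by matched source; sources hit in more modalities come first."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        grouped[item["source"]].append(item)
    return sorted(
        grouped.items(),
        key=lambda kv: (-len({f["modality"] for f in kv[1]}), kv[0]),
    )


for source, group in group_by_source(findings):
    modalities = ", ".join(sorted({f["modality"] for f in group}))
    print(f"{source}: {len(group)} finding(s) across {modalities}")
```

Sorting by the number of distinct modalities puts the source with text, figure, and caption matches at the top, where a reviewer is most likely to start.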
False Positives in Multimodal Detection
Not every match is misconduct. Multimodal screening can produce false positives if it does not understand context.
Some figures are standard diagrams. Some tables come from public datasets. Some methods sections contain expected wording. Some templates are reused legitimately. Some charts may look similar because they use the same official statistics.
| False Positive Source | Why It Happens | How to Reduce It |
| Standard methods text | Researchers describe the same procedure | Use section-aware weighting and citation context. |
| Public datasets | Many papers use the same official data | Check attribution and dataset source. |
| Generic diagrams | Common concepts may use similar visuals | Compare labels, captions, and originality context. |
| Journal templates | Formatting and structure may repeat | Ignore known template regions. |
| Reference lists | Citation lists in the same field often overlap | Separate references from research evidence. |
| Stock visuals or icons | Reusable visual assets may appear in many documents | Use source type filters and reviewer labels. |
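Section-aware weighting, mentioned in the table above, can be sketched as a simple down-weighting of matches by where they occur and whether the source is cited. The weights and the citation discount here are invented for illustration and would need calibration against reviewer decisions.

```python
# Assumed weights: matches in references are ignored, methods text is discounted.
SECTION_WEIGHTS = {
    "methods": 0.4,
    "references": 0.0,
    "results": 1.0,
    "discussion": 1.0,
}
CITED_DISCOUNT = 0.5  # assumed reduction when the matched source is cited nearby


def adjusted_score(similarity: float, section: str, cited: bool) -> float:
    """Scale a raw similarity score by section context and citation status."""
    weight = SECTION_WEIGHTS.get(section, 1.0)  # unknown sections keep full weight
    if cited:
        weight *= CITED_DISCOUNT
    return similarity * weight


print(adjusted_score(0.9, "results", cited=False))
print(adjusted_score(0.9, "methods", cited=True))
print(adjusted_score(0.9, "references", cited=False))
```

The raw score should still be kept in the report, so a reviewer can see that a match was discounted and not silently removed.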
Evaluation Metrics for Multimodal Reuse Detection
Multimodal AI should be evaluated on each modality, not only as a whole. Text detection, image detection, table extraction, and reviewer usability all need separate quality checks.
A system may perform well on text but poorly on figures. Another system may detect images accurately but produce a confusing report. Technical accuracy and reviewer usefulness should be measured together.
| Evaluation Area | Metric or Question | Why It Matters |
| Text detection | Precision, recall, F1, passage-level overlap | Measures copied and paraphrased text detection. |
| Image detection | Duplicate accuracy, transformation robustness, panel recall | Measures figure reuse detection quality. |
| Table extraction | Cell accuracy, row alignment, header recognition | Measures whether table data was parsed correctly. |
| Table similarity | Numeric similarity, structure match, pattern detection | Measures data reuse detection. |
| Caption analysis | Semantic similarity and source alignment | Connects visual elements with textual context. |
| Reviewer usefulness | Time saved, clarity, confidence, false-positive reduction | Shows whether the report helps real decisions. |
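Precision, recall, and F1 from the table above can be computed the same way for any modality once there is a labeled set of true reuse cases. The sketch below uses made-up item identifiers, and the labels are assumed to come from human review.

```python
def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """Compare flagged items with items confirmed as reuse by reviewers."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical evaluation set: items the system flagged vs. confirmed reuse.
flagged = {"fig2a", "fig3b", "tab1", "fig5"}
confirmed = {"fig2a", "tab1", "tab4"}

precision, recall, f1 = precision_recall_f1(flagged, confirmed)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of the four flags are correct and two of the three confirmed cases are found. Reporting these numbers per modality shows whether weak figure recall, for instance, is hidden behind strong text results.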
Security and Privacy Considerations
Manuscripts are sensitive documents. They may include unpublished research, confidential results, personal data, peer-review material, commercial information, or copyrighted content.
A multimodal screening platform must protect this material during upload, parsing, indexing, comparison, reporting, storage, and deletion.
Important Security Controls
- Encryption in transit.
- Encryption at rest.
- Role-based access control.
- Tenant isolation for institutions and publishers.
- Audit logs for document access.
- Secure deletion rules.
- Retention settings by client or policy.
- Restricted reviewer permissions.
- Protected backup storage.
- Clear policy for model training and data reuse.
Submitted manuscripts should not be reused for model training unless there is clear permission, a legal basis, and a transparent policy. Trust is essential in manuscript screening.
Common Mistakes in Multimodal Screening Design
Treating a PDF as Plain Text Only
Extracting only the text stream discards figures, tables, captions, and layout before analysis begins. The system may miss the most important evidence.
Comparing Full Figures Without Panel Detection
A duplicated panel may be hidden inside a larger composite figure. Panel-level comparison gives more precise results.
Ignoring Table Structure
Tables can be reused after headers are renamed or rows are reordered. Numeric and structural comparison is necessary.
Using One Score for Everything
A single score can hide important details. Reviewers need separate evidence for text, figures, tables, and captions.
No Explanation Layer
A flag without explanation is not enough. Reviewers need to understand why the system marked an item as suspicious.
Best Practices Checklist
- Parse manuscripts into text, figures, tables, captions, and metadata.
- Use separate pipelines for each modality.
- Segment composite figures into individual panels.
- Apply OCR to embedded labels, legends, and annotations.
- Compare tables by values, structure, and relationships.
- Connect captions to figures and tables.
- Use multimodal fusion for stronger evidence.
- Show reviewers side-by-side comparisons with the matched source.
- Separate exact reuse from transformed reuse.
- Control false positives with context filters.
- Keep human review in the loop.
- Protect unpublished manuscripts with strict security controls.
- Evaluate both technical accuracy and reviewer usefulness.
Suggested Multimodal Screening Summary
| Modality | Detection Method | Main Risk Detected | Reviewer Output |
| Text | Fingerprinting, semantic similarity, passage alignment | Copying, paraphrasing, translation | Highlighted passages and source links |
| Figures | Image embeddings, perceptual hashing, panel matching | Visual reuse, manipulation, duplication | Side-by-side figure comparison |
| Tables | Structure and numeric comparison | Data reuse and copied table logic | Aligned rows, columns, and values |
| Captions | Text similarity and OCR | Reused figure or table descriptions | Caption match explanation |
| Layout | Layout-aware parsing | Hidden context and document structure reuse | Section-aware report |
| Metadata | File, source, and submission metadata analysis | Reuse patterns and source connections | Audit-ready evidence |
Conclusion
Text-only plagiarism detection is no longer enough for complex manuscript screening. Modern academic and scientific documents communicate meaning through paragraphs, figures, tables, captions, visual evidence, and structured data.
Multimodal AI helps reviewers see the full document. It can detect copied text, reused figures, duplicated panels, repeated table values, similar captions, and cross-modal reuse patterns. This creates a stronger and more complete review process.
The best multimodal systems do not replace human judgment. They organize evidence, reduce blind spots, and help reviewers decide which similarities matter. A strong screening platform should protect not only the originality of the text, but also the integrity of the data, figures, and research evidence behind it.
FAQ
What is multimodal reuse detection?
Multimodal reuse detection checks similarity across different parts of a manuscript, including text, figures, tables, captions, layout, and metadata.
Why is text-only plagiarism detection limited?
Text-only systems may miss copied figures, reused tables, redrawn charts, duplicated image panels, or reused data structures.
Can AI detect reused scientific figures?
AI can help detect visual duplication and transformed image reuse, including cropping, rotation, resizing, contrast changes, and partial edits. Human review is still needed to confirm context and severity.
How can tables be checked for reuse?
Tables can be compared by cell values, row and column structure, headers, numeric patterns, captions, and links to charts or related text.
Should multimodal AI make the final misconduct decision?
No. Multimodal AI should provide evidence and prioritization. Human reviewers should make the final decision based on context, attribution, permission, and policy.