Multimodal AI for Detecting Reuse Across Text, Figures, and Tables

June 30, 2026 12 min read

Manuscript screening has become more complex than simple text comparison. Research papers, academic submissions, and editorial documents often include not only paragraphs, but also figures, charts, tables, captions, diagrams, equations, screenshots, and supplementary materials.

A text-only plagiarism checker can find copied phrases or paraphrased passages, but it may miss reused visual evidence or copied data structures. A manuscript can look original in the body text while still reusing the same figure, table, chart, dataset, or experimental image from another source.

Multimodal AI helps solve this problem by analyzing different parts of a manuscript together. It can compare text, figures, tables, captions, labels, layout, and metadata. Instead of asking only whether the words are original, multimodal screening asks whether the full research evidence has been reused in a way that needs human review.

Why Text-Only Screening Is No Longer Enough

Traditional plagiarism detection focuses mainly on written text. This is still important, but many academic and scientific claims appear outside the main paragraphs. A figure may show the central result of an experiment. A table may contain the main dataset. A chart may present the strongest evidence for a conclusion.

When a screening system ignores figures and tables, it sees only part of the manuscript. This creates blind spots. A paper may pass a text similarity check while still containing reused visual material or copied numerical data.

Multimodal AI reduces this risk by treating the manuscript as a structured document. It separates and analyzes each modality, then connects the evidence into one reviewable report.

What Counts as Reuse in a Multimodal Manuscript?

Reuse can appear in many forms. Some cases are obvious, such as direct text copying or duplicated images. Other cases are harder to detect because the reused material has been transformed, reformatted, cropped, translated, relabeled, or partially rewritten.

Modality	Common Reuse Type	Why It Matters
Text	Copied passages, paraphrasing, patchwriting, translated reuse	Text reuse can misrepresent originality, authorship, or source attribution.
Figures	Duplicated panels, cropped images, relabeled diagrams, reused charts	Figures often carry core research evidence and may hide visual duplication.
Tables	Copied values, reused structure, renamed headers, reordered rows	Tables can reveal reused data even when surrounding text is rewritten.
Captions	Copied figure descriptions, paraphrased table notes, repeated labels	Captions connect visual evidence to the manuscript’s claims.
Layout	Repeated figure structure, similar table placement, reused document patterns	Layout can help connect related evidence across different document formats.

Text Reuse

Text reuse is the most familiar form of plagiarism detection. It includes exact copying, near-copying, paraphrasing, translated reuse, and repeated sections from earlier work.

In academic writing, some repeated language may be acceptable. Methods sections, standard definitions, citations, and technical terminology can naturally overlap. This is why a strong screening system should not only find matches, but also help reviewers understand context.

Common Forms of Text Reuse

Direct copying without attribution.
Patchwriting from multiple sources.
Paraphrased passages with the same structure.
Translated reuse from another language.
Repeated methods text without clear disclosure.
Self-plagiarism from earlier publications.
Citation-based concealment, where a source is cited but its text is reused far more heavily than the citation suggests.

Figure Reuse

Figure reuse can be harder to detect than text reuse. Authors may crop an image, rotate it, adjust contrast, change labels, rearrange panels, or insert the same visual material into a new context.

In scientific manuscripts, reused figures can be especially serious because figures often represent experimental evidence. A duplicated microscopy image, blot panel, chart, or diagram may affect how readers interpret the study.

Examples of Figure Reuse

Duplicated image panels in different papers.
Cropped or resized scientific images.
Rotated, flipped, or contrast-adjusted visuals.
Relabeled plots with similar data patterns.
Reused diagrams with changed terms.
Repeated screenshots or interface images.
Composite figures with rearranged panels.

Figure Type	Reuse Risk	Detection Challenge
Microscopy image	Duplicated visual evidence	Cropping, resizing, and contrast changes can hide reuse.
Western blot	Repeated or relabeled panels	Partial reuse may appear only in small regions.
Chart or plot	Same data redrawn in a new format	The visual style may change while values remain similar.
Diagram	Conceptual or structural reuse	Labels may be rewritten while the logic stays the same.
Screenshot	Interface or dataset reuse	Compression and resizing can affect image comparison.
Flowchart	Repeated process structure	Layout may be modified while the sequence remains similar.

Table Reuse

Tables are another important source of reuse evidence. A table can be copied even when the surrounding explanation is rewritten. Authors may rename columns, reorder rows, round values, change units, or reformat the table.

Table reuse detection must therefore look beyond visible text. It should compare structure, values, headers, relationships, and data patterns.

What Table Reuse Can Look Like

The same rows and columns appear with minor wording changes.
Headers are renamed but the table structure stays the same.
Values are rounded or converted into percentages.
Rows are reordered to make copying less obvious.
The same dataset appears as a chart in another paper.
Statistical patterns match across different documents.

Table Signal	What It Shows	Reviewer Value
Cell value match	Exact or near-exact numeric overlap	Helps detect copied data.
Header similarity	Similar variable names or categories	Helps connect related tables.
Row and column structure	Repeated organization of data	Shows reuse even after visual reformatting.
Numeric pattern	Similar distribution, ranking, or ratio	Helps find transformed data reuse.
Caption link	Connection between table title and content	Improves context for human review.

Why Multimodal AI Is Needed

Manuscripts are mixed-format documents. A single PDF may contain body text, tables, captions, figures, references, equations, footnotes, metadata, and supplementary links. Each part can carry meaning.

A text-only system may miss reuse because one modality can hide another. The text may be rewritten while the table stays the same. A figure may be reused while the caption is changed. A chart may be redrawn from the same dataset. A caption may be copied while the image is slightly edited.

Multimodal AI is designed to connect these signals. It can compare text with text, images with images, tables with tables, captions with captions, and even tables with charts.

Core Architecture of a Multimodal Reuse Detection System

A multimodal screening system should first parse the manuscript into meaningful components. Then each component should go through a specialized pipeline. After that, the system should combine the signals into a report that reviewers can inspect.

Suggested Architecture Flow

Manuscript
   ↓
Document Parser
   ├── Text Pipeline
   ├── Figure Pipeline
   ├── Table Pipeline
   ├── Caption Pipeline
   └── Metadata Pipeline
   ↓
Multimodal Similarity Fusion
   ↓
Reviewer Report

This architecture keeps each detection task focused. Text comparison needs different methods than figure comparison. Table analysis needs different logic than caption analysis. The fusion layer connects all signals and helps reviewers see the full picture.

Pipeline	Main Task	Output for Review
Text pipeline	Find copied, paraphrased, or translated passages	Highlighted text matches and source links
Figure pipeline	Detect visual similarity and duplicated panels	Side-by-side figure comparison
Table pipeline	Compare structure, headers, and numeric values	Aligned rows, columns, and value matches
Caption pipeline	Analyze figure and table descriptions	Caption similarity and context notes
Metadata pipeline	Track file, author, source, and submission data	Audit-ready context for reviewers
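To make the flow above more concrete, here is a minimal Python sketch of a parser output being passed to separate pipelines, each returning findings for a report. Every name in it (`ParsedManuscript`, `Finding`, `run_pipelines`, the toy pipelines, and the `KNOWN_PASSAGES` lookup) is a hypothetical illustration, not part of any real product or library.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    modality: str   # e.g. "text", "figure", "table", "caption", "metadata"
    source_id: str  # identifier of the matched source (hypothetical)
    score: float    # similarity in the range 0.0 to 1.0
    note: str = ""


@dataclass
class ParsedManuscript:
    # Output of the document parser, one list per modality.
    passages: list[str] = field(default_factory=list)
    figures: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


Pipeline = Callable[[ParsedManuscript], list[Finding]]

# Assumption: a tiny in-memory index standing in for a real source database.
KNOWN_PASSAGES = {"cells were incubated at 37 c for 24 hours": "source-001"}


def toy_text_pipeline(doc: ParsedManuscript) -> list[Finding]:
    """Flag passages that match a known passage after simple normalization."""
    findings = []
    for passage in doc.passages:
        key = passage.lower().strip()
        if key in KNOWN_PASSAGES:
            findings.append(Finding("text", KNOWN_PASSAGES[key], 1.0, "exact match"))
    return findings


def toy_caption_pipeline(doc: ParsedManuscript) -> list[Finding]:
    """Placeholder: a real caption pipeline would compare doc.captions."""
    return []


def run_pipelines(doc: ParsedManuscript,
                  pipelines: dict[str, Pipeline]) -> dict[str, list[Finding]]:
    """Run each modality pipeline and keep its findings separate."""
    return {name: pipeline(doc) for name, pipeline in pipelines.items()}
```

Keeping the findings separated by pipeline name means a later fusion step can still show reviewers which modality produced which signal.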

Document Parsing and Layout Analysis

Before multimodal AI can compare anything, it must understand the document structure. A manuscript is not just a stream of words. It has sections, captions, tables, figures, references, and visual regions.

Layout analysis helps the system identify where each element starts and ends. It can separate body text from captions, tables from paragraphs, and figure panels from surrounding labels.

Elements to Extract

Title and abstract.
Section headings.
Paragraphs and passages.
Tables and table captions.
Figures and figure captions.
Composite figure panels.
References and citations.
Footnotes and endnotes.
Supplementary materials.
Embedded labels and annotations.

Strong layout analysis improves every later step. If the system mistakes a caption for body text or fails to split a composite figure into panels, detection quality can drop.

Text Reuse Detection Layer

The text layer checks written content against existing sources. It can use exact matching, fuzzy matching, semantic similarity, and paraphrase detection.

Exact matching helps detect direct copying. Fuzzy matching catches lightly edited passages. Semantic similarity and paraphrase detection help find rewritten passages that preserve the same meaning. Passage alignment helps reviewers see which parts of the manuscript match which parts of the source.

Core Text Detection Methods

Fingerprinting for fast exact and near-exact matching.
N-gram comparison for phrase-level overlap.
Fuzzy matching for lightly edited passages.
Sentence embeddings for semantic similarity.
Transformer-based models for paraphrase detection.
Source retrieval for finding possible original documents.
Passage alignment for reviewer-friendly reports.
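As one illustration of the n-gram comparison listed above, the sketch below breaks two passages into word trigrams (often called shingles) and measures their overlap with Jaccard similarity. The function names and the default `n = 3` are assumptions chosen for the example. This kind of check catches copied and lightly edited phrasing, not paraphrase.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of n-grams the two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(a, n), shingles(b, n)
    if not set_a or not set_b:
        return 0.0  # one text is too short to form any n-gram
    return len(set_a & set_b) / len(set_a | set_b)


manuscript = "The quick brown fox jumps"
source = "the quick brown fox sleeps"
print(jaccard(manuscript, source))  # 0.5: two of four distinct trigrams are shared
```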

Figure Reuse Detection Layer

The figure layer compares visual content. It must be robust because reused images may be edited before reuse. Simple pixel matching is not enough.

A strong figure pipeline can compare entire images, figure panels, local regions, and visual embeddings. It can also detect common transformations such as cropping, resizing, rotation, flipping, contrast adjustment, or compression.

Figure Detection Methods

Perceptual hashing for fast visual similarity checks.
Feature matching for local image regions.
Image embeddings for semantic visual comparison.
Siamese-style comparison for duplicate detection.
Nearest-neighbor search for large visual indexes.
Panel segmentation for composite figures.
OCR for labels, axis text, legends, and annotations.
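To show what perceptual hashing means in practice, here is a dependency-free sketch of an average hash. It assumes the image has already been decoded into a grid of grayscale values (a list of rows). The function names and the 8x8 grid size are illustrative choices. A production system would more likely rely on an imaging library and a tuned threshold.

```python
def average_hash(pixels: list[list[float]], size: int = 8) -> int:
    """Shrink a grayscale image to size x size cells and hash it.

    Each bit records whether a cell is brighter than the mean of all cells.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# A 16x16 horizontal gradient, a uniformly brighter copy, and a mirrored copy.
image = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[value + 10 for value in row] for row in image]
mirrored = [list(reversed(row)) for row in image]

print(hamming(average_hash(image), average_hash(brighter)))  # 0
print(hamming(average_hash(image), average_hash(mirrored)))  # 64
```

The second result shows the limit of this method: a uniform brightness change leaves the hash untouched, but a simple flip changes every bit. That is why hashing is usually paired with local feature matching or image embeddings.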

Figure Panel Segmentation

Many scientific figures are composite figures. They contain several panels, often labeled A, B, C, or D. Comparing the figure as a whole may miss duplication if only one panel was reused.

Panel segmentation solves this problem by splitting a composite figure into smaller parts. Each panel can then be compared separately against source images.

Why Panel-Level Detection Matters

One reused panel may be hidden inside a larger figure.
Panel labels may be changed or removed.
Panels may be rearranged in a new order.
Only a cropped region may be duplicated.
Reviewer reports become more precise.
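A very simple way to picture panel segmentation is to look for blank gutters between panels. The sketch below assumes a grayscale image given as a list of rows, panels placed side by side, and near-white columns separating them. The function name and the `white` and `min_gutter` thresholds are hypothetical. Real figures often need horizontal splits, border detection, or a trained detector as well.

```python
def split_panels_by_gutters(pixels: list[list[int]],
                            white: int = 250,
                            min_gutter: int = 2) -> list[tuple[int, int]]:
    """Return (start_col, end_col) ranges, end exclusive, one per panel.

    A column is blank when every pixel in it is at least `white`.
    A run of `min_gutter` or more blank columns ends the current panel.
    """
    width = len(pixels[0])
    is_blank = [all(row[x] >= white for row in pixels) for x in range(width)]

    panels = []
    start = None      # first column of the panel being scanned
    blank_run = 0     # length of the current run of blank columns
    for x in range(width):
        if is_blank[x]:
            blank_run += 1
            if start is not None and blank_run >= min_gutter:
                panels.append((start, x - blank_run + 1))
                start = None
        else:
            if start is None:
                start = x
            blank_run = 0
    if start is not None:
        panels.append((start, width - blank_run))
    return panels


# Two dark panels (columns 0-3 and 7-11) separated by a 3-column white gutter.
row = [0] * 4 + [255] * 3 + [0] * 5
figure = [row[:] for _ in range(4)]
print(split_panels_by_gutters(figure))  # [(0, 4), (7, 12)]
```

Each returned column range can then be cropped and compared against source images on its own.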

Table Reuse Detection Layer

The table layer extracts and compares tabular data. This can be difficult in PDFs because table lines, merged cells, footnotes, and formatting styles vary.

A table reuse detector should understand both structure and values. Two tables may look different but still contain the same data. Another table may preserve the same logic while changing headers or row order.

Important Table Analysis Tasks

Detect table boundaries.
Identify rows and columns.
Recognize headers and subheaders.
Handle merged cells.
Extract numeric values.
Normalize units and percentages.
Compare row and column relationships.
Connect tables with captions and text mentions.
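Several of the tasks above (extracting numeric values, normalizing percentages, comparing rows regardless of order) can be sketched in a few lines. The example assumes tables have already been extracted into lists of string cells, that commas are thousands separators, and that a percentage should be compared as a fraction. All function names are illustrative.

```python
from typing import Optional


def normalize_cell(cell: str) -> Optional[float]:
    """Convert a cell to a number, or None if it is not numeric.

    Assumptions: "," is a thousands separator and "45%" means 0.45.
    """
    text = cell.strip().replace(",", "")
    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1]
    try:
        value = float(text)
    except ValueError:
        return None
    return value / 100 if is_percent else value


def row_signature(row: list[str], digits: int = 2) -> tuple[float, ...]:
    """Sorted, rounded numeric values of a row; ignores labels and column order."""
    values = (normalize_cell(cell) for cell in row)
    return tuple(sorted(round(v, digits) for v in values if v is not None))


def table_overlap(table_a: list[list[str]], table_b: list[list[str]],
                  digits: int = 2) -> float:
    """Share of numeric rows of the smaller table that also occur in the other."""
    sig_a = {row_signature(r, digits) for r in table_a} - {()}
    sig_b = {row_signature(r, digits) for r in table_b} - {()}
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / min(len(sig_a), len(sig_b))


original = [["Group", "Mean", "Rate"],
            ["A", "1,200", "45%"],
            ["B", "950", "30%"]]
reworked = [["Cohort", "Share", "Avg"],   # renamed headers, swapped columns
            ["B", "0.30", "950.0"],       # reordered rows, fractions not percent
            ["A", "0.45", "1200"]]
print(table_overlap(original, reworked))  # 1.0
```

A score like this only says that the numbers line up. Whether the shared data is properly attributed is still a question for the reviewer.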

Caption and Label Analysis

Captions are essential because they explain what a figure or table means. Even if visual material is modified, the caption may preserve similar wording, terminology, or experimental context.

Labels also matter. Axis labels, legends, annotations, panel names, and table headers can reveal reuse that image comparison alone may not catch.

Caption and Label Signals

Signal	Example	Why It Helps
Copied caption	Same figure description appears in another manuscript	Connects visual reuse to source context.
Paraphrased caption	Same experiment described with changed wording	Helps detect concealed reuse.
Axis label match	Same chart variables and measurement units	Supports chart or dataset comparison.
Panel label match	Same subfigure labels or descriptions	Helps align composite figures.
Table header match	Similar column names or statistical categories	Supports table structure comparison.

OCR and Layout-Aware AI

OCR helps extract text from images, scanned pages, chart labels, table cells, and figure annotations. However, OCR alone is not enough.

A manuscript screening system also needs layout awareness. It must understand where text appears, what role it plays, and how it connects to nearby visual elements.

For example, the same word may appear in a paragraph, a caption, a table header, or an axis label. Each location has a different meaning. Layout-aware AI helps preserve that context.
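One way to preserve that context is to attach a role to every OCR token based on the layout region it falls in. The sketch below is a hypothetical illustration: it assumes a layout step has already produced rectangular regions with roles, and that OCR returns tokens with a center point. The class and function names are not taken from any real OCR library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Region:
    role: str   # e.g. "caption", "table_header", "axis_label"
    box: Box


@dataclass(frozen=True)
class Token:
    text: str
    x: float    # center of the token on the page
    y: float


def tag_tokens(tokens: list[Token], regions: list[Region],
               default_role: str = "paragraph") -> list[tuple[str, str]]:
    """Pair each token with the role of the first region that contains it."""
    tagged = []
    for token in tokens:
        role = next(
            (r.role for r in regions if r.box.contains(token.x, token.y)),
            default_role,
        )
        tagged.append((token.text, role))
    return tagged


regions = [Region("axis_label", Box(0, 300, 200, 320)),
           Region("caption", Box(0, 330, 400, 360))]
tokens = [Token("Temperature", 100, 310),   # inside the axis label region
          Token("Temperature", 50, 340),    # inside the caption region
          Token("Temperature", 50, 600)]    # outside both regions
print(tag_tokens(tokens, regions))
# [('Temperature', 'axis_label'), ('Temperature', 'caption'), ('Temperature', 'paragraph')]
```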

How OCR and Layout AI Work Together

Component	Role	Example Output
OCR	Extract readable text from visual regions	Axis labels, table cells, panel annotations
Layout analysis	Detect document regions and structure	Caption, table, figure, paragraph, reference
Visual embeddings	Represent image content for comparison	Similarity between figure panels
Text embeddings	Represent meaning of captions and passages	Semantic caption or paragraph match
Fusion layer	Combine evidence across modalities	Unified reviewer report

Cross-Modal Reuse Scenarios

Some reuse patterns cross the boundary between modalities. This is where multimodal AI becomes especially useful.

A table in one manuscript may become a chart in another. A figure may be reused while the caption is rewritten. A text paragraph may describe the same dataset that appears visually elsewhere. A supplementary file may contain reused evidence that does not appear in the main manuscript.

Examples of Cross-Modal Reuse

Scenario	What Changes	What AI Should Compare
Table-to-chart reuse	Data values are redrawn as a graph	Numeric values, chart labels, and data patterns
Figure-to-caption mismatch	Image is reused but caption is rewritten	Visual similarity and caption meaning
Text rewritten, evidence reused	Paragraphs change but table or figure stays similar	Visual and numeric evidence, not only text
Supplementary reuse	Reused material appears outside the main paper	Main manuscript and supplementary files together
Translated labels	Figure labels change language	OCR, translation-aware text similarity, and visual match
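For the table-to-chart case, one possible check is to compare the numeric pattern of the two series instead of the raw values, since a chart may show the same data on a different scale. The sketch below uses the Pearson correlation coefficient and assumes the chart values have already been recovered by a separate extraction step, which is a hard problem in its own right. The function name and sample numbers are illustrative.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series (-1.0 to 1.0)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no pattern to compare
    return cov / sqrt(var_x * var_y)


table_values = [12.0, 30.0, 45.0, 13.0]   # percentages from a source table
chart_values = [0.12, 0.30, 0.45, 0.13]   # bar heights read from a chart
print(round(pearson(table_values, chart_values), 6))  # 1.0
```

A correlation near 1.0 is a cue for closer inspection, not proof of reuse. Short series and public datasets can produce high values legitimately.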

Similarity Fusion: Combining Signals

Multimodal screening should not collapse all evidence into one opaque score. A single score may be useful as a summary, but reviewers need to see the evidence behind it.

The system can calculate separate scores for text, figures, tables, captions, and metadata. Then it can combine them into a risk signal or review priority.
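A minimal sketch of such a fusion step is shown below. The weights, the 0.8 "strong signal" threshold, and the priority rules are illustrative assumptions, not recommended values. The point is that the output keeps the per-modality scores next to the combined one.

```python
# Assumption: illustrative weights, to be tuned on real review outcomes.
DEFAULT_WEIGHTS = {
    "text": 0.30,
    "figure": 0.30,
    "table": 0.25,
    "caption": 0.10,
    "metadata": 0.05,
}


def fuse(scores: dict[str, float],
         weights: dict[str, float] = DEFAULT_WEIGHTS,
         strong_threshold: float = 0.8) -> dict:
    """Combine per-modality scores into a review priority.

    Only modalities that were actually scored contribute, so a manuscript
    without tables is not penalized for a missing table score.
    """
    used = {m: s for m, s in scores.items() if m in weights}
    total_weight = sum(weights[m] for m in used)
    combined = (
        sum(weights[m] * s for m, s in used.items()) / total_weight
        if total_weight else 0.0
    )
    strong = sorted(m for m, s in used.items() if s >= strong_threshold)

    if len(strong) >= 2:
        priority = "high"      # independent modalities agree
    elif strong or combined >= 0.5:
        priority = "medium"
    else:
        priority = "low"

    return {
        "combined": round(combined, 3),
        "per_modality": used,          # evidence stays visible
        "strong_signals": strong,
        "priority": priority,
    }


result = fuse({"text": 0.9, "figure": 0.85})
print(result["priority"], result["strong_signals"])  # high ['figure', 'text']
```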

How Combined Signals Can Be Interpreted

Signal Combination	Possible Meaning	Reviewer Action
Text match + figure match	Strong reuse evidence across written and visual content	Compare source and manuscript together.
Figure match only	Possible reused visual evidence	Inspect panels, labels, and source context.
Table values match	Possible data reuse	Check dataset attribution and permission.
Caption + image match	Likely reused figure context	Verify source, citation, and reuse policy.
Text rewritten + same table	Possible concealed reuse	Review data origin and related publications.
Same chart structure	Possible redrawn data reuse	Extract and compare values.

Reviewer Workflow and Explainability

Multimodal AI should support human review, not replace it. The system should make evidence easier to inspect. It should show what matched, where it matched, how strong the signal is, and which source should be checked.

A good reviewer report should group related findings. If the same source appears in text, figures, and captions, the report should connect those signals instead of showing them as unrelated alerts.

Useful Report Elements

Document-level summary of reuse signals.
Separate text, figure, table, and caption findings.
Side-by-side comparison with matched source material.
Panel-level figure comparison.
Aligned table rows, columns, and values.
Caption and label similarity explanation.
Severity labels and confidence indicators.
Filters for weak or common matches.
Reviewer notes and decision fields.
Exportable report for audit or editorial review.
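Grouping findings by source, so that one source matched in several modalities appears as a single connected item, could be sketched as follows. The dictionary keys (`source`, `modality`, `score`) and the sort order are assumptions made for the example.

```python
from collections import defaultdict


def group_by_source(findings: list[dict]) -> list[dict]:
    """Group findings by matched source and rank the groups.

    Sources matched in more modalities come first, then higher scores.
    """
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["source"]].append(finding)

    summary = []
    for source, items in grouped.items():
        summary.append({
            "source": source,
            "modalities": sorted({item["modality"] for item in items}),
            "max_score": max(item["score"] for item in items),
            "findings": items,
        })
    summary.sort(key=lambda g: (len(g["modalities"]), g["max_score"]),
                 reverse=True)
    return summary


findings = [
    {"source": "paper-A", "modality": "text", "score": 0.72},
    {"source": "paper-B", "modality": "text", "score": 0.95},
    {"source": "paper-A", "modality": "figure", "score": 0.88},
    {"source": "paper-A", "modality": "caption", "score": 0.64},
]
for group in group_by_source(findings):
    print(group["source"], group["modalities"], group["max_score"])
# paper-A ['caption', 'figure', 'text'] 0.88
# paper-B ['text'] 0.95
```

In this ordering, a source that matches across three modalities is listed before a source with a single stronger text match. That choice is a design decision and could reasonably be reversed.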

False Positives in Multimodal Detection

Not every match is misconduct. Multimodal screening can produce false positives if it does not understand context.

Some figures are standard diagrams. Some tables come from public datasets. Some methods sections contain expected wording. Some templates are reused legitimately. Some charts may look similar because they use the same official statistics.

False Positive Source	Why It Happens	How to Reduce It
Standard methods text	Researchers describe the same procedure	Use section-aware weighting and citation context.
Public datasets	Many papers use the same official data	Check attribution and dataset source.
Generic diagrams	Common concepts may use similar visuals	Compare labels, captions, and originality context.
Journal templates	Formatting and structure may repeat	Ignore known template regions.
Reference tables	Citation lists can look similar	Separate references from research evidence.
Stock visuals or icons	Reusable visual assets may appear in many documents	Use source type filters and reviewer labels.
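Section-aware weighting and citation context from the table above can be expressed as a small adjustment to the raw similarity score. The section weights and the citation discount below are purely illustrative assumptions. A cited source can still be overused, so the discount lowers priority without hiding the match.

```python
# Assumption: illustrative weights per manuscript section.
SECTION_WEIGHTS = {
    "references": 0.0,   # citation lists are expected to overlap
    "methods": 0.4,      # standard procedures often share wording
    "introduction": 0.8,
    "results": 1.0,
    "discussion": 1.0,
}


def adjusted_score(raw_score: float, section: str,
                   source_is_cited: bool,
                   citation_discount: float = 0.5) -> float:
    """Scale a raw similarity score by section and citation context.

    Unknown sections keep full weight so that nothing is silently ignored.
    """
    weight = SECTION_WEIGHTS.get(section.lower(), 1.0)
    if source_is_cited:
        weight *= citation_discount
    return raw_score * weight


print(adjusted_score(0.9, "results", source_is_cited=False))     # 0.9
print(adjusted_score(0.9, "references", source_is_cited=False))  # 0.0
```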

Evaluation Metrics for Multimodal Reuse Detection

Multimodal AI should be evaluated separately for each modality. Text detection, image detection, table extraction, and reviewer usability all need their own quality checks.

A system may perform well on text but poorly on figures. Another system may detect images accurately but produce a confusing report. Technical accuracy and reviewer usefulness should be measured together.

Evaluation Area	Metric or Question	Why It Matters
Text detection	Precision, recall, F1, passage-level overlap	Measures copied and paraphrased text detection.
Image detection	Duplicate accuracy, transformation robustness, panel recall	Measures figure reuse detection quality.
Table extraction	Cell accuracy, row alignment, header recognition	Measures whether table data was parsed correctly.
Table similarity	Numeric similarity, structure match, pattern detection	Measures data reuse detection.
Caption analysis	Semantic similarity and source alignment	Connects visual elements with textual context.
Reviewer usefulness	Time saved, clarity, confidence, false-positive reduction	Shows whether the report helps real decisions.
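Precision, recall, and F1 from the table above can be computed per modality from two sets: the items the system flagged and the items reviewers confirmed as reuse. The sketch below assumes each item (a passage, panel, or table) has a stable identifier. The identifiers shown are made up.

```python
def precision_recall_f1(flagged: set[str],
                        confirmed: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one modality.

    precision: share of flagged items that were truly reused
    recall:    share of truly reused items that were flagged
    """
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


flagged_panels = {"p1", "p2", "p3", "p4"}
confirmed_panels = {"p1", "p2", "p5"}
precision, recall, f1 = precision_recall_f1(flagged_panels, confirmed_panels)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Running this once per modality makes it visible when a system is strong on text but weak on figures, which a single pooled number would hide.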

Security and Privacy Considerations

Manuscripts are sensitive documents. They may include unpublished research, confidential results, personal data, peer-review material, commercial information, or copyrighted content.

A multimodal screening platform must protect this material during upload, parsing, indexing, comparison, reporting, storage, and deletion.

Important Security Controls

Encryption in transit.
Encryption at rest.
Role-based access control.
Tenant isolation for institutions and publishers.
Audit logs for document access.
Secure deletion rules.
Retention settings by client or policy.
Restricted reviewer permissions.
Protected backup storage.
Clear policy for model training and data reuse.

Submitted manuscripts should not be reused for model training unless there is clear permission, legal basis, and transparent policy. Trust is essential in manuscript screening.

Common Mistakes in Multimodal Screening Design

Treating a PDF as Plain Text Only

This removes figures, tables, captions, and layout from the analysis. The system may miss the most important evidence.

Comparing Full Figures Without Panel Detection

A duplicated panel may be hidden inside a larger composite figure. Panel-level comparison gives more precise results.

Ignoring Table Structure

Tables can be reused after headers are renamed or rows are reordered. Numeric and structural comparison is necessary.

Using One Score for Everything

A single score can hide important details. Reviewers need separate evidence for text, figures, tables, and captions.

No Explanation Layer

A flag without explanation is not enough. Reviewers need to understand why the system marked an item as suspicious.

Best Practices Checklist

Parse manuscripts into text, figures, tables, captions, and metadata.
Use separate pipelines for each modality.
Segment composite figures into individual panels.
Apply OCR to embedded labels, legends, and annotations.
Compare tables by values, structure, and relationships.
Connect captions to figures and tables.
Use multimodal fusion for stronger evidence.
Show side-by-side source comparisons to reviewers.
Separate exact reuse from transformed reuse.
Control false positives with context filters.
Keep human review in the loop.
Protect unpublished manuscripts with strict security controls.
Evaluate both technical accuracy and reviewer usefulness.

Suggested Multimodal Screening Summary

Modality	Detection Method	Main Risk Detected	Reviewer Output
Text	Fingerprinting, semantic similarity, passage alignment	Copying, paraphrasing, translation	Highlighted passages and source links
Figures	Image embeddings, perceptual hashing, panel matching	Visual reuse, manipulation, duplication	Side-by-side figure comparison
Tables	Structure and numeric comparison	Data reuse and copied table logic	Aligned rows, columns, and values
Captions	Text similarity and OCR	Reused figure or table descriptions	Caption match explanation
Layout	Layout-aware parsing	Hidden context and document structure reuse	Section-aware report
Metadata	File, source, and submission metadata analysis	Reuse patterns and source connections	Audit-ready evidence

Conclusion

Text-only plagiarism detection is no longer enough for complex manuscript screening. Modern academic and scientific documents communicate meaning through paragraphs, figures, tables, captions, visual evidence, and structured data.

Multimodal AI helps reviewers see the full document. It can detect copied text, reused figures, duplicated panels, repeated table values, similar captions, and cross-modal reuse patterns. This creates a stronger and more complete review process.

The best multimodal systems do not replace human judgment. They organize evidence, reduce blind spots, and help reviewers decide which similarities matter. A strong screening platform should protect not only the originality of the text, but also the integrity of the data, figures, and research evidence behind it.

FAQ

What is multimodal reuse detection?

Multimodal reuse detection checks similarity across different parts of a manuscript, including text, figures, tables, captions, layout, and metadata.

Why is text-only plagiarism detection limited?

Text-only systems may miss copied figures, reused tables, redrawn charts, duplicated image panels, or reused data structures.

Can AI detect reused scientific figures?

AI can help detect visual duplication and transformed image reuse, including cropping, rotation, resizing, contrast changes, and partial edits. Human review is still needed to confirm context and severity.

How can tables be checked for reuse?

Tables can be compared by cell values, row and column structure, headers, numeric patterns, captions, and links to charts or related text.

Should multimodal AI make the final misconduct decision?

No. Multimodal AI should provide evidence and prioritization. Human reviewers should make the final decision based on context, attribution, permission, and policy.