Blog | Microsoft Fabric

Practical AI:
Document Forgery Detection with Microsoft Fabric

20 June 2025

Jacob Matuzevičius

Data, Intelligence & Analytics Consultant at Decision Inc. UK

In many organisations, scanned documents remain a quiet vulnerability. They are central to critical workflows – contracts, supplier agreements, onboarding forms – but often move through systems with limited validation. The risks aren’t always dramatic, but they are persistent. Forged or duplicated documents, subtle edits, and overlooked omissions can introduce downstream errors, reputational damage, and compliance concerns.

Manual checks remain the default in many business units. But as document volumes grow and operational pressures mount, it’s clear this approach does not scale. We set out to explore whether modern tools could support a more effective and sustainable way to manage this risk.

Using Microsoft Fabric, we developed a proof of concept designed to flag potentially forged or manipulated documents before they progress through business-critical workflows. It was built with speed, simplicity, and scale in mind.

The Business Challenge

Organisations increasingly rely on scanned documents, but verification processes have not kept pace. Scans are often treated as static images—briefly reviewed, stored, and assumed to be valid. Even small errors or edits can pass unnoticed until they create downstream issues.

Whether it’s a missing signature, a modified date, or a slight layout change, these details matter. But catching them consistently across thousands of documents is not feasible without support from automation.

The challenge was to design a solution that could surface potential issues early, integrate into existing processes, and be built using tools already in place.

The Approach

We used Microsoft Fabric’s end-to-end analytics capabilities to build the entire solution natively on the platform. The process began with data extraction, using Fabric’s AI functions to convert scanned PDFs into structured content, including layout, key fields, form data, and text blocks. The model was trained on a labelled dataset that included both genuine and forged documents, with realistic variations across both types.
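As a simplified illustration of what the extraction step produces, each scanned PDF might be reduced to one structured record before feature engineering. The field names below are hypothetical, not the actual Fabric AI-function schema:

```python
# Hypothetical structured record produced by the extraction step.
# Field names are illustrative, not the actual Fabric AI-function output.
def to_record(doc_id, text_blocks, form_fields, page_count):
    """Flatten extracted document content into one row per document."""
    return {
        "doc_id": doc_id,
        "page_count": page_count,
        "num_text_blocks": len(text_blocks),
        "total_chars": sum(len(b) for b in text_blocks),
        # A missing signature field is exactly the kind of detail
        # the model should be able to see downstream.
        "has_signature_field": "signature" in {k.lower() for k in form_fields},
        "form_fields": form_fields,
    }

record = to_record(
    doc_id="contract-0042",
    text_blocks=["Agreement between ...", "Term: 24 months"],
    form_fields={"Signature": "", "Date": "2025-06-20"},
    page_count=3,
)
```

In practice each record would become a row in a Spark DataFrame in the Lakehouse, ready for the feature-engineering stage.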

To support the model’s performance, we carried out feature engineering in PySpark to capture patterns and anomalies commonly found in document structure and content. We then trained a PySpark classifier model using this feature set, validating the model with stratified sampling and tuning it to balance accuracy with review workload. Thresholds were calibrated to reduce false negatives while keeping the number of alerts practical for business teams. All development took place within Fabric Notebooks, enabling seamless collaboration, version control, and rapid iteration. Outputs were written back to the Lakehouse for integration into broader data workflows.
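The threshold-calibration step can be sketched in plain Python (PySpark omitted for brevity; `max_alert_rate` is an illustrative parameter, not taken from the original solution). Because false negatives only grow as the threshold rises, the lowest threshold whose alert rate stays within the review budget also minimises missed forgeries:

```python
def calibrate_threshold(scores, labels, max_alert_rate=0.5):
    """Choose the lowest score threshold whose alert rate is acceptable.

    scores: model likelihood-of-forgery per document (0..1)
    labels: 1 = known forged, 0 = genuine (from the labelled set)
    Returns (threshold, missed_forgeries) or None if no threshold fits.
    """
    n = len(scores)
    for t in sorted(set(scores)):  # candidate thresholds, low to high
        if sum(s >= t for s in scores) / n <= max_alert_rate:
            # Forged documents scoring below the threshold slip through.
            missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
            return t, missed
    return None

scores = [0.05, 0.10, 0.20, 0.40, 0.55, 0.70, 0.80, 0.90, 0.92, 0.95]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
threshold, missed = calibrate_threshold(scores, labels, max_alert_rate=0.5)
```

A real calibration run would sweep thresholds over held-out validation scores rather than a toy list, but the trade-off it encodes is the same one described above: fewer missed forgeries versus a manageable review queue.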

The solution was designed to act as an intelligent triage layer. For each document, it assigns a likelihood score based on patterns seen during training. Documents that appear clean pass through without interruption. Those that present anomalies are flagged for further review – reducing workload without compromising oversight.
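The triage behaviour described above can be sketched as a simple routing function (the score values and route names here are illustrative):

```python
def triage(doc_score, threshold):
    """Route a document by its forgery-likelihood score.

    Clean documents pass through uninterrupted; anomalous ones
    are flagged for human review.
    """
    return "review" if doc_score >= threshold else "pass"

# Hypothetical batch of scored documents.
batch = {"inv-001": 0.08, "inv-002": 0.91, "inv-003": 0.45}
routes = {doc: triage(score, threshold=0.70) for doc, score in batch.items()}
# Only flagged documents reach a human reviewer.
flagged = [doc for doc, route in routes.items() if route == "review"]
```

The point of this layer is workload reduction: reviewers see only the small flagged subset, while the rest of the pipeline proceeds untouched.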

The Outcome

Potential applications span several business areas. In procurement, the solution could help flag questionable supplier documentation before it enters approval workflows. Legal teams may use it to check the consistency of submitted contracts, while risk and compliance functions could apply it to audit scanned submissions against internal standards and regulatory expectations.

Because it’s built entirely within Fabric, the model can scale with existing data volumes and evolve as more training data becomes available. The pipeline can be scheduled, triggered by events, or extended to enrich outputs with contextual business data.

Why This Matters

What makes this solution powerful is not just the model but the architecture around it. Fabric’s Lakehouse environment enables a streamlined, secure approach to data handling, model development, and deployment. PySpark provided the scalability to handle large volumes, while Notebooks made experimentation rapid and auditable.

For business stakeholders, the benefits are tangible:

- Rapid deployment using existing tools and infrastructure
- Transparent and explainable model behaviour
- A clear path from pilot to production-ready integration

Rather than building a complex ML product, we focused on solving a single, well-defined problem in a way that aligns with how organisations already work.