Why GDPR is different for document AI

Two specific things make GDPR more constraining for document AI than for, say, a CRM or a project management tool.

You don't know what's coming through the API

A typical SaaS vendor knows every field they store — they designed the schema. A document AI vendor receives a JPEG and is asked to extract structured data. The "documents" can include anything: a passport (identity data), an insurance form (health data, special category Art. 9), a payslip (employment data, possibly union membership Art. 9), a property contract (location data plus other parties' identity data). You handle Art. 6 lawful basis questions and Art. 9 special-category considerations on every request, not at schema design time.
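
Because category exposure is per-document-type rather than per-schema, a processor ends up encoding which document types can carry special-category data. A minimal sketch of that idea, with entirely hypothetical type names and mappings:

```python
from enum import Enum

class DataCategory(Enum):
    IDENTITY = "identity"            # ordinary personal data (Art. 6)
    FINANCIAL = "financial"
    HEALTH = "health"                # special category (Art. 9)
    UNION_MEMBERSHIP = "union"       # special category (Art. 9)
    BIOMETRIC = "biometric"          # special category (Art. 9)

# Hypothetical mapping: which categories each document type can carry.
DOC_TYPE_CATEGORIES = {
    "passport": {DataCategory.IDENTITY, DataCategory.BIOMETRIC},
    "insurance_form": {DataCategory.IDENTITY, DataCategory.HEALTH},
    "payslip": {DataCategory.IDENTITY, DataCategory.FINANCIAL,
                DataCategory.UNION_MEMBERSHIP, DataCategory.HEALTH},
    "property_contract": {DataCategory.IDENTITY, DataCategory.FINANCIAL},
}

SPECIAL_CATEGORIES = {DataCategory.HEALTH, DataCategory.UNION_MEMBERSHIP,
                      DataCategory.BIOMETRIC}

def requires_art9_handling(doc_type: str) -> bool:
    """True if this document type can carry Art. 9 special-category data."""
    return bool(DOC_TYPE_CATEGORIES.get(doc_type, set()) & SPECIAL_CATEGORIES)
```

The point of the sketch: the Art. 9 question is answered at request time from the document type, not at schema design time.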

Audit trail volume is a different magnitude

GDPR processors have to demonstrate processing on demand — to a regulator, to a controller, to a data subject under Art. 15. For traditional SaaS this is queryable. For document AI it requires logging the prompt, the model version, the response, and any post-processing — for every request, for the full retention period, with tenant isolation and access controls. Audit-trail design becomes load-bearing.
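
One way to keep that audit trail both complete and deletion-friendly is to log hashes of the prompt and response rather than the content itself, alongside the exact model version and post-processing steps. A hypothetical sketch (field names are illustrative, not fluex's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib

@dataclass(frozen=True)
class AuditRecord:
    tenant_id: str
    request_id: str
    timestamp: str
    model_version: str        # exact model identifier used for this request
    prompt_sha256: str        # hash, not the prompt, so the log survives erasure
    response_sha256: str
    post_processing: list     # ordered names of post-processing steps applied

def make_audit_record(tenant_id: str, request_id: str, model_version: str,
                      prompt: str, response: str, steps) -> AuditRecord:
    h = lambda s: hashlib.sha256(s.encode()).hexdigest()
    return AuditRecord(
        tenant_id=tenant_id,
        request_id=request_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_version=model_version,
        prompt_sha256=h(prompt),
        response_sha256=h(response),
        post_processing=list(steps),
    )
```

Hashing lets the processor prove *what* was processed and *with which model* without the audit log itself becoming a second copy of the personal data.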

These two facts shape every architectural decision a serious document AI vendor makes.

Controller vs processor — and why it matters

In almost every commercial document AI relationship, the customer is the GDPR controller and the document AI vendor is the processor. The customer chose what data to collect, has the lawful basis, and is responsible to data subjects. The processor handles data on the controller's instructions per Art. 28. That sounds simple. It isn't. Here are three places people get this wrong.

The processor needs the controller's instructions in writing

"Provide document extraction services" isn't enough. The DPA needs to specify the categories of data subjects, categories of personal data, processing operations, and retention periods. Tightly scoped DPAs make Art. 28 compliance straightforward. Loose ones generate audit findings.

Sub-processors are the processor's choice, but the controller must authorize them

When the processor calls OpenAI's API, OpenAI is a sub-processor. The processor cannot just spring this on customers: Art. 28(2) requires either specific authorization or general authorization with prior notice of changes. By 2026, 30-day notification of sub-processor changes has hardened from a nice-to-have into the regulatory expectation.

"Service improvement" is processor scope creep

If the processor uses customer data to train models, evaluate quality, or anything not strictly the contracted service, they've moved beyond Art. 28. Either it's a separate processing arrangement with a separate basis, or it's a violation. Most serious processors handle this by just turning it off — fluex does not train on customer data, full stop, and that prohibition is written into the DPA.

Lawful basis is the customer's problem (mostly)

The processor doesn't need to identify the lawful basis under Art. 6 — that's the controller's responsibility. But the architecture has to support every basis the controller might rely on:

  • Contract performance (Art. 6(1)(b)) — most common for KYC, onboarding, employment workflows
  • Legal obligation (Art. 6(1)(c)) — common for AML, tax compliance, employment records
  • Legitimate interest (Art. 6(1)(f)) — common for fraud detection, document verification
  • Consent (Art. 6(1)(a)) — sometimes, less common in B2B
  • Vital interests / public interest (Art. 6(1)(d) and (e)) — niche

What this means practically: the processor has to be able to handle a controller saying "we're operating under legitimate interest, document this in your records," vs "we're operating under consent, please honor consent withdrawal" — without architectural changes. Same engine, same workflow, different DPA terms.
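
"Same engine, different DPA terms" can be made concrete as per-tenant configuration: the lawful basis is recorded from the DPA, and only the consent path changes runtime behavior. A hypothetical sketch (class and field names are illustrative):

```python
from enum import Enum

class LawfulBasis(Enum):
    CONSENT = "art6_1a"
    CONTRACT = "art6_1b"
    LEGAL_OBLIGATION = "art6_1c"
    VITAL_INTERESTS = "art6_1d"
    PUBLIC_INTEREST = "art6_1e"
    LEGITIMATE_INTEREST = "art6_1f"

class TenantConfig:
    """Hypothetical per-tenant record of the basis agreed in the DPA."""
    def __init__(self, basis: LawfulBasis):
        self.basis = basis
        # Only a consent basis requires honoring withdrawal in the workflow.
        self.honor_consent_withdrawal = basis is LawfulBasis.CONSENT

def may_process(cfg: TenantConfig, consent_withdrawn: bool) -> bool:
    """Same engine for every basis; only the consent check varies."""
    if cfg.honor_consent_withdrawal and consent_withdrawn:
        return False
    return True
```

Under legitimate interest the engine documents the basis and proceeds; under consent the same engine gates on withdrawal. No architectural fork.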

Special-category data

Art. 9 special categories — racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic and biometric data, health, and data concerning sex life or sexual orientation — require an additional condition on top of Art. 6. For document AI this comes up constantly because:

  • Payslips can include union membership and health-insurance category codes (Art. 9)
  • Identity documents carry facial images, which become biometric data when processed for unique identification (Art. 9)
  • Medical bills are by definition health data (Art. 9)

The processor's responsibility is to support the controller's chosen Art. 9 condition — typically explicit consent, or substantial public interest under Art. 9(2)(g) for KYC. Architecturally that means tighter retention defaults for special-category data, scrubbing special-category values from audit metadata, and access logging that makes the Art. 32 security demonstration possible.

Cross-border transfers

This is where document AI gets thorny. Most LLM providers — OpenAI, Anthropic — operate in the US. EU data flowing through them crosses the Atlantic. Under Chapter V of GDPR, transfers outside the EEA need:

  • Adequacy decision — for the US this means the EU-US Data Privacy Framework, which covers only certified organizations; it remains in force but has been challenged
  • Standard Contractual Clauses (SCCs) — the workhorse, requires a Transfer Impact Assessment
  • Binding Corporate Rules — internal-only, doesn't help vendors
  • Derogations (Art. 49) — narrow, generally inappropriate for routine processing

Practically, every document AI vendor running EU traffic through US LLM providers operates under SCCs with a documented Transfer Impact Assessment. The processor needs to bind sub-processors to equivalent terms, document the encryption-in-transit posture, and have a position on US government access (the Schrems II concern).

This is also where the EU AI Act starts to overlap. The Act doesn't change GDPR but adds requirements for high-risk AI use, including documentation that overlaps with the Art. 30 record of processing.

Data subject rights

Under Arts. 15–22, data subjects can request access, rectification, deletion, portability, restriction, and object to processing (including automated decision-making). For a processor, the standard model is:

  1. Subject sends a request to the controller
  2. Controller verifies identity and routes the substantive request to the processor
  3. Processor executes within an agreed SLA (30 days from controller request is typical)
  4. Controller responds to the subject

The processor's job is to make this fast and verifiable. Specifically:

  • Access — given a tenant + subject identifier, return all extractions and metadata. Requires a data model that makes "all data about person X" a queryable concept, which it usually isn't by default.
  • Deletion — hard delete primary data, retain audit metadata only as required by other obligations (e.g., financial record retention).
  • Portability — JSON or another structured machine-readable export.
  • Objection / restriction — workflow flags that prevent further processing without re-authorization.
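
The access and deletion bullets both hinge on the same design decision: keying extractions by (tenant, subject) so "all data about person X" is one lookup rather than a scan. A hypothetical in-memory sketch of that data model (a real store would be a database with the same key structure):

```python
import json

class ExtractionStore:
    """Hypothetical store keyed so subject-rights requests are one lookup."""

    def __init__(self):
        self._data = {}  # (tenant_id, subject_id) -> list of extractions

    def record(self, tenant_id: str, subject_id: str, extraction: dict):
        self._data.setdefault((tenant_id, subject_id), []).append(extraction)

    def access(self, tenant_id: str, subject_id: str) -> list:
        """Art. 15 access: everything held about this subject, queryably."""
        return list(self._data.get((tenant_id, subject_id), []))

    def export_portable(self, tenant_id: str, subject_id: str) -> str:
        """Art. 20 portability: structured, machine-readable export."""
        return json.dumps(self.access(tenant_id, subject_id))

    def delete(self, tenant_id: str, subject_id: str):
        """Art. 17 erasure: hard-delete primary data.

        Audit metadata retained under other obligations lives in a
        separate store and is not touched here.
        """
        self._data.pop((tenant_id, subject_id), None)
```

If the subject identifier is only an afterthought in the schema, every one of these operations degrades into a full-corpus search, which is exactly the "not queryable by default" failure mode described above.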

What your DPA actually needs to say

A document AI DPA is a normal Art. 28 DPA with extra clauses for the AI-specific surface. The minimum:

  1. Sub-processor list and notification mechanism — who, where, what they access, with 30-day notification of changes
  2. Zero-retention LLM API configuration as a contractual term — not just a setting, written in
  3. No-training clause — explicit prohibition on using customer data for model training, evaluation, or "service improvement"
  4. Transfer mechanism — SCCs incorporated by reference, with module identification (controller-to-processor + processor-to-sub-processor where applicable)
  5. Retention defaults — and the customer's right to set lower defaults
  6. Subject-rights SLA — typically 30 days from controller request
  7. Security commitments — encryption, access controls, incident notification SLA (under Art. 33(2) the processor notifies the controller without undue delay; the 72-hour clock is the controller's deadline to the supervisory authority)
  8. Audit substitute — the controller's right to audit, plus alternatives (SOC 2 reports under NDA, security questionnaires)

If your DPA doesn't cover all of these, your processor probably isn't aligned with current EDPB guidance.

How fluex does it

We built fluex with GDPR processor obligations as architectural constraints, not afterthoughts:

  • Sub-processor registry as code — public listing on our trust page and 30-day customer notification SLA
  • Zero-retention — configured on OpenAI and Anthropic by default
  • No model training on customer data — written into the DPA, enforceable
  • Per-tenant encryption — keys in GCP KMS; CMEK on Enterprise
  • Access controls — dual-approval workflow with time-bound permissions; engineer access to customer document content requires a documented business reason and reviewer approval
  • Audit trail with model versioning — supports Art. 30 records on demand
  • Subject-rights tooling — access / deletion / portability operations exposed to controllers via API

For the full posture, see our trust page, or email legal@fluex.com for the DPA.

Closing thought

GDPR isn't a checklist. It's a design constraint.

It shapes what an AI product looks like. Document AI vendors who treat GDPR as a compliance problem build features and then patch the compliance afterward — and discover, painfully, that the patch is the architecture rebuild. Processors who treat GDPR as a design constraint build the architecture once.

The five commitments we wrote about in our SOC 2 piece — tenant isolation, audit trail with model versioning, sub-processor governance as code, access controls with break-glass auditability, per-tenant encryption — are exactly what GDPR Art. 32 (security of processing), Art. 28 (processor obligations), and Art. 30 (records of processing) ask for. Build them once. Use them everywhere.