The pitch for agentic document extraction is real-world accuracy. The cost is that every request becomes a small distributed system: a planner, two or three extractors, a validator chain, sometimes a post-processor escalation, often a sub-LLM call to handle ambiguity. When that pipeline misbehaves — wrong field, slow request, mysterious cost spike — you need to look inside it. This post is about how we made fluex's pipeline debuggable.

Why traditional logs aren't enough

The default reflex for an engineer debugging an LLM workflow is to add log lines: log the prompt, log the response, log the result. This works for a single LLM call. It breaks immediately for a four-step ReAct pipeline because:

  • Concurrency and causality are invisible. Did the validator run before or after the second extractor? Were any of them parallel? Logs in timestamp order don't show causal structure.
  • Costs aren't attributable. One request burned 12,000 tokens. Which step? You can find out, but only by stitching log lines from three components together across timestamps.
  • Failures cascade weirdly. The post-processor escalates because the validator failed; the validator failed because the extractor returned the wrong field; the extractor returned the wrong field because the planner mis-routed the request. Logs show four failures; one was the actual cause.

The fix is structured tracing. We use OpenTelemetry, but the pattern is what matters.
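
A minimal setup sketch with the Python SDK; the service name is illustrative, and you'd swap the exporter for whatever your backend ingests:

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # One provider per process; every pipeline component gets its tracer from it.
    provider = TracerProvider(
        resource=Resource.create({"service.name": "extraction-pipeline"})
    )
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("extraction")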

Span design

Every extraction request starts a root span. Each pipeline component creates a child span when it runs, and the spans nest naturally: the planner span covers the planning work; the planner's output feeds the extractor spans; validator spans run after the extractor spans they check; post-processor spans bracket any retries.
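
In Python that nesting falls out of context propagation. A sketch of the shape, with the pipeline components passed in as plain callables (stand-ins, not fluex's actual interfaces):

    from opentelemetry import trace

    tracer = trace.get_tracer("extraction")

    def handle_request(request, planner, extractor, validator, post_processor):
        # Root span: one per extraction request. Children nest via the
        # active-span context, so the trace mirrors the pipeline.
        with tracer.start_as_current_span("extraction.request"):
            with tracer.start_as_current_span("planner"):
                plan = planner(request)
            results = []
            for step in plan:
                with tracer.start_as_current_span("extractor"):
                    fields = extractor(step)
                with tracer.start_as_current_span("validator"):
                    results.append(validator(fields))
            # Escalation path only: the post-processor span brackets retries.
            if any(r.needs_retry for r in results):
                with tracer.start_as_current_span("post_processor"):
                    results = post_processor(results)
            return results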

What goes in each span (a sketch follows the list):

  • Inputs and outputs as span attributes — for LLM spans, the prompt hash (the full prompt goes to audit storage, not the trace), model identifier, temperature, and max-tokens. For validator spans, the rule identifier and the result.
  • Cost dimensions — input tokens, output tokens, model tier, vendor. Aggregating cost by tenant, by document type, by model is a query against trace attributes.
  • Outcome status — success, soft-failure (validator flagged), hard-failure (timeout, vendor error). This makes "what's failing right now" a one-query dashboard.
  • Tenant and request identifiers — every span has the tenant ID so customer-specific debugging is a filtered trace search.
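
The sketch, on an LLM-call span. The fluex.* attribute names are illustrative conventions, not an official schema; the point is that every one of them is a structured key, not a log line:

    import hashlib
    from opentelemetry import trace

    tracer = trace.get_tracer("extraction")

    def traced_llm_call(llm, prompt, request):
        with tracer.start_as_current_span("extractor.llm_call") as span:
            # Prompt by hash only; the full prompt lives in audit storage.
            span.set_attribute(
                "fluex.prompt_sha256", hashlib.sha256(prompt.encode()).hexdigest()
            )
            span.set_attribute("fluex.model", llm.model)
            span.set_attribute("fluex.tenant_id", request.tenant_id)
            span.set_attribute("fluex.request_id", request.request_id)  # audit-trail link
            response = llm.complete(prompt)  # hypothetical client interface
            # Cost dimensions: token counts turn cost questions into trace queries.
            span.set_attribute("fluex.tokens.input", response.input_tokens)
            span.set_attribute("fluex.tokens.output", response.output_tokens)
            span.set_attribute("fluex.outcome", "success")
            return response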

Critically: spans link to audit-trail records via the request ID. Traces are for engineers and observability; the audit trail is for customers and regulators. The two are separate stores with different retention rules but cross-referenceable on demand.

What we don't put in spans

Traces leak. New Relic (our observability provider) is a sub-processor. Customer document content does not go into spans. The discipline:

  • Prompts are referenced by hash, not value. The full prompt lives in audit storage, which has tighter access controls.
  • Document content is referenced by content hash. Span attributes carry summaries like "document contains 14 expected fields, 11 high-confidence," never the content itself.
  • Customer-named values (employee names, addresses) are scrubbed before egress to the observability vendor. We use a small egress filter that runs on every span before export. The filter has a maintained allow-list of attribute names; anything not on the list is treated as potentially-sensitive and dropped.
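
A sketch of that egress filter as an exporter wrapper. One caveat: the public ReadableSpan API is read-only at export time, so this reaches into SDK internals; a production filter might live in an OpenTelemetry Collector instead:

    from opentelemetry.sdk.trace.export import SpanExporter

    # Maintained allow-list: anything not named here is dropped before egress.
    ALLOWED_ATTRIBUTES = frozenset({
        "fluex.tenant_id", "fluex.request_id", "fluex.model",
        "fluex.prompt_sha256", "fluex.tokens.input", "fluex.tokens.output",
        "fluex.outcome",
    })

    class ScrubbingExporter(SpanExporter):
        """Wraps the real exporter; scrubs attributes on every span before export."""

        def __init__(self, wrapped):
            self._wrapped = wrapped

        def export(self, spans):
            for span in spans:
                # _attributes is SDK-internal; off-list keys are treated as
                # potentially sensitive and dropped.
                span._attributes = {
                    k: v for k, v in (span.attributes or {}).items()
                    if k in ALLOWED_ATTRIBUTES
                }
            return self._wrapped.export(spans)

        def shutdown(self):
            self._wrapped.shutdown()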

This discipline matters. We've written about why observability vendors are sub-processors in our piece on AI sub-processor governance — the short version is that careless tracing turns your observability vendor into an unintended custodian of customer data.

Sampling strategy

At fluex's volume, retaining every span is expensive and noisy. We sample by outcome:

  • 100% of failed requests — every error keeps its complete trace.
  • 100% of post-processor escalations — these are the interesting successes. They show where the planner's initial plan wasn't sufficient.
  • 5% of routine successes — sampled at the root span, propagated to children. Enough to trend cost and latency without retaining every uneventful trace.
  • Per-tenant overrides — Enterprise customers can opt into 100% tracing for their traffic when they're investigating an integration.

Sampling decisions happen at the root span and propagate to every child; because the outcome isn't known until the request finishes, the keep-or-drop call is made when the root span ends and applied to the trace as a whole. Mixed-sampling traces (some spans retained, some dropped) are useless for debugging.
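
A sketch of that policy as a buffering span processor. The fluex.* attribute names follow the earlier sketches, and a production version also needs thread-safety and eviction of traces whose root never ends:

    import random
    from collections import defaultdict
    from opentelemetry.sdk.trace import SpanProcessor

    class OutcomeTailSampler(SpanProcessor):
        """Buffer finished spans per trace; export the whole trace or none of it."""

        def __init__(self, exporter, success_rate=0.05, full_trace_tenants=frozenset()):
            self._exporter = exporter
            self._success_rate = success_rate
            self._full_trace_tenants = full_trace_tenants
            self._buffer = defaultdict(list)

        def on_end(self, span):
            self._buffer[span.context.trace_id].append(span)
            if span.parent is None:  # root span finished: decide for the whole trace
                spans = self._buffer.pop(span.context.trace_id)
                if self._keep(span):
                    self._exporter.export(spans)

        def _keep(self, root):
            attrs = root.attributes or {}
            if attrs.get("fluex.tenant_id") in self._full_trace_tenants:
                return True                                  # per-tenant override
            if attrs.get("fluex.outcome") in ("soft_failure", "hard_failure"):
                return True                                  # 100% of failures
            if attrs.get("fluex.escalated"):
                return True                                  # 100% of escalations
            return random.random() < self._success_rate      # 5% of routine successes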

Debugging workflows

The payoff for this investment is that customer support requests become tractable. "We sent you this document and got the wrong amount" becomes:

  1. Look up the request ID in the audit trail.
  2. Pull the trace by request ID.
  3. See the planner output, every extractor call, every validator result, every post-processor decision — in order, with timing and cost.
  4. Reproduce the request locally with the captured prompt and model version; the audit store has the full prompts (sketched after this list).
  5. Identify the actual root cause; ship a fix.
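
Step 4, sketched with hypothetical names; audit_store and llm_client are stand-ins for whatever you actually run, not fluex's internals:

    def reproduce(audit_store, llm_client, request_id):
        # Pull the exact prompt and model version the production request used.
        record = audit_store.get(request_id)     # hypothetical audit-store API
        response = llm_client.complete(
            model=record.model_version,          # pin to the captured version
            prompt=record.prompt,                # full prompt, from audit storage
            temperature=record.temperature,
        )
        # Diff against the production output to localize the divergence.
        return response, record.production_output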

For fluex, this turns "investigate a customer report" from a half-day operation into a 15-minute one. For our customers, it means our support response includes a real explanation, not "we'll look into it."

What we'd do differently

If we were starting today: more aggressive structured-attribute discipline from day one. We had an early period where some attributes were free-form strings, and the cleanup cost was real. Establish a small canonical attribute vocabulary early and reject anything outside it.
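
What that can look like: one module owns the vocabulary, and writes outside it fail loudly in development instead of becoming a cleanup project later. A sketch, reusing the attribute names from the earlier examples:

    # attributes.py: the only place span attribute names are defined.
    CANONICAL_ATTRIBUTES = frozenset({
        "fluex.tenant_id", "fluex.request_id", "fluex.model",
        "fluex.prompt_sha256", "fluex.tokens.input", "fluex.tokens.output",
        "fluex.outcome", "fluex.escalated", "fluex.document_type",
    })

    def set_attr(span, key, value):
        # Reject unknown names at write time; free-form strings never land.
        if key not in CANONICAL_ATTRIBUTES:
            raise ValueError(f"unknown span attribute: {key}")
        span.set_attribute(key, value)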

Also: bigger investment in trace-driven cost analytics earlier. We assumed cost would track tokens linearly. It does, but the variance per document type is enormous, and trace attributes were the only sane way to break it down. Building a cost analytics dashboard on top of traces — by tenant, by document type, by model — paid back in a quarter.
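
The rollup itself is simple once the attributes are structured. A sketch over exported span attributes, shown as plain dicts since the exact shape depends on the backend:

    from collections import defaultdict

    def cost_rollup(span_attrs):
        """Roll up token usage by (tenant, document type, model).

        span_attrs: iterable of attribute dicts for LLM-call spans,
        however the trace backend hands them back.
        """
        totals = defaultdict(lambda: {"input": 0, "output": 0})
        for attrs in span_attrs:
            key = (attrs.get("fluex.tenant_id"),
                   attrs.get("fluex.document_type"),
                   attrs.get("fluex.model"))
            totals[key]["input"] += attrs.get("fluex.tokens.input", 0)
            totals[key]["output"] += attrs.get("fluex.tokens.output", 0)
        return totals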

Closing

Agentic extraction is a small distributed system per request. Treat it like one. The teams that ship structured tracing on day one find the long-tail bugs in production and fix them before customers notice. The teams that skip it burn weeks of senior engineering time stitching together log files when something goes wrong.

For the architecture this tracing supports, see our ReAct architecture pillar. For the audit side of the same coin, see our follow-up on audit trails for non-deterministic outputs.