Agents make retrieval harder, not obsolete
Larger context windows were supposed to solve retrieval. Load the entire corpus into memory, let the model reason over it, skip the fragility of semantic search. That assumption breaks down once agents move from research projects to production systems.
Context windows, even large ones, manage state within a single request. They let you pack conversation history, metadata, and reference documents into one shot so the model sees everything at once. That works for focused problems. It doesn't work for agents that need to explore an enterprise corpus through dozens or hundreds of queries in parallel, or for systems where the corpus changes faster than you can refresh the context.
Qdrant's recent shift from positioning itself as a vector database to positioning itself as an information retrieval layer tells you where the problem has moved. Their version 1.17 focuses on agentic retrieval patterns: relevance feedback loops, delayed fan-out to reduce wasted queries, cluster-wide telemetry to surface where retrieval actually breaks down. They're responding to a real signal: agents have changed the job retrieval has to do, and general-purpose databases weren't built for it.
Agents multiply retrieval demand in ways that create compounding failures beneath the surface. Three pressures stand out.
Volume is the first pressure. A single user session with an agent might run 50 to 200 queries to gather context for multi-step reasoning. At 1,000 concurrent users, you're looking at hundreds to thousands of queries per second on your retrieval layer. Most vector databases were built for dozens to maybe low hundreds of QPS. The infrastructure fails differently when query load changes by an order of magnitude.
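To make that arithmetic concrete, here is a back-of-envelope estimate. Only the 50-to-200 queries per session and the 1,000 concurrent users come from the numbers above; the session length and burst fan-out factor are assumptions for illustration.

```python
# Back-of-envelope load estimate; session length and fan-out factor
# are assumed values, not measurements.
QUERIES_PER_SESSION = 100   # midpoint of the 50-200 range above
SESSION_SECONDS = 10 * 60   # assume a 10-minute agent session
CONCURRENT_USERS = 1_000

per_user_qps = QUERIES_PER_SESSION / SESSION_SECONDS   # ~0.17 QPS per user
total_qps = per_user_qps * CONCURRENT_USERS            # ~167 QPS sustained

# Bursts are worse: if each reasoning step fans out 3 parallel queries,
# peak load can run several times the sustained average.
peak_qps = total_qps * 3
print(f"sustained ~{total_qps:.0f} QPS, bursts ~{peak_qps:.0f} QPS")
```

Under those assumptions you land in the hundreds of sustained QPS, with bursts approaching the low thousands, right where general-purpose vector databases start to strain.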
Parallelism and iterative refinement are the second pressure. Agents don't ask for answers once. They ask, evaluate, refine, branch into new questions based on what they learned. A single reasoning step might spawn three parallel retrieval queries, each one conditioning the next. One failing query in a sequence doesn't just miss information — it corrupts the reasoning that comes after it. That corruption compounds through the agent loop.
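A minimal sketch of that loop, assuming an asyncio-based agent. `retrieve` and `derive_followups` are hypothetical stand-ins for your retrieval layer and the model's refinement step:

```python
import asyncio

async def retrieve(query: str) -> list[str]:
    # Stand-in for a call to your retrieval layer (hypothetical stub).
    return [f"doc-for:{query}"]

def derive_followups(docs: list[str]) -> list[str]:
    # Stand-in for the model turning retrieved evidence into new questions.
    return [f"followup:{d}" for d in docs[:3]]

async def reasoning_step(questions: list[str]) -> list[str]:
    # One reasoning step fans out its retrieval queries in parallel.
    # A query that fails, or silently returns poor results, poisons
    # every later step that conditions on the merged evidence.
    batches = await asyncio.gather(*(retrieve(q) for q in questions))
    return [doc for batch in batches for doc in batch]

async def agent_loop(initial: list[str], steps: int = 3) -> list[str]:
    questions, evidence = initial, []
    for _ in range(steps):
        docs = await reasoning_step(questions)
        evidence.extend(docs)
        questions = derive_followups(docs)  # answers condition next queries
    return evidence

asyncio.run(agent_loop(["q1", "q2", "q3"]))
```

The structural point is the last line of the loop: each round's answers become the next round's queries, which is why a single bad retrieval doesn't stay contained.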
Recall degradation under continuous ingestion is the third pressure, and the one people underestimate most. You can tune recall for a static corpus. You can't tune it for one that changes by 5% or 10% daily while agents are running live queries against it. New documents create blind spots in your retrieval model. For a synchronous application, that's inconvenient. For an agent making decisions across multiple steps, it becomes a decision-quality problem that cascades through the reasoning chain.
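One way to surface those blind spots is a freshness probe: after each write, verify that the document is actually retrievable once the expected indexing lag has passed. This is a sketch under assumed interfaces; `search_fn` and `ingest_log` are hypothetical stand-ins for your search API and write log:

```python
import time

def freshness_probe(search_fn, ingest_log, max_lag_seconds=300):
    """Check whether recently ingested documents are actually retrievable.

    search_fn(query) -> list of doc ids; ingest_log yields tuples of
    (doc_id, probe_query, ingested_at). Both interfaces are assumptions.
    """
    misses = []
    for doc_id, probe_query, ingested_at in ingest_log:
        if time.time() - ingested_at < max_lag_seconds:
            continue  # still within the allowed indexing lag
        if doc_id not in search_fn(probe_query):
            misses.append(doc_id)  # blind spot: ingested but not findable
    return misses
```

Run continuously, a probe like this turns "recall is degrading" from a post-mortem finding into a live metric.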
The architecture emerging in response uses tiered retrieval. General-purpose infrastructure, say a Postgres instance with the pgvector extension, handles baseline performance and operational simplicity. You're not asking retrieval to be perfect at this stage. You're asking it to work and to deliver predictable performance. Most teams can get good-enough recall this way for 80% of their queries.
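A minimal sketch of that baseline tier, assuming Postgres with pgvector; the `documents` table and its `id`, `body`, and `embedding` columns are assumptions, and the query embedding would come from your embedding model:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection details

def baseline_search(query_embedding: list[float], k: int = 10) -> list[tuple]:
    # Nearest-neighbour search over a pgvector column;
    # '<->' is pgvector's L2 distance operator.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, body
            FROM documents
            ORDER BY embedding <-> %s::vector
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```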
A purpose-built IR system handles the 20% that matter. Qdrant, Milvus, or Vespa at this stage aren't replacing your primary data store. They're acting as a high-recall, low-latency retrieval frontier that agents query when they need confidence about what they've found. This is a different mental model than "vector database as your source of truth." It's "IR layer as a quality gate before agents make decisions."
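A matching sketch of the quality-gate query against Qdrant using the qdrant-client library; the collection name and score threshold are assumptions:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def high_recall_search(query_embedding: list[float], k: int = 20):
    # Query the purpose-built IR layer only when results gate a decision.
    hits = client.search(
        collection_name="documents",        # assumed collection name
        query_vector=query_embedding,
        limit=k,
        score_threshold=0.6,                # drop low-confidence matches
    )
    return [(hit.id, hit.score, hit.payload) for hit in hits]
```

The score threshold is what makes this a gate rather than a lookup: below it, the agent treats the result as "not found" instead of reasoning over a weak match.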
Your application writes to Postgres and Qdrant in parallel. Agents query Postgres for fresh, recent data and Qdrant for semantic depth across the full corpus. When the agent needs to make a decision that compounds across multiple steps, it validates results by checking both layers. This redundancy feels expensive until you realize that a retrieval failure at step 3 of a 5-step agent loop costs far more than a 50ms query overhead.
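Putting the tiers together, a sketch of the dual-write and cross-check pattern. It reuses `baseline_search` and `high_recall_search` from the sketches above, assumes both layers share document ids, and uses hypothetical `write_postgres` / `write_qdrant` helpers:

```python
def dual_write(doc_id: str, body: str, embedding: list[float]) -> None:
    # In production you'd want an outbox or queue here rather than two
    # independent writes that can half-fail.
    write_postgres(doc_id, body, embedding)  # hypothetical helper
    write_qdrant(doc_id, body, embedding)    # hypothetical helper

def validated_retrieve(query_embedding: list[float], k: int = 10):
    fresh = baseline_search(query_embedding, k)      # recency tier
    deep = high_recall_search(query_embedding, k)    # semantic-depth tier
    fresh_ids = {doc_id for doc_id, _ in fresh}
    deep_ids = {hit_id for hit_id, _, _ in deep}
    agreement = len(fresh_ids & deep_ids) / max(k, 1)
    # Low agreement between the layers is the signal to re-query or flag
    # the step before the agent commits to a decision that compounds.
    return fresh, deep, agreement
```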
The operational triggers for building this kind of system are concrete. If your agent sessions generate more than 100 queries, you need purpose-built retrieval. If your corpus changes faster than daily, you need tiering. If you're seeing tail latency on retrieval queries during peak load, you're already undersized. If agents are making different decisions in batches run hours apart because the corpus shifted, you need refresh guarantees.
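Those triggers are straightforward to encode as alert thresholds. A sketch; the metric names and the latency SLO are assumptions, while the 100-query and daily-change numbers come from the paragraph above:

```python
def check_triggers(metrics: dict[str, float]) -> list[str]:
    # Metric names are hypothetical; thresholds follow the triggers above,
    # except the latency SLO, which is an assumed example value.
    alerts = []
    if metrics["queries_per_session"] > 100:
        alerts.append("sessions exceed 100 queries: add purpose-built retrieval")
    if metrics["corpus_change_interval_hours"] < 24:
        alerts.append("corpus changes faster than daily: tier your retrieval")
    if metrics["p99_retrieval_latency_ms"] > 200:  # assumed SLO
        alerts.append("tail latency at peak: retrieval layer is undersized")
    if metrics["decision_drift_rate"] > 0:
        alerts.append("decisions drift between batches: add refresh guarantees")
    return alerts
```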
Building retrieval infrastructure that survives agent query patterns — not human query patterns — means treating parallelism, recall under continuous ingestion, and compounding failure modes as first-class engineering problems. The era of "vector database handles everything" was shorter than most people expected.
Agents didn't make retrieval obsolete. They made it harder to get right, and that's what matters for any team running agents in production.
https://venturebeat.com/data/agents-dont-replace-vector-search-they-make-it-harder-to-get-right