Applied AI · April 2026 · 6 min read

How to implement RAG on enterprise data: the honest guide

The tutorials assume a clean corpus. Your corpus is not clean. This changes everything. A practical guide for technical decision-makers who have already read the tutorials and found them useless for their actual situation.

By Corrado Patierno

1. The real problem

Most enterprise data is not a clean PDF folder. It is a mix of scanned documents with bad OCR, Excel files used as databases, email threads with critical decisions buried on page 3, ERP exports in proprietary formats, and SharePoint sites where the folder structure reflects six years of organisational change.

Every RAG tutorial starts with "load your documents into a vector database." This assumes the documents are identified, accessible, readable, and reasonably structured. In most Italian mid-market organisations, none of these assumptions hold. The gap between what the tutorials describe and what you actually face is not a minor inconvenience — it is the entire project risk.

2. Start with questions, not infrastructure

The first mistake is asking "which vector database should we use?" before asking "what decisions do we want to make faster, and where does the information that would inform those decisions currently live?"

Map the questions first. Then work backwards to the documents. Then to the retrieval architecture. I have seen teams spend three months building an ingestion pipeline for a document corpus that turned out to be irrelevant to the actual business questions. The infrastructure was excellent. The ROI was zero.

The right starting point is a whiteboard session with the people who make decisions: what do you need to know, how quickly, and where do you currently find it? The answers to these questions determine your corpus, your chunking strategy, and your latency requirements — not the other way around.

3. The discovery phase — why skipping it costs three months

A two-week document discovery sprint — inventory, access mapping, quality assessment, format audit — sounds like overhead. It is not. Every project that skips it hits the same wall six weeks in: the corpus is worse than expected, access is restricted in unexpected ways, and the chunking strategy built on assumptions needs to be rebuilt from scratch.

What the discovery sprint actually produces: a complete inventory of document sources with format, access method, and quality score. A map of which documents answer which business questions. A realistic assessment of OCR quality, language mix, and structural consistency. An access audit — who owns what, what requires credentials, what is behind VPN.
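The inventory the sprint produces can be as simple as a structured record per source. A minimal sketch, assuming a hypothetical schema (the field names and quality scale are illustrative, not a standard):

```python
from dataclasses import dataclass, field

# Hypothetical record type for the discovery-sprint inventory.
# Field names and the 1-5 quality scale are illustrative choices.
@dataclass
class DocumentSource:
    name: str                  # e.g. "SharePoint /quality/procedures"
    fmt: str                   # "pdf-scanned", "xlsx", "email", "erp-export"
    access: str                # "open", "credentials", "vpn"
    quality: int               # 1 (unusable OCR) .. 5 (clean text)
    questions: list = field(default_factory=list)  # business questions it answers

def ready_for_ingestion(src: DocumentSource, min_quality: int = 3) -> bool:
    """A source is worth ingesting only if it is readable and
    actually maps to at least one business question."""
    return src.quality >= min_quality and len(src.questions) > 0

sources = [
    DocumentSource("quality-manual", "pdf-scanned", "open", 4, ["audit prep"]),
    DocumentSource("old-erp-export", "erp-export", "vpn", 2, []),
]
usable = [s.name for s in sources if ready_for_ingestion(s)]
# usable == ["quality-manual"]: the ERP export fails on both quality and relevance
```

The point of the `questions` field is the mapping described above: a source with no business question attached is infrastructure work with zero ROI, however clean its text is.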

Discovery is not a consulting deliverable. It is insurance. The two weeks you spend here save you three months of rework later.

4. Chunking strategy matters more than model choice

The retrieval quality of a RAG system is determined more by how documents are chunked than by whether you use GPT-4 or an open-source model. This is the single most underappreciated fact in enterprise RAG.

Fixed-size chunking (512 or 1024 tokens) works for homogeneous corpora — blog posts, support tickets, product descriptions. For technical documentation, legal contracts, and regulatory texts — documents with internal structure that matters — semantic chunking that respects section boundaries outperforms fixed-size by a significant margin.
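Fixed-size chunking is trivial to implement, which is part of why it is overused. A minimal sketch over a token list, with overlap so that sentences cut at a boundary still appear whole in one chunk:

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap.
    Adequate for homogeneous corpora; discards section boundaries,
    which is exactly why it underperforms on structured documents."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

chunks = fixed_size_chunks(list(range(1000)), size=512, overlap=64)
# 3 chunks; chunk 2 starts at token 448, repeating the last 64 tokens of chunk 1
```

Note what this code cannot know: where a clause ends, which heading a paragraph belongs to, whether a table was split in half. That lost structure is what the semantic strategies below recover.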

Concrete guidance for three document types we work with regularly:

  • Technical norms (ISO, UNI, EN): Chunk by clause. Preserve the full clause hierarchy in metadata (e.g. "ISO 9001:2015, Section 7.1.5, Clause 7.1.5.2"). Overlap 1 clause above for context. Typical chunk size: 300–800 tokens.
  • Contracts and legal documents: Chunk by article or sub-article. Enrich metadata with party names, dates, and cross-references. Overlap is critical — contract clauses reference each other constantly. Typical chunk size: 400–1000 tokens.
  • Operational manuals: Chunk by procedure step or section. Preserve step ordering in metadata. Include the procedure title and scope in every chunk. Typical chunk size: 200–600 tokens.
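For the normative-text case, clause-based chunking can be sketched as a split on numbered headings, with the clause path carried in metadata. The heading regex below is an assumption for illustration; real ISO/UNI layouts vary enough that a tuned parser is needed in practice:

```python
import re

def chunk_by_clause(text, doc_id):
    """Split a normative text on numbered clause headings (e.g. "7.1.5 Monitoring")
    and attach the clause path as metadata. Sketch only: the heading pattern
    and metadata keys are illustrative, and overlap with the parent clause
    (as recommended above) is left out for brevity."""
    heading = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$", re.MULTILINE)
    matches = list(heading.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "text": text[start:end].strip(),
            "metadata": {"doc": doc_id, "clause": m.group(1), "title": m.group(2)},
        })
    return chunks

sample = "7.1 Resources\nGeneral requirements.\n7.1.5 Monitoring\nCalibration rules.\n"
chunks = chunk_by_clause(sample, "ISO 9001:2015")
# 2 chunks, with metadata clause paths "7.1" and "7.1.5"
```

The metadata is what makes retrieval useful downstream: a hit on clause 7.1.5 can be cited as "ISO 9001:2015, 7.1.5" rather than as an anonymous text fragment.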

5. How to evaluate retrieval quality before going live

Deploying without evaluation is how you end up with a system that feels impressive in demos and fails on real queries. Three evaluation approaches that work in practice:

  • MRR (Mean Reciprocal Rank) on a curated set of question-document pairs. Build this set during discovery — the subject matter experts who know where answers live are the ones who should create it. Aim for at least 50 pairs covering the full query distribution.
  • Recall@k: for the 50 most important queries, does the right chunk appear in the top k results? If your top-5 recall is below 85%, your chunking or embedding strategy needs work before you touch the generation layer.
  • Human eval on edge cases: the queries where the answer is in an unusual location, or requires reasoning across two documents, or involves a negation. These are the queries that will erode user trust if they fail. Test them explicitly.
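The first two metrics are a few lines of code once the curated question-document pairs exist. A minimal sketch, assuming each query has exactly one relevant chunk (multi-relevant queries need a set-valued variant):

```python
def mrr(ranked_results, relevant):
    """Mean Reciprocal Rank.
    ranked_results: {query: [chunk_id, ...]} in retrieval order.
    relevant: {query: chunk_id} from the curated evaluation set."""
    total = 0.0
    for query, results in ranked_results.items():
        target = relevant[query]
        if target in results:
            total += 1.0 / (results.index(target) + 1)
    return total / len(ranked_results)

def recall_at_k(ranked_results, relevant, k=5):
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(1 for q, r in ranked_results.items() if relevant[q] in r[:k])
    return hits / len(ranked_results)

ranked = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold = {"q1": "b", "q2": "m"}
# mrr(ranked, gold) == 0.25   (1/2 for q1, 0 for the miss on q2)
# recall_at_k(ranked, gold, k=5) == 0.5
```

Run both against the 50-pair set after every chunking or embedding change; a regression here is cheaper to catch than a failed query in production.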

Don't go live below a threshold you have agreed in advance. Define the threshold before you start evaluating, not after you see the numbers.

6. Governance from day one

Access control, audit logging, and model version pinning are not features you add after the system works. They are constraints that shape the architecture. If you build without them, you will retrofit them under pressure — and the retrofit will be incomplete.

Minimum viable governance for a production RAG system: who can query what corpus, logged at query time. Which model version produced which answer, immutably stored. A process for updating the model that includes regression testing against the evaluation set. A data retention policy that specifies how long query logs and generated answers are kept.
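The "which model produced which answer, immutably stored" requirement reduces to writing a structured record per query. A minimal sketch with illustrative field names (a production system would append these to write-once storage rather than return them):

```python
import datetime
import hashlib
import json

def audit_record(user, corpus, query, model_version, answer):
    """Build one audit-log entry: who queried which corpus, when,
    which model version answered, and a digest of the answer so the
    stored log can later be checked against the answer shown to the user.
    Field names are illustrative, not a standard schema."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "corpus": corpus,
        "query": query,
        "model_version": model_version,
        "answer_sha256": hashlib.sha256(answer.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)

entry = audit_record("m.rossi", "contracts", "notice period for supplier X?",
                     "gpt-4-0613", "Per Article 12, the notice period is 90 days.")
```

Storing a hash rather than the full answer is a design choice, not a requirement: it keeps the log small and tamper-evident, but if your retention policy allows it, storing the full answer makes the "why did the system say that?" conversation much shorter.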

In regulated industries — healthcare, pharma, finance — this is not optional. In every other industry, it is still a good idea. The first time someone asks "why did the system give that answer?" you will be glad you logged it.

7. On-premise vs cloud — the honest trade-offs

Cloud RAG is faster to start and slower to govern. On-premise is slower to start and easier to govern at scale. The break-even depends on three factors:

  • Data sensitivity: if your corpus contains patient data, financial records, or trade secrets, the compliance overhead of cloud deployment often exceeds the operational overhead of on-premise.
  • Regulatory requirements: GDPR, AI Act, DORA, Legge 132/2025 — each adds constraints on where data can be processed and how processing must be documented. On-premise simplifies compliance by eliminating third-party data processing agreements.
  • Team capability: if your IT team can run containers and manage GPU nodes, on-premise is viable. If they cannot, cloud is the pragmatic choice until you build that capability — but build it, because the cost curve favours on-premise for sustained workloads.

Neither option is universally better. Anyone who tells you otherwise is selling something.

10 questions to answer before you start

  1. What business decisions should this system make faster, and for whom?
  2. Where does the data that informs those decisions currently live, and in what format?
  3. Who owns that data, and what access restrictions apply?
  4. What is the quality of the source documents — OCR accuracy, structural consistency, language mix?
  5. What regulatory requirements apply to the data and to the AI system processing it?
  6. What is the acceptable latency for a query response in production?
  7. How will you measure retrieval quality, and what is the minimum acceptable threshold?
  8. Who will maintain the system after deployment — update the corpus, retrain embeddings, monitor quality?
  9. What happens when the system gives a wrong answer — what is the blast radius?
  10. Can your infrastructure team run containers and manage GPU workloads, or do you need to build that capability first?

If you cannot answer at least seven of these with confidence, you are not ready to build. Start with the discovery sprint.

Applied AI · Enterprise Architecture · Data Platforms
