Production RAG Systems for Engineering Teams
Class Duration
14 hours of live training delivered over 2-3 days to accommodate your scheduling needs.
Student Prerequisites
- Professional software development experience in Python or TypeScript
- Basic familiarity with databases and REST APIs
Target Audience
Software engineers and ML engineers building internal knowledge bases, document Q&A systems, or AI-powered search features on top of organizational data. Relevant for teams that need to go beyond simple vector search demos and deploy reliable RAG systems that perform well on real enterprise data.
Description
This course covers production-grade RAG architecture from ingestion pipeline to deployed application. We go well beyond the naive chunk-embed-retrieve-generate pattern to cover the techniques that actually matter for production quality: advanced chunking (semantic, hierarchical, late chunking), hybrid search, cross-encoder reranking, query transformation, multi-turn conversation with retrieval, evaluation harnesses with golden datasets, and deployment patterns. Labs build a complete, evaluated RAG system on realistic enterprise-style document and data sources.
Learning Outcomes
- Design a production RAG ingestion pipeline with appropriate chunking and metadata strategies.
- Implement hybrid search combining dense retrieval and BM25 with result fusion.
- Apply cross-encoder reranking to improve retrieval precision.
- Use query transformation techniques (HyDE, query expansion, step-back) to improve recall.
- Build a RAG evaluation harness with a golden dataset, measuring RAGAS-style metrics.
- Identify and remediate common RAG failure modes from an evaluation run.
- Deploy a RAG API service with appropriate caching, cost controls, and observability.
Training Materials
Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.
Software Requirements
Python 3.12+, Docker, API keys for an embedding model and a frontier LLM, and Git.
Training Topics
RAG Architecture for Production
- Ingestion pipeline, retrieval pipeline, and generation layer
- Where naive RAG fails in production
- Design decisions that determine RAG quality
Advanced Chunking
- Semantic chunking with embedding similarity
- Hierarchical chunking: parent/child retrieval
- Late chunking for long-context models
- Contextual Retrieval: prepending an LLM-generated context summary to each chunk before embedding (introduced by Anthropic, 2024)
- Late chunking vs. Contextual Retrieval: efficiency vs. relevance tradeoff
- Metadata extraction and storage
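To preview the semantic chunking approach covered in this module, here is a minimal sketch of greedy similarity-based merging. The bag-of-words embedding and the 0.3 threshold are toy stand-ins for illustration only; in practice you would call a real embedding model and tune the threshold on your corpus.

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy stand-in for an embedding model: bag-of-words counts.
    # In the labs this is replaced by a real embedding API call.
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[str]:
    # Greedily merge consecutive sentences while they stay similar to the
    # running chunk; start a new chunk when similarity drops below threshold.
    chunks: list[str] = []
    current = sentences[0]
    for sent in sentences[1:]:
        if cosine(embed(current), embed(sent)) >= threshold:
            current += " " + sent
        else:
            chunks.append(current)
            current = sent
    chunks.append(current)
    return chunks
```

The same greedy loop generalizes to paragraph-level units; the class discusses how chunk-boundary choices interact with retrieval quality.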
Vector Stores and Indexing
- pgvector, Qdrant, Weaviate, and Pinecone comparison
- Index configuration: HNSW vs. IVF tradeoffs
- Incremental and batch indexing pipelines
- Multi-tenancy and access control
Hybrid Search
- BM25 keyword search alongside dense retrieval
- Reciprocal Rank Fusion and score normalization
- Sparse-dense hybrid index options
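The fusion step above can be sketched in a few lines. Reciprocal Rank Fusion combines ranked lists from BM25 and dense retrieval without needing to normalize their incompatible scores; k=60 is the commonly used default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of document IDs ordered best-first.
    # RRF score for a document: sum over rankings of 1 / (k + rank),
    # with rank starting at 1. Documents missing from a ranking
    # simply contribute nothing from that ranking.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the score-normalization problem entirely, which is why it is a common first choice before trying weighted score fusion.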
Reranking and Query Transformation
- Cross-encoder rerankers: quality vs. latency
- HyDE (Hypothetical Document Embeddings)
- Query expansion and step-back prompting
- Multi-query retrieval for robustness
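The two-stage retrieve-then-rerank pattern from this module looks roughly like the following. Both scorers here are toy stand-ins: the first-stage scorer substitutes for dense retrieval or BM25, and cross_encoder_score substitutes for a real cross-encoder model that scores each (query, document) pair jointly.

```python
def lexical_score(query: str, doc: str) -> float:
    # Cheap first-stage scorer (stand-in for dense retrieval / BM25).
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder model. Here: query-term overlap
    # dampened by document length, purely for illustration.
    q = set(query.lower().split())
    d = doc.lower().split()
    return len(q & set(d)) / (1 + len(d))

def retrieve_then_rerank(query: str, docs: list[str],
                         first_k: int = 10, final_k: int = 3) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus to get candidates.
    candidates = sorted(docs, key=lambda d: lexical_score(query, d),
                        reverse=True)[:first_k]
    # Stage 2: expensive pairwise scoring over candidates only.
    reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                      reverse=True)
    return reranked[:final_k]
```

The shape is what matters: the expensive scorer only ever sees first_k candidates, which is how production systems keep reranking latency bounded.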
Multi-Turn RAG
- Conversation-aware retrieval with history
- Context window management over turns
- Follow-up query resolution
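A common pattern for follow-up query resolution is to have an LLM condense the conversation plus the follow-up into a standalone search query before retrieval. The sketch below only builds such a prompt; the LLM call itself is omitted, and the prompt wording is illustrative rather than a fixed standard.

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    # history is a list of (user_message, assistant_message) pairs.
    # The returned prompt asks an LLM to rewrite the follow-up as a
    # standalone query, resolving pronouns against the conversation.
    lines: list[str] = []
    for user_msg, assistant_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    transcript = "\n".join(lines)
    return (
        "Rewrite the final user question as a standalone search query, "
        "resolving any pronouns or references using the conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone query:"
    )
```

The rewritten query, not the raw follow-up, is what gets embedded and retrieved against, which is why this step sits in front of the retrieval pipeline.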
RAG Evaluation
- Evaluation dimensions: faithfulness, answer relevance, context recall
- RAGAS-style metrics and tooling
- Building a golden dataset
- Running evals in CI to catch regressions
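As a taste of the evaluation harness built in the labs, here is a deliberately simplified context recall metric: the fraction of golden (ground-truth) contexts that the retriever actually returned. Real RAGAS-style context recall uses LLM judgments rather than exact matching; exact match is used here only to keep the sketch self-contained.

```python
def context_recall(golden_contexts: list[str], retrieved: list[str]) -> float:
    # Fraction of golden contexts present in the retrieved set.
    # Exact-match simplification of RAGAS-style context recall.
    if not golden_contexts:
        return 1.0
    hits = sum(1 for ctx in golden_contexts if ctx in retrieved)
    return hits / len(golden_contexts)
```

Averaged over a golden dataset, a metric like this gives the regression signal that the CI evaluation runs in this module are built around.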
Deployment and Operations
- RAG API service architecture
- Caching: embedding cache and response cache
- Cost attribution per request
- Monitoring retrieval quality in production
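The embedding cache discussed above can be as simple as a hash-keyed store in front of the embedding client. The embed_fn below is injected so the cache stays generic across providers; a production version would add eviction and persistence.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    # Caches embeddings keyed by a hash of the input text, so repeated
    # requests for the same text skip the paid embedding API call.
    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

The hit/miss counters feed directly into the cost-attribution and monitoring topics above: cache hit rate is one of the cheapest cost levers in a RAG service.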
Workshop
- End-to-end RAG pipeline build and eval lab
- Failure mode remediation exercise
- Q&A session