
Evaluating AI Coding Assistants and LLM Apps

Class Duration

7 hours of live training delivered over 1-2 days to accommodate your scheduling needs.

Student Prerequisites

  • Professional software development experience
  • Basic familiarity with testing concepts (unit tests, CI pipelines)

Target Audience

Software engineers, ML engineers, and engineering managers who need to measure the actual quality of AI-assisted output — not just anecdotes — and build systematic evals into their development process. Equally relevant for teams evaluating whether to adopt or switch AI tools, and for developers building LLM-powered features who need to prevent quality regressions.

Description

This course treats LLM evaluation as a first-class engineering discipline. We cover the design and implementation of eval harnesses for both AI coding assistant workflows and LLM-powered applications: golden dataset construction, automated scoring (LLM-as-judge, unit test pass rate, assertion-based), regression detection in CI, and human evaluation design. Participants build a working eval pipeline for at least one realistic scenario during the labs.

Learning Outcomes

  • Describe the dimensions of LLM output quality relevant to code generation and application responses.
  • Build a golden dataset for a target task with appropriate input/output pairs and labeling criteria.
  • Implement LLM-as-judge scoring with calibration and inter-rater reliability assessment.
  • Write assertion-based evals for structured output and functional correctness.
  • Integrate an eval suite into a CI pipeline to catch regressions on model or prompt changes.
  • Analyze eval results to identify systematic failure modes.
  • Design a human evaluation study for tasks that resist automated scoring.

Training Materials

Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.

Software Requirements

Python 3.12+, an API key for at least one frontier model (used in the LLM-as-judge labs), and Git.

Training Topics

Evaluation Fundamentals
  • Why anecdotal assessment fails at scale
  • Dimensions of quality: correctness, helpfulness, safety, style
  • Automated vs. human evaluation tradeoffs
  • The eval pyramid: fast unit evals to slow human evals
Golden Dataset Construction
  • Input selection and diversity criteria
  • Output labeling: reference answers, rubrics, and criteria
  • Labeling tools and workflows
  • Dataset versioning and maintenance
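
For illustration, a minimal sketch of how a golden dataset might be stored and loaded in the labs. The JSONL layout and the field names (id, input, reference, rubric) are assumptions for this example, not a required schema.

    # golden_dataset.py -- illustrative sketch; the JSONL field names are assumptions
    import json
    from pathlib import Path

    def load_golden_dataset(path: str) -> list[dict]:
        """Load one labeled example per line from a JSONL file."""
        return [json.loads(line)
                for line in Path(path).read_text().splitlines()
                if line.strip()]

    # Example entry (one line in golden.jsonl):
    # {"id": "sum-001",
    #  "input": "Summarize this changelog entry in one sentence: ...",
    #  "reference": "Adds retry logic to the upload client.",
    #  "rubric": "Must mention retries; must not introduce new facts."}
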
LLM-as-Judge Scoring
  • Designing LLM judge prompts
  • Calibration against human labels
  • Pairwise vs. absolute scoring
  • Detecting and mitigating judge biases
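
For illustration, a minimal sketch of an LLM-as-judge scorer. The call_model argument is a hypothetical stand-in for whichever provider client you bring to the labs, and the 1-5 rubric scale is an assumption for this example.

    # judge.py -- illustrative sketch; call_model is a hypothetical model-client callable
    import re
    from typing import Callable

    JUDGE_PROMPT = """You are grading a model response against a rubric.
    Rubric: {rubric}
    Question: {question}
    Response: {response}
    Score the response from 1 (fails the rubric) to 5 (fully satisfies it).
    Reply with only the integer score."""

    def judge_score(question: str, response: str, rubric: str,
                    call_model: Callable[[str], str]) -> int:
        """Ask a judge model for a 1-5 score and parse the integer from its reply."""
        prompt = JUDGE_PROMPT.format(rubric=rubric, question=question, response=response)
        reply = call_model(prompt)
        match = re.search(r"[1-5]", reply)
        if match is None:
            raise ValueError(f"Unparseable judge reply: {reply!r}")
        return int(match.group())

Calibration then amounts to running judge_score over a sample that already has human labels and measuring agreement before trusting the judge at scale.
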
Assertion-Based Evals
  • Exact match, regex, and substring assertions
  • Unit test pass rate as an eval metric
  • JSON Schema validation for structured outputs
  • Combining assertions for composite scores
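
For illustration, a minimal sketch that combines the assertion styles above (JSON Schema validation, exact match, and a regex check) into one composite score. The schema, the regex, and the equal weighting are assumptions for this example, and the jsonschema package is assumed to be installed.

    # assertions.py -- illustrative sketch; schema, regex, and weights are assumptions
    import json
    import re
    from jsonschema import ValidationError, validate

    RESPONSE_SCHEMA = {
        "type": "object",
        "required": ["summary", "tags"],
        "properties": {
            "summary": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
    }

    def score_output(output: str, expected_summary: str) -> float:
        """Return a 0..1 composite score from three independent assertions."""
        try:
            payload = json.loads(output)
            validate(payload, RESPONSE_SCHEMA)          # structured-output assertion
        except (json.JSONDecodeError, ValidationError):
            return 0.0                                  # malformed output fails outright
        checks = [
            payload["summary"] == expected_summary,                    # exact match
            bool(re.search(r"\bretry\b", payload["summary"], re.I)),  # regex/substring
        ]
        return (1 + sum(checks)) / 3                    # schema pass plus two boolean checks
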
Eval Harness Implementation
  • Eval runner architecture: dataset → model → scorer → report
  • Parallelizing eval runs for speed
  • Caching model responses during development
  • Eval framework options: Braintrust, Promptfoo, custom
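
For illustration, a minimal sketch of the dataset → model → scorer → report loop described above. The run_model and score callables are placeholders for whatever client and scorer you plug in; caching and parallel execution are left out to keep the shape visible.

    # harness.py -- illustrative sketch of a dataset -> model -> scorer -> report runner
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalResult:
        example_id: str
        score: float

    def run_eval(dataset: list[dict],
                 run_model: Callable[[str], str],
                 score: Callable[[dict, str], float]) -> list[EvalResult]:
        """Run the model on every example and score each output."""
        return [EvalResult(ex["id"], score(ex, run_model(ex["input"]))) for ex in dataset]

    def report(results: list[EvalResult]) -> None:
        """Print per-example scores and the aggregate mean."""
        for r in results:
            print(f"{r.example_id}: {r.score:.2f}")
        print(f"mean: {sum(r.score for r in results) / len(results):.2f}")
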
CI Integration for Regression Detection
  • Running evals on prompt or model changes
  • Setting pass/fail thresholds
  • Delta reporting: regression vs. improvement
  • Cost budget for CI evals
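
For illustration, a minimal sketch of a CI gate that compares the current eval run against a stored baseline and fails the build on regression. The file paths, JSON layout, and 0.02 tolerance are assumptions for this example.

    # ci_gate.py -- illustrative sketch; paths, JSON layout, and tolerance are assumptions
    import json
    import sys

    REGRESSION_TOLERANCE = 0.02   # allow up to a 0.02 drop in mean score before failing

    def main() -> int:
        with open("evals/baseline.json") as f:
            baseline = json.load(f)["mean_score"]
        with open("evals/current.json") as f:
            current = json.load(f)["mean_score"]
        delta = current - baseline
        print(f"baseline={baseline:.3f} current={current:.3f} delta={delta:+.3f}")
        if delta < -REGRESSION_TOLERANCE:
            print("Eval regression exceeds tolerance; failing the build.")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())
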
Evaluating Coding Assistants Specifically
  • Task-based eval: acceptance rate, edit distance, correctness
  • Measuring impact on cycle time and review pass rate
  • Privacy and data handling for real-codebase evals
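
For illustration, a minimal sketch of one metric from the list above: how closely the committed code follows the assistant's suggestion, measured with difflib from the standard library. Treating the SequenceMatcher ratio as an edit-distance proxy is an assumption for this example.

    # edit_distance.py -- illustrative sketch; the 0..1 similarity metric is an assumption
    import difflib

    def suggestion_retention(suggested: str, committed: str) -> float:
        """1.0 means the suggestion was kept verbatim; values near 0.0 mean it was rewritten."""
        return difflib.SequenceMatcher(None, suggested, committed).ratio()

    # Example: a suggestion the developer lightly edited before committing
    print(suggestion_retention("def add(a, b):\n    return a + b\n",
                               "def add(a: int, b: int) -> int:\n    return a + b\n"))
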
Human Evaluation Design
  • When automated evals are insufficient
  • Study design: sample size, evaluator diversity, instructions
  • Inter-rater reliability measurement
  • Efficient human-in-the-loop eval workflows
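
For illustration, a minimal sketch of an inter-rater reliability check for two raters using Cohen's kappa. The scikit-learn dependency and the sample labels are assumptions for this example; more than two raters would call for Fleiss' kappa or Krippendorff's alpha instead.

    # reliability.py -- illustrative sketch; assumes scikit-learn; labels are made-up sample data
    from sklearn.metrics import cohen_kappa_score

    # Pass/fail judgments from two human raters on the same ten responses
    rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"Cohen's kappa: {kappa:.2f}")   # values above ~0.8 are conventionally read as strong agreement
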
Workshop
  • Build and run an eval harness on a target task
  • CI integration exercise
  • Q&A session