Evaluating AI Coding Assistants and LLM Apps
Class Duration
7 hours of live training delivered over 1-2 days to accommodate your scheduling needs.
Student Prerequisites
- Professional software development experience
- Basic familiarity with testing concepts (unit tests, CI pipelines)
Target Audience
Software engineers, ML engineers, and engineering managers who need to measure the actual quality of AI-assisted output — not just anecdotes — and build systematic evals into their development process. Equally relevant for teams evaluating whether to adopt or switch AI tools, and for developers building LLM-powered features who need to prevent quality regressions.
Description
This course treats LLM evaluation as a first-class engineering discipline. We cover the design and implementation of eval harnesses for both AI coding assistant workflows and LLM-powered applications: golden dataset construction, automated scoring (LLM-as-judge, unit test pass rate, assertion-based), regression detection in CI, and human evaluation design. Participants build a working eval pipeline for at least one realistic scenario during the labs.
Learning Outcomes
- Describe the dimensions of LLM output quality relevant to code generation and application responses.
- Build a golden dataset for a target task with appropriate input/output pairs and labeling criteria.
- Implement LLM-as-judge scoring with calibration and inter-rater reliability assessment.
- Write assertion-based evals for structured output and functional correctness.
- Integrate an eval suite into a CI pipeline to catch regressions on model or prompt changes.
- Analyze eval results to identify systematic failure modes.
- Design a human evaluation study for tasks that resist automated scoring.
Training Materials
Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.
Software Requirements
Python 3.12+, an API key for at least one frontier model (used in the LLM-as-judge labs), and Git.
Training Topics
Evaluation Fundamentals
- Why anecdotal assessment fails at scale
- Dimensions of quality: correctness, helpfulness, safety, style
- Automated vs. human evaluation tradeoffs
- The eval pyramid: from fast unit evals to slow human evals
Golden Dataset Construction
- Input selection and diversity criteria
- Output labeling: reference answers, rubrics, and criteria (sketched below)
- Labeling tools and workflows
- Dataset versioning and maintenance
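As a concrete reference point, here is a minimal sketch of how golden dataset records might be stored and validated. The JSONL layout and the field names (id, input, reference, rubric, tags) are illustrative choices for the labs, not a required schema:

```python
import json
from pathlib import Path

# Hypothetical golden dataset entries: each record pairs an input with a
# reference answer and the rubric a labeler applied. Field names are illustrative.
GOLDEN = [
    {
        "id": "sql-001",
        "input": "Write a SQL query returning the ten most recent orders per customer.",
        "reference": "SELECT ... ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) ...",
        "rubric": "Must partition by customer, order by recency, and limit to 10 rows per customer.",
        "tags": ["sql", "window-functions"],
    },
]

def save_dataset(records: list[dict], path: Path) -> None:
    """Write one JSON object per line so diffs stay reviewable under version control."""
    with path.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")

def load_dataset(path: Path) -> list[dict]:
    """Load and minimally validate the dataset before an eval run."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    for rec in records:
        missing = {"id", "input", "rubric"} - rec.keys()
        if missing:
            raise ValueError(f"record {rec.get('id', '?')} is missing fields: {missing}")
    return records

if __name__ == "__main__":
    save_dataset(GOLDEN, Path("golden_v1.jsonl"))
    print(f"{len(load_dataset(Path('golden_v1.jsonl')))} records loaded")
```

Keeping the file line-oriented and key-sorted makes dataset changes show up as small, reviewable diffs, which is the versioning habit the labs reinforce.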
LLM-as-Judge Scoring
- Designing LLM judge prompts
- Calibration against human labels (sketched below)
- Pairwise vs. absolute scoring
- Detecting and mitigating judge biases
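A minimal sketch of a binary LLM judge with a first-pass calibration check against human labels. The openai client, model name, and prompt wording are assumptions for illustration; any chat-completion API works:

```python
from openai import OpenAI  # assumes the openai>=1.0 client; swap in your provider's SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a model answer against a rubric.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str, rubric: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the judge model for a binary verdict; temperature 0 for repeatability."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            rubric=rubric, question=question, answer=answer)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

def calibration(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the judge agrees with a human label: a first-pass calibration check."""
    agree = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return agree / len(human_labels)
```

Raw agreement is only a starting point; the course also covers chance-corrected measures and the judge biases (position, verbosity, self-preference) that agreement alone can hide.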
Assertion-Based Evals
- Exact match, regex, and substring assertions
- Unit test pass rate as an eval metric
- JSON Schema validation for structured outputs
- Combining assertions for composite scores
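A sketch of assertion-style checks and a simple composite scorer, assuming the jsonschema package and an illustrative response schema:

```python
import json
import re

from jsonschema import ValidationError, validate  # pip install jsonschema

RESPONSE_SCHEMA = {  # illustrative schema for a structured-output task
    "type": "object",
    "required": ["summary", "severity"],
    "properties": {
        "summary": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
}

def assert_exact(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def assert_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def assert_schema(output: str, schema: dict = RESPONSE_SCHEMA) -> bool:
    """Structured-output check: parse as JSON, then validate against a JSON Schema."""
    try:
        validate(instance=json.loads(output), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

def composite_score(output: str, checks: list) -> float:
    """Combine several assertions into one score in [0, 1]; equal weighting is just one design choice."""
    results = [check(output) for check in checks]
    return sum(results) / len(results)

# Usage: composite_score(model_output, [assert_schema, lambda o: assert_regex(o, r'"severity"')])
```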
Eval Harness Implementation
- Eval runner architecture: dataset → model → scorer → report (sketched below)
- Parallelizing eval runs for speed
- Caching model responses during development
- Eval framework options: Braintrust, Promptfoo, custom
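A minimal runner sketch for the dataset → model → scorer → report flow, with thread-pool parallelism and on-disk response caching; model_fn and scorer are hypothetical callables the harness user supplies:

```python
import concurrent.futures
import hashlib
from pathlib import Path

CACHE_DIR = Path(".eval_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_completion(model_fn, prompt: str) -> str:
    """Cache responses on disk so re-running during development does not re-hit the API."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.txt"
    if path.exists():
        return path.read_text()
    output = model_fn(prompt)
    path.write_text(output)
    return output

def run_eval(dataset: list[dict], model_fn, scorer, max_workers: int = 8) -> dict:
    """dataset -> model -> scorer -> report, with parallel model calls."""
    def one_case(case: dict) -> dict:
        output = cached_completion(model_fn, case["input"])
        return {"id": case["id"], "score": scorer(output, case)}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(one_case, dataset))

    mean = sum(r["score"] for r in results) / len(results)
    return {"mean_score": mean, "results": results}
```

Frameworks such as Braintrust and Promptfoo package the same loop with reporting and storage; the lab builds the bare version first so the framework features are legible.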
CI Integration for Regression Detection
- Running evals on prompt or model changes
- Setting pass/fail thresholds (sketched below)
- Delta reporting: regression vs. improvement
- Cost budgeting for CI evals
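One way a CI gate might look: a small script that compares the current eval report against a stored baseline and fails the build on a regression. The file names and thresholds are placeholders:

```python
import json
import sys
from pathlib import Path

THRESHOLD = 0.85        # absolute pass/fail floor
MAX_REGRESSION = 0.02   # allowed drop relative to the stored baseline

def main() -> int:
    current = json.loads(Path("eval_report.json").read_text())["mean_score"]
    baseline = json.loads(Path("eval_baseline.json").read_text())["mean_score"]
    delta = current - baseline

    print(f"score={current:.3f} baseline={baseline:.3f} delta={delta:+.3f}")
    if current < THRESHOLD:
        print("FAIL: below absolute threshold")
        return 1
    if delta < -MAX_REGRESSION:
        print("FAIL: regression vs. baseline")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Running this as a required check on prompt or model changes gives delta reporting for free: the printed line shows whether the change is a regression or an improvement before the build passes or fails.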
Evaluating Coding Assistants Specifically
- Task-based evals: acceptance rate, edit distance, correctness (sketched below)
- Measuring impact on cycle time and review pass rate
- Privacy and data handling for real-codebase evals
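A sketch of how task-based metrics such as acceptance rate, edit similarity, and test pass rate could be computed from eval records; the record fields are hypothetical:

```python
import difflib

def edit_similarity(suggested: str, final: str) -> float:
    """How much of the assistant's suggestion survived into the merged code (1.0 = accepted verbatim)."""
    return difflib.SequenceMatcher(None, suggested, final).ratio()

def summarize(records: list[dict]) -> dict:
    """records: [{'accepted': bool, 'suggested': str, 'final': str, 'tests_passed': bool}, ...]"""
    n = len(records)
    accepted = [r for r in records if r["accepted"]]
    return {
        "acceptance_rate": len(accepted) / n,
        "mean_edit_similarity": (
            sum(edit_similarity(r["suggested"], r["final"]) for r in accepted) / max(len(accepted), 1)
        ),
        "test_pass_rate": sum(r["tests_passed"] for r in records) / n,
    }
```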
Human Evaluation Design
- When automated evals are insufficient
- Study design: sample size, evaluator diversity, instructions
- Inter-rater reliability measurement (sketched below)
- Efficient human-in-the-loop eval workflows
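A small sketch of inter-rater reliability via Cohen's kappa for two raters grading the same items, with an illustrative example:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items: observed agreement corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Example: two evaluators grading the same eight model answers
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # ~0.43: moderate agreement, worth more labeling guidance
```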
Workshop
- Build and run an eval harness on a target task
- CI integration exercise
- Q&A session