Evaluation and Quality Assessment
COURSE

INR 59
0.0 Rating
📂 Artificial Intelligence (AI)

Description

This subject teaches systematic methods for evaluating prompt and model quality. Learners will use qualitative and quantitative metrics, design evaluation datasets, run human and automated assessments, and iterate on prompts using evidence-driven frameworks.

Learning Objectives

Upon completion of this subject, learners will be able to define quality criteria for their use cases; construct evaluation datasets and rubrics; apply prompt evaluation frameworks; measure accuracy, relevance, coherence, style, and cost; and run A/B tests that compare prompts, models, and configurations. They will know how to set up continuous evaluation loops that keep systems aligned as models and data change.

Topics (7)

1. Accuracy, Factuality, and Hallucination Assessment

This topic delves into factuality evaluation. It distinguishes factually correct, partially correct, unsupported, and contradicted assertions. Learners design question-answer pairs with known ground truth and measure how often the model matches, approximates, or fabricates information. They categorize hallucinations into types, such as fabricated citations, plausible but wrong details, or incorrect aggregation of correct facts. The topic explores mitigation strategies including retrieval-augmented generation (RAG), more cautious prompting, explicit uncertainty expressions, and using smaller local models for verification. Learners also consider domains where factuality is critical (medicine, law, finance) versus domains where creativity may be acceptable (fiction, ideation).
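
To make the ground-truth comparison concrete, the sketch below labels each model answer against a known reference as correct, partially correct, or unsupported. The QA pairs and the lenient matching rules are illustrative assumptions, not material taken from the course.

```python
# Minimal sketch of a factuality check against known ground truth.
# The QA pairs and normalisation rules here are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for lenient comparison."""
    return " ".join(text.lower().split())

def label_answer(model_answer: str, ground_truth: str) -> str:
    """Classify an answer as correct, partially correct, or unsupported."""
    answer, truth = normalize(model_answer), normalize(ground_truth)
    if answer == truth:
        return "correct"
    if truth in answer or answer in truth:
        return "partially correct"
    return "unsupported"

qa_pairs = [  # hypothetical evaluation set with known answers
    {"question": "Capital of France?", "truth": "Paris", "model": "Paris"},
    {"question": "Year the WWW was proposed?", "truth": "1989", "model": "In 1990"},
]

counts: dict[str, int] = {}
for item in qa_pairs:
    label = label_answer(item["model"], item["truth"])
    counts[label] = counts.get(label, 0) + 1

print(counts)  # e.g. {'correct': 1, 'unsupported': 1}
```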

2. Relevance, Coherence, and User-Centered Quality

This topic emphasizes alignment with user needs. Relevance metrics capture whether outputs actually answer the question asked, not a subtly different one. Coherence metrics assess whether multi-paragraph outputs stay logically consistent from beginning to end. Learners develop rubrics asking raters to score outputs on directness, focus, structure, and helpfulness. The topic explores the gap between technically correct answers and practically useful answers, and how prompts can be tuned to favor actionable, concise, or explanatory responses as required. Techniques for collecting user feedback in production, such as thumbs-up/down votes, free-text comments, and follow-up surveys, are covered and linked back to prompt refinement.
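
As a rough illustration of production feedback collection, the sketch below tallies hypothetical thumbs-up/down votes into a helpfulness rate per prompt version; the record format and field names are assumptions for this example.

```python
# Minimal sketch of aggregating thumbs-up/down feedback per prompt version.
# The feedback records and field names are hypothetical.

from collections import defaultdict

feedback = [
    {"prompt_version": "v1", "thumbs_up": True},
    {"prompt_version": "v1", "thumbs_up": False},
    {"prompt_version": "v2", "thumbs_up": True},
    {"prompt_version": "v2", "thumbs_up": True},
]

totals = defaultdict(lambda: {"up": 0, "all": 0})
for record in feedback:
    bucket = totals[record["prompt_version"]]
    bucket["all"] += 1
    bucket["up"] += int(record["thumbs_up"])

for version, bucket in totals.items():
    rate = bucket["up"] / bucket["all"]
    print(f"{version}: helpfulness rate {rate:.0%} ({bucket['all']} votes)")
```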

3. Prompt Evaluation Frameworks and Metrics

This topic provides a conceptual map of evaluation dimensions. It distinguishes intrinsic quality (e.g., factual correctness, logical coherence, grammaticality) from extrinsic quality (e.g., user satisfaction, task success, business impact). Learners study existing prompt evaluation frameworks from industry and research that propose structured rubrics for assessing outputs. They review metrics such as exact match, F1, and BLEU/ROUGE for text similarity, as well as more recent LLM-as-a-judge scoring, where a second model rates outputs against a rubric. The topic also explores safety-related metrics (toxicity, bias, jailbreak success rate) and operational metrics (latency, failure rate, cost-per-outcome). Learners come away with a vocabulary for precisely describing what “good” means in their context.
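
For instance, two of the similarity metrics named above, exact match and token-level F1, can be computed in a few lines; the whitespace tokenisation used here is a simplifying assumption.

```python
# Minimal sketch of exact match and token-level F1 for text similarity.
# Tokenisation is naive whitespace splitting, which real benchmarks refine.

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # 1.0
print(token_f1("the capital is Paris", "Paris"))  # 0.4
```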

4. Human Evaluation and Annotation Workflows

This topic explains how to structure human rating tasks so that different annotators can consistently apply a rubric. Learners write clear rating guidelines with examples of high-, medium-, and low-quality outputs for each criterion. They learn about inter-rater reliability metrics such as Cohen’s kappa and Krippendorff’s alpha and how to use them to validate annotation quality. The topic covers sampling strategies for which prompts and outputs to send to annotators, and mechanisms to detect spam or low-effort ratings. It also touches on ethical considerations in annotation work, including fair pay, psychological safety when reviewing harmful content, and transparent communication about task purpose.
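
As a small worked example, Cohen's kappa for two annotators can be computed directly from its definition, kappa = (p_o - p_e) / (1 - p_e); the labels below are hypothetical, and a real workflow would likely rely on an established statistics library.

```python
# Minimal sketch of Cohen's kappa for two annotators rating the same items.
# p_o is observed agreement, p_e is agreement expected by chance.

from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad"]  # hypothetical ratings
b = ["good", "bad", "bad", "good", "bad"]
print(round(cohens_kappa(a, b), 3))  # 0.615: substantial agreement beyond chance
```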

5. Automated Evaluation and LLM-as-a-Judge

This topic introduces automated evaluation methods where one model rates the outputs of another (or the same) model using a structured system prompt that encodes a rubric. It describes how to design judge prompts that ask for scores and explanations, how to calibrate thresholds, and how to combine multiple judge scores. It also covers rule-based checks such as validating JSON schema conformity, checking for forbidden phrases, or verifying simple numerical constraints. Learners examine limitations of LLM-as-a-judge approaches, such as preference biases or vulnerability to adversarial phrasing, and learn to triangulate with human evaluation rather than replacing it wholesale.
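
A minimal sketch of such rule-based checks might look like the following; the required keys and forbidden phrases are illustrative assumptions, not part of the course material.

```python
# Minimal sketch of rule-based output checks: JSON validity, required keys,
# and forbidden phrases. Schema and phrase list are illustrative only.

import json

REQUIRED_KEYS = {"answer", "confidence"}
FORBIDDEN_PHRASES = ["as an ai language model"]

def rule_based_check(raw_response: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    violations: list[str] = []
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    text = json.dumps(payload).lower()
    violations += [f"forbidden phrase: {p}" for p in FORBIDDEN_PHRASES if p in text]
    return violations

print(rule_based_check('{"answer": "42", "confidence": 0.9}'))  # []
print(rule_based_check('not json at all'))  # ['output is not valid JSON']
```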

6. Cost, Latency, and Efficiency Metrics

This topic covers operational metrics beyond output quality. Learners analyze API pricing models that charge per token, per call, or by model tier, and compute cost per 1,000 requests and cost per successful outcome. They measure latency from user input to model output, understand bottlenecks in network, retrieval, and generation, and define service-level objectives (SLOs). The topic shows how prompt length, number of examples, and choice of model tier affect both cost and speed. Learners experiment with techniques such as using cheaper models for triage, caching frequent results, and dynamically switching models based on task complexity. They also consider environmental and organizational constraints, such as rate limits and budget caps.
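
The cost arithmetic can be sketched as below; the per-token prices, token counts, and success rate are made-up figures, not real provider pricing.

```python
# Minimal sketch of cost per 1,000 requests and cost per successful outcome.
# All numbers are hypothetical placeholders.

price_per_1k_input_tokens = 0.0005   # assumed USD rate
price_per_1k_output_tokens = 0.0015  # assumed USD rate
avg_input_tokens = 800
avg_output_tokens = 300
success_rate = 0.85                  # fraction of requests judged successful

cost_per_request = (
    avg_input_tokens / 1000 * price_per_1k_input_tokens
    + avg_output_tokens / 1000 * price_per_1k_output_tokens
)
cost_per_1k_requests = cost_per_request * 1000
cost_per_successful_outcome = cost_per_request / success_rate

print(f"cost per 1,000 requests: ${cost_per_1k_requests:.2f}")        # $0.85
print(f"cost per successful outcome: ${cost_per_successful_outcome:.5f}")  # $0.00100
```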

7. Iterative Improvement and A/B Testing of Prompts

This topic describes how to apply experimentation techniques such as A/B testing and multi-armed bandits to prompt engineering. Learners define experimental units (user sessions, conversations, tasks), randomization schemes, and success metrics. They implement two or more prompt variants and compare their performance statistically, considering sample size, confidence intervals, and significance thresholds. The topic also covers challenges such as overlapping users across variants, time-based drift, and changes in upstream models that can confound results. Learners practice documenting experiment designs and outcomes, creating a knowledge base of what prompt patterns work best for which use cases.
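
As an illustration, a two-proportion z-test comparing the success rates of two prompt variants might look like the sketch below; the counts are hypothetical, and a real experiment would also verify sample-size assumptions, likely using a statistics library.

```python
# Minimal sketch of a two-proportion z-test for comparing prompt variants A and B.
# The success counts are hypothetical placeholders.

import math

def two_proportion_z(successes_a: int, total_a: int,
                     successes_b: int, total_b: int) -> float:
    """Return the z statistic for the difference in success rates of A and B."""
    p_a, p_b = successes_a / total_a, successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

z = two_proportion_z(successes_a=430, total_a=500, successes_b=405, total_b=500)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a significant difference at the 5% level
```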
