Evaluation and Quality Assessment
COURSE

INR 59
0.0 Rating
📂 Artificial Intelligence (AI)

Description

This subject teaches systematic methods for evaluating prompt and model quality. Learners will use qualitative and quantitative metrics, design evaluation datasets, run human and automated assessments, and iterate on prompts using evidence-driven frameworks.

Learning Objectives

Upon completion of this subject, learners will be able to define quality criteria for their use cases; construct evaluation datasets and rubrics; apply prompt evaluation frameworks; measure accuracy, relevance, coherence, style, and cost; and run A/B tests that compare prompts, models, and configurations. They will know how to set up continuous evaluation loops that keep systems aligned as models and data change.

Topics (7)

1. Accuracy, Factuality, and Hallucination Assessment

This topic delves into factuality evaluation. It distinguishes factually correct, partially correct, unsupported, and contradicted assertions. Learners design question-answer pairs with known ground truth and measure how often the model matches, approximates, or fabricates information. They categorize hallucinations into types, such as fabricated citations, plausible but wrong details, or incorrect aggregation of correct facts. The topic explores mitigation strategies including retrieval-augmented generation (RAG), more cautious prompting, explicit uncertainty expressions, and using smaller local models for verification. Learners also consider domains where factuality is critical (medicine, law, finance) versus domains where creativity may be acceptable (fiction, ideation).
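
To make the ground-truth comparison concrete, the sketch below labels each model answer against a known reference as correct, partially correct, or unsupported. The QA pairs and the lenient matching rules are illustrative assumptions, not material taken from the course.

```python
# Minimal sketch of a factuality check against known ground truth.
# The QA pairs and normalisation rules here are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for lenient comparison."""
    return " ".join(text.lower().split())

def label_answer(model_answer: str, ground_truth: str) -> str:
    """Classify an answer as correct, partially correct, or unsupported."""
    answer, truth = normalize(model_answer), normalize(ground_truth)
    if answer == truth:
        return "correct"
    if truth in answer or answer in truth:
        return "partially correct"
    return "unsupported"

qa_pairs = [  # hypothetical evaluation set with known answers
    {"question": "Capital of France?", "truth": "Paris", "model": "Paris"},
    {"question": "Year the WWW was proposed?", "truth": "1989", "model": "In 1990"},
]

counts: dict[str, int] = {}
for item in qa_pairs:
    label = label_answer(item["model"], item["truth"])
    counts[label] = counts.get(label, 0) + 1

print(counts)  # e.g. {'correct': 1, 'unsupported': 1}
```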

2. Relevance, Coherence, and User-Centered Quality

This topic emphasizes alignment with user needs. Relevance metrics capture whether outputs actually answer the question asked, not a subtly different one. Coherence metrics assess whether multi-paragraph outputs stay logically consistent from beginning to end. Learners develop rubrics asking raters to score outputs on directness, focus, structure, and helpfulness. The topic explores the gap between technically correct answers and practically useful answers, and how prompts can be tuned to favor actionable, concise, or explanatory responses as required. Techniques for collecting user feedback in production, such as thumbs-up/down votes, free-text comments, and follow-up surveys, are covered and linked back to prompt refinement.
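
As a rough illustration of production feedback collection, the sketch below tallies hypothetical thumbs-up/down votes into a helpfulness rate per prompt version; the record format and field names are assumptions for this example.

```python
# Minimal sketch of aggregating thumbs-up/down feedback per prompt version.
# The feedback records and field names are hypothetical.

from collections import defaultdict

feedback = [
    {"prompt_version": "v1", "thumbs_up": True},
    {"prompt_version": "v1", "thumbs_up": False},
    {"prompt_version": "v2", "thumbs_up": True},
    {"prompt_version": "v2", "thumbs_up": True},
]

totals = defaultdict(lambda: {"up": 0, "all": 0})
for record in feedback:
    bucket = totals[record["prompt_version"]]
    bucket["all"] += 1
    bucket["up"] += int(record["thumbs_up"])

for version, bucket in totals.items():
    rate = bucket["up"] / bucket["all"]
    print(f"{version}: helpfulness rate {rate:.0%} ({bucket['all']} votes)")
```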

3. Prompt Evaluation Frameworks and Metrics

This topic provides a conceptual map of evaluation dimensions. It distinguishes intrinsic quality (e.g., factual correctness, logical coherence, grammaticality) from extrinsic quality (e.g., user satisfaction, task success, business impact). Learners study existing prompt evaluation frameworks from industry and research that propose structured rubrics for assessing outputs. They review metrics such as exact match, F1, and BLEU/ROUGE for text similarity, as well as more recent LLM-as-a-judge scoring, where a second model rates outputs against a rubric. The topic also explores safety-related metrics (toxicity, bias, jailbreak success rate) and operational metrics (latency, failure rate, cost-per-outcome). Learners come away with a vocabulary for precisely describing what “good” means in their context.
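
For instance, two of the similarity metrics named above, exact match and token-level F1, can be computed in a few lines; the whitespace tokenisation used here is a simplifying assumption.

```python
# Minimal sketch of exact match and token-level F1 for text similarity.
# Tokenisation is naive whitespace splitting, which real benchmarks refine.

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # 1.0
print(token_f1("the capital is Paris", "Paris"))  # 0.4
```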

4. Human Evaluation and Annotation Workflows

This topic explains how to structure human rating tasks so that different annotators can consistently apply a rubric. Learners write clear rating guidelines with examples of high-, medium-, and low-quality outputs for each criterion. They learn about inter-rater reliability metrics such as Cohen’s kappa and Krippendorff’s alpha and how to use them to validate annotation quality. The topic covers sampling strategies for which prompts and outputs to send to annotators, and mechanisms to detect spam or low-effort ratings. It also touches on ethical considerations in annotation work, including fair pay, psychological safety when reviewing harmful content, and transparent communication about task purpose.
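
As a small worked example, Cohen's kappa for two annotators can be computed directly from its definition, kappa = (p_o - p_e) / (1 - p_e); the labels below are hypothetical, and a real workflow would likely rely on an established statistics library.

```python
# Minimal sketch of Cohen's kappa for two annotators rating the same items.
# p_o is observed agreement, p_e is agreement expected by chance.

from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad"]  # hypothetical ratings
b = ["good", "bad", "bad", "good", "bad"]
print(round(cohens_kappa(a, b), 3))  # 0.615: substantial agreement beyond chance
```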

5. Automated Evaluation and LLM-as-a-Judge

This topic introduces automated evaluation methods where one model rates the outputs of another (or the same) model using a structured system prompt that encodes a rubric. It describes how to design judge prompts that ask for scores and explanations, how to calibrate thresholds, and how to combine multiple judge scores. It also covers rule-based checks such as validating JSON schema conformity, checking for forbidden phrases, or verifying simple numerical constraints. Learners examine limitations of LLM-as-a-judge approaches, such as preference biases or vulnerability to adversarial phrasing, and learn to triangulate with human evaluation rather than replacing it wholesale.
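
A minimal sketch of such rule-based checks might look like the following; the required keys and forbidden phrases are illustrative assumptions, not part of the course material.

```python
# Minimal sketch of rule-based output checks: JSON validity, required keys,
# and forbidden phrases. Schema and phrase list are illustrative only.

import json

REQUIRED_KEYS = {"answer", "confidence"}
FORBIDDEN_PHRASES = ["as an ai language model"]

def rule_based_check(raw_response: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    violations: list[str] = []
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    text = json.dumps(payload).lower()
    violations += [f"forbidden phrase: {p}" for p in FORBIDDEN_PHRASES if p in text]
    return violations

print(rule_based_check('{"answer": "42", "confidence": 0.9}'))  # []
print(rule_based_check('not json at all'))  # ['output is not valid JSON']
```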

6. Cost, Latency, and Efficiency Metrics

This topic covers operational metrics beyond output quality. Learners analyze API pricing models that charge per token, per call, or by model tier, and compute cost per 1,000 requests and cost per successful outcome. They measure latency from user input to model output, understand bottlenecks in network, retrieval, and generation, and define service-level objectives (SLOs). The topic shows how prompt length, number of examples, and choice of model tier affect both cost and speed. Learners experiment with techniques such as using cheaper models for triage, caching frequent results, and dynamically switching models based on task complexity. They also consider environmental and organizational constraints, such as rate limits and budget caps.
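
The cost arithmetic can be sketched as below; the per-token prices, token counts, and success rate are made-up figures, not real provider pricing.

```python
# Minimal sketch of cost per 1,000 requests and cost per successful outcome.
# All numbers are hypothetical placeholders.

price_per_1k_input_tokens = 0.0005   # assumed USD rate
price_per_1k_output_tokens = 0.0015  # assumed USD rate
avg_input_tokens = 800
avg_output_tokens = 300
success_rate = 0.85                  # fraction of requests judged successful

cost_per_request = (
    avg_input_tokens / 1000 * price_per_1k_input_tokens
    + avg_output_tokens / 1000 * price_per_1k_output_tokens
)
cost_per_1k_requests = cost_per_request * 1000
cost_per_successful_outcome = cost_per_request / success_rate

print(f"cost per 1,000 requests: ${cost_per_1k_requests:.2f}")        # $0.85
print(f"cost per successful outcome: ${cost_per_successful_outcome:.5f}")  # $0.00100
```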

7. Iterative Improvement and A/B Testing of Prompts

This topic describes how to apply experimentation techniques such as A/B testing and multi-armed bandits to prompt engineering. Learners define experimental units (user sessions, conversations, tasks), randomization schemes, and success metrics. They implement two or more prompt variants and compare their performance statistically, considering sample size, confidence intervals, and significance thresholds. The topic also covers challenges such as overlapping users across variants, time-based drift, and changes in upstream models that can confound results. Learners practice documenting experiment designs and outcomes, creating a knowledge base of what prompt patterns work best for which use cases.
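
As an illustration, a two-proportion z-test comparing the success rates of two prompt variants might look like the sketch below; the counts are hypothetical, and a real experiment would also verify sample-size assumptions, likely using a statistics library.

```python
# Minimal sketch of a two-proportion z-test for comparing prompt variants A and B.
# The success counts are hypothetical placeholders.

import math

def two_proportion_z(successes_a: int, total_a: int,
                     successes_b: int, total_b: int) -> float:
    """Return the z statistic for the difference in success rates of A and B."""
    p_a, p_b = successes_a / total_a, successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

z = two_proportion_z(successes_a=430, total_a=500, successes_b=405, total_b=500)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a significant difference at the 5% level
```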
