KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.
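This assignment rule can be sketched in a few lines of NumPy (a minimal illustration with made-up points and centroids, not a full KMeans implementation):

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Label each point with the index of its nearest centroid."""
    # Pairwise squared Euclidean distances: shape (n_points, n_centroids)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_to_nearest(points, centroids))  # → [0 0 1]
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of its assigned points.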

While its simplicity often makes it the first choice among clustering algorithms, KMeans has many limitations that hinder its effectiveness in many scenarios.

For instance, KMeans does not account for cluster variance and shape.

In other words, one of the primary limitations of KMeans is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyper-spheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

The same can be observed in the following dataset:

This rigidity of KMeans, which restricts it to globular clusters, often leads to misclassification and suboptimal cluster assignments.

Another significant limitation of the KMeans clustering algorithm is its inherent assumption that every data point must be assigned to a cluster.

More specifically, KMeans operates under the premise that each data point belongs to one and only one cluster, leaving no room for the representation of noise or outliers.

This characteristic can be problematic in scenarios where the dataset contains irregularities, anomalies, or noise that do not conform to clear cluster boundaries.

As a result, KMeans may inadvertently assign data points to clusters even when they do not truly belong to any discernible pattern.

This is depicted in the image below:

This tendency can lead to suboptimal clustering results, particularly in datasets with varying densities, irregular shapes, or the presence of noisy data.

In real-world datasets where the underlying structure is not well-suited to the assumptions of KMeans, the algorithm’s rigid assignment of all data points to clusters can limit its effectiveness.

We should be mindful of this limitation and consider alternative clustering approaches, such as density-based or hierarchical methods, when dealing with datasets that exhibit significant noise or outliers.

One notable constraint of using the K-Means clustering algorithm is the requirement for prior knowledge or assumptions about the exact number of clusters in the dataset.

Unlike some other clustering algorithms that can automatically determine the optimal number of clusters based on the data's inherent structure, K-Means relies on a predefined value for the number of clusters (often denoted as 'k').

Of course, there are various methods to determine a reliable value for the parameter $k$, but this constraint still poses challenges in scenarios where the true number of clusters is not known in advance, leading to potential inaccuracies and suboptimal results.

On a side note, we commonly use the Elbow curve to determine the number of clusters (`k`) for KMeans.

However, the problem with the Elbow curve is that it:

- has a subjective interpretation
- involves ambiguity in determining the Elbow point accurately
- only considers within-cluster distances, among other issues.

Silhouette score is an alternative measure used to evaluate clustering quality, and it is typically found to be more reliable than the Elbow curve.

Measuring it across a range of centroids (`k`) can reveal which clustering results are most promising.
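As a sketch (using scikit-learn; the synthetic dataset and the range of `k` values are illustrative choices), sweeping the Silhouette score over candidate values of `k` might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # expected to peak at the true number of clusters (4 here)
```

Unlike eyeballing an elbow, the `k` with the highest average silhouette gives an objective selection criterion.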

The visual below compares the Elbow curve and the Silhouette plot:

It’s clear that the Elbow curve can be highly misleading and inaccurate here.

In a dataset with 25 clusters:

- The Elbow curve depicts **four** as the optimal number of clusters.
- The Silhouette curve depicts **25** as the optimal number of clusters.

We have already covered this in a couple of previous newsletter issues if you want to get into more details:

Coming back...

The above limitations of KMeans highlight why learning about other algorithms that address them is extremely important.

While KMeans comes under the centroid-based clustering algorithms, there are many different types of clustering algorithms (shown below) that can be used depending on the situation:

We have already covered distribution-based clustering before:

So the focus of this article is **Density-based clustering**, specifically DBSCAN++, which is a significant improvement on the DBSCAN clustering algorithm.

Let’s understand in detail!

But before understanding DBSCAN++, let’s understand what DBSCAN is.

DBSCAN, which stands for **Density-Based Spatial Clustering of Applications with Noise**, is a popular clustering algorithm in machine learning and data mining.

As the name suggests, the core idea behind DBSCAN is to group together data points based on “density”, i.e., points that are close to each other in a high-density region and are separated by lower-density regions.

It does a great job of seeking areas in the data with a high density of data points, versus areas of the data that are not very dense with observations.

The notion of density in DBSCAN lets it sort data into clusters of varying shapes as well, which is its substantial advantage over traditional clustering algorithms such as KMeans.

As a result, it immediately resolves each of the above-discussed limitations of the KMeans clustering algorithm.

- As it clusters based on the notion of “density”, the clusters may not necessarily have a globular shape. Instead, they may have arbitrary shapes, as depicted below:

- Unlike KMeans, DBSCAN does not allocate each data point to a cluster. This is because DBSCAN operates on a fundamentally different principle—density-based clustering. Outliers are regions of very low density, and we will see shortly that DBSCAN can quickly identify regions of low density and classify points in those regions as outliers.
- What’s more, another benefit of DBSCAN is that it doesn’t require specifying the number of clusters in advance. Instead, it identifies clusters based on the density of data points.

We will walk through a simple example to understand how the algorithm works.

Let’s say we have a dataset of points like the following:

Clearly, we have two distinct clusters and one noise point at the center. Our objective is to cluster these points into groups that are densely packed together.

Firstly, we count the number of points in the vicinity of each data point. For example, if we start with the green point, we draw a circle around it.

The radius `epsilon` of the circle is the first parameter that we must specify when using DBSCAN. In other words, this is a hyperparameter.

After drawing the circle, we count the number of data points that fall within that $\epsilon$ radius circle. For instance, for our green point, there are six close points.

Likewise, we count the number of close points for all remaining points.

After counting the number of data points in the vicinity of each data point, we classify every data point into one of the three categories:

**Core point**: A data point that has at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`. For instance, if `minPts=5`, the green point in the figure below is a core point:

**Border point**: A data point that does not have at least `minPts` data points (including the point itself) at a distance less than or equal to `epsilon`, but is in the vicinity of a core point. For instance, in the figure below, the green point is a border point.

**Noise point**: A data point that is neither a core point nor a border point is a noise point. For instance, in the figure below, the purple data point is a noise point:

And of course, as you may have already guessed, `minPts` is another hyperparameter of DBSCAN.

After classifying all the data points as core, border, or noise points, we proceed with clustering.

The idea is simple.

- Start with any core point (let’s call it `A`) and assign it to a cluster with cluster-ID, say `1` (to begin).

- All points in the vicinity of the above core point `A` will belong to the same cluster — whose cluster-ID is `1`.

- If a data point in the vicinity of the core point `A` is also a core point, as shown below...

- ...then we include data points in its vicinity in the same cluster — whose cluster-ID is `1`.

- The above step is executed **recursively** under the same cluster-ID until we cannot find a new core point.

- At this point, we may still have some core points in the dataset, but they may not belong to the previous cluster, as shown below. We repeat the same steps as above but with a new cluster ID this time.

The algorithm completes its execution once we are left with no core points.
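The core/border/noise classification described above can be sketched from scratch in NumPy (the points, `eps`, and `min_pts` below are made up for illustration):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # Pairwise Euclidean distances between all points
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = d <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():     # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],   # dense blob
              [2.2, 0.5],                                    # near the blob
              [10, 10]])                                     # isolated
print(classify_points(X, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```

The clustering step then just connects core points that are within `eps` of each other and attaches their border points.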

In DBSCAN, determining the `epsilon` parameter is often tricky. But the Elbow curve is often helpful in determining it.

To begin, as we discussed above, DBSCAN has two hyperparameters:

- `epsilon`: two points are considered neighbors if they are closer than `epsilon`.
- `minPts`: the minimum number of neighbors required for a point to be classified as a core point.

We can use the Elbow curve to find an optimal value of `epsilon`:

For every data point, plot the distance to its $k^{th}$ nearest neighbor (in increasing order), where $k$ is the `minPts` hyperparameter. This is called the k-distance plot.

The optimal value of `epsilon` is found near the elbow point.
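A minimal NumPy sketch of the k-distance computation behind this plot (the dataset is synthetic and the value of `k` is illustrative):

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)              # column 0 is the distance to itself (0.0)
    return np.sort(d[:, k])     # k-th neighbor, sorted for plotting

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # one dense cluster
               [[5.0, 5.0]]])                       # one isolated point
dists = k_distances(X, k=4)
# The sharp jump at the end of this curve (the "elbow") comes from the
# isolated point; epsilon is typically chosen just below that jump.
print(dists[-1] > 5 * dists[-2])  # → True
```

Plotting `dists` against the point index gives the k-distance plot described above.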

Why does it work?

Recall that in this plot, we are measuring the distance to a specific ($k^{th}$) neighbor for all points. Thus, the elbow point suggests a distance to a more isolated point or a point in a different cluster.

The point where change is most pronounced hints towards an optimal epsilon. The efficacy is evident from the image above.

As depicted above, selecting the elbow value provides better clustering results than an arbitrary value.

**Effective Density-Based Clustering**:

- Quite intuitively, DBSCAN excels in separating clusters of high density from those of low density within a given dataset.
- By focusing on the density of data points, it can identify clusters with varying shapes and sizes.

**Robust Outlier Handling**:

- DBSCAN is particularly adept at handling outliers within the dataset.
- As discussed earlier, a core point is itself an indicator of high density because of the many points in its vicinity. So if a data point has no core point close to it, we can intuitively say that it is an outlier.
- In fact, in some cases, DBSCAN is primarily employed as an outlier detection technique. The algorithm's ability to designate noise points as outliers contributes to a more comprehensive understanding of the data's structure.
- We don’t even have to proceed with clustering for outlier detection. Instead, we can just determine if a point is a noise point or not using the parameters `epsilon` and `minPts`.
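With scikit-learn's `DBSCAN`, noise points are simply those labeled `-1`, so outlier detection falls out of the clustering itself. A small sketch with illustrative data and parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense clusters plus one obvious outlier (illustrative data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

print(labels[-1])               # the isolated point gets the noise label: -1
print(len(set(labels) - {-1}))  # number of clusters found
```

No number of clusters was specified anywhere; DBSCAN discovered it from the density of the data.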

Density-based clustering algorithms have greatly impacted a wide range of areas in data analysis, including outlier detection, computer vision, and medical imaging.

In fact, in many practical applications, as data volumes rise, it can become increasingly difficult to collect labels for supervised learning.

**In such cases, non-parametric unsupervised algorithms are becoming increasingly important in understanding large datasets.**

However, one of the biggest issues with DBSCAN is its run time.

For a long time, DBSCAN was believed to have a run-time of $O(n \log n)$, until it was proven to be $O(n^2)$ in the worst case.

Thus, there is an increasing need to establish more efficient versions of DBSCAN.

Although we won’t get into much detail here, it can be proved that DBSCAN can run in $O(n \log n)$ when the dimension is at most $2$, which is rarely the case.

However, it quickly starts to exhibit quadratic behavior in high dimensions and/or when $n$ becomes large.

This can also be verified from the figure below:

In this experiment, we have a simple 2-dimensional dataset with a varying number of data points.

It is clear that DBSCAN possesses a quadratic run-time with respect to the number of data points.

The quadratic run-time for most density-based algorithms can be realized from the fact that DBSCAN implicitly must compute density estimates for each data point, which is linear time in the worst case for each query.

As discussed above, in the case of DBSCAN, such queries are proximity-based, which are computed for each pair of data points.

This is what gives it a quadratic nature of run-time performance.

DBSCAN++ is a step towards a fast and scalable DBSCAN.

Simply put, DBSCAN++ is based on the observation that...

In many practical machine learning (ML) projects, it is a common practice to consolidate the data at a central location.

Subsequently, machine learning engineers leverage this centralized data for:

- Analysis,
- Conducting feature engineering,
- And ultimately proceed with the model training, validation, scaling, deployment, and ongoing production monitoring.

This traditional method is widely accepted and employed in developing ML models.

Nevertheless, a notable challenge associated with this conventional approach is its requirement for data to be physically centralized before any subsequent processing can occur.

Let's understand the issues with this in detail!

Consider that our application has a user base of millions. It’s evident that the quantity of data to deal with can be extremely high.

This data is valuable because modern devices have access to a wealth of data that can be suitable for machine learning models.

This data can significantly improve the user experience on the device.

For instance:

- If it’s text data, then language models can improve speech recognition and text entry
- If it’s image data, then many downstream image models can be improved, and more.

However, the conventional machine learning approach, which involves aggregating all data in a central repository, presents many challenges in such situations.

More specifically, in this approach, transferring data from individual user devices to a central location is both bandwidth and time-intensive, discouraging users from participating.

Even if users were incentivized to contribute data, the redundancy of having the data on both the user's device and the central server could be logistically infeasible because of the amount of data we might be dealing with.

Moreover, the data often contains personal information such as photos, private texts, and voice notes.

Requesting users to upload such sensitive data not only jeopardizes privacy but also raises legal concerns. Storing such data in a centralized database becomes problematic, introducing feasibility issues and privacy violations.


Moving large amounts of data to a central server can be costly in terms of user bandwidth and time.

But the data is still valuable to us, isn’t it? We want to utilize it in some way.

**Federated learning** is an incredible machine learning model training technique that minimizes data transfer, making it suitable for low-bandwidth and high-latency environments.

Let’s understand!

Formally, federated learning represents a decentralized approach to machine learning, wherein the training data remains localized on individual devices, such as smartphones.

Instead of transmitting data to a central server, models are dispatched to devices, trained locally, and only the resultant model updates are gathered and sent back to the server.

In essence, this approach involves leaving the training data on individual devices while learning a shared model by aggregating locally computed gradient updates.
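One common aggregation scheme is Federated Averaging (FedAvg), where the server computes a weighted average of locally trained model weights. A minimal NumPy sketch with made-up client updates:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model weights, weighted by local dataset size (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients trained locally; only their weight vectors leave the device.
client_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
client_sizes = [100, 300]   # client 2 has 3x more local data

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # → [2.5 3.5]
```

The server never sees any raw data point, only these weight vectors, and the new global weights are then broadcast back to the clients for the next round.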

💡

The terminology "Federated Learning" is derived from the concept that a loose **federation** of participating devices (referred to as clients) collaborates with a central server to solve the learning task.

One of its primary merits lies in enhancing privacy and security by eliminating all dependencies on centralized data collection.

This is because each client possesses a local training dataset that remains exclusively on the device and is never uploaded to the server.

Instead, clients compute updates to the global model maintained by the server, transmitting only the essential "model update."

As a result, the entire model update process occurs on the client side, providing a key advantage by decoupling model training from the necessity for direct access to raw training data.

While a degree of trust in the coordinating server is required, federated learning effectively addresses major concerns associated with the conventional centralized approach to machine learning model training.

By facilitating on-device training and minimizing the need for extensive data transfer, federated learning presents practical solutions to challenges inherent in the traditional model training paradigm.

The primary motivations for using federated learning are:

**Privacy:**

- Safeguarding user data is a top priority, especially as more and more users have started caring about their privacy.
- Centralized data repositories pose inherent privacy risks, while federated learning mitigates these concerns by allowing data to reside exclusively on user devices, minimizing exposure.

**Bandwidth and Latency:**

- As previously discussed, the resource-intensive process of transferring substantial data volumes to a central server can be both time and bandwidth-consuming.
- Federated learning strategically minimizes data transfer, proving particularly advantageous in environments characterized by low bandwidth and high latency.

**Data Ownership:**

- Users maintain control and ownership of their data within the federated learning framework.
- This not only addresses concerns related to data ownership but also ensures the preservation of data rights, offering a user-centric approach to machine learning.

**Scalability:**

- Federated learning exhibits a natural scalability that aligns seamlessly with the increasing number of devices.
- This inherent scalability renders it well-suited for applications on a large scale, spanning mobile devices, IoT devices, and edge computing scenarios.

In essence, federated learning represents a paradigm shift by bringing our models to where the data resides, as opposed to the conventional approach of moving data to the location where the model is situated.

This inversion of the traditional model training process emphasizes the adaptability and efficiency of federated learning in contemporary data-driven applications.

Certainly, at this point, the argument may arise that anonymizing data before uploading it to central servers can address privacy concerns.

Simply put, anonymizing means removing all personally identifiable information (PII) from a dataset.

This typically involves replacing or encrypting specific data elements to prevent the identification of individuals associated with the information.

However, contrary to common belief, even handling anonymized data can introduce privacy issues.

Consider a scenario with a database of cardholders — a highly sensitive dataset.

While masking card numbers is a common practice, additional details such as cardholder addresses, necessary for processing, may still be present.

Thus, anonymizing the dataset does not always guarantee the elimination of privacy concerns.

Federated learning, on the other hand, minimizes the transmission of data-specific information to centralized locations. As discussed above, the information transmitted is minimal, typically containing significantly less raw data.

In this paradigm, only **model updates** are sent to the central server, and remarkably, the aggregation algorithm on the server side does not require knowledge of the source of these updates. Thus, the source information can be entirely ignored.

This lack of dependence on the source information guarantees true anonymity by ensuring that locally generated model updates can be transmitted without revealing any other details that might compromise user privacy.

This creates a mutually beneficial scenario.

- Users are content as their experience is driven by high-quality ML models without compromising their data.
- Simultaneously, teams benefit by successfully addressing various challenges, including:

- **Privacy Concerns:** Federated learning effectively sidesteps privacy issues associated with traditional centralized approaches.
- **Reduced Model Training Cost:** The approach helps mitigate costs associated with centralized model training.
- **Minimized Data Maintenance Cost:** Federated learning significantly diminishes the burden of data maintenance costs.
- **Large Dataset Training:** Teams can train models on expansive datasets without the need for centralized storage.
- **Better User Experience:** High-quality ML models can be developed without centralizing user data.

In essence, federated learning provides a win-win solution for everyone.

In federated learning, the scope of data used for model training extends beyond what centralized data engineering may have collected and managed.

By tapping into the full spectrum of data residing on individual user devices, federated learning enables models to learn from diverse and rich datasets.

This diversity enhances the robustness of models, making them more representative of real-world scenarios.

Federated learning not only improves models through collaborative training but also extends benefits directly to users.

When a user's device participates in model training, it receives updates based on collective knowledge, enhancing the user experience.

For example, in a personalized recommendation system, a user benefits from a model trained on the preferences of a larger user base, leading to more accurate and tailored recommendations.

Unlike traditional centralized approaches that demand substantial computational resources for data processing and model training on a central server, federated learning redistributes most computation to user devices.

This shift brings several advantages:

- **Reduced Server Load**: Central servers require less computational power as they no longer need to process and train on massive amounts of data.
- **Lower Latency**: Users experience lower latency since data doesn’t need to be transmitted to a remote server for processing, thereby improving the overall user experience.
- **Energy Efficiency**: The local computation on user devices can be more energy-efficient.

Before understanding key strategies for federated learning, it is essential to understand that the applicability of federated learning is not a one-size-fits-all proposition.

Rather than adopting it everywhere, understanding the specific situations when federated learning is the optimal approach is critical.

This is because if you understand these specific types of situations and come across one someday, you will immediately know that federated learning is the way to go.

In my experience, ideal problems for federated learning have the following properties:

In my experience, most ML projects lack a dedicated experimentation management/tracking system.

As the name suggests, this helps us track:

- **Model configuration** → critical for reproducibility.
- **Model performance** → critical for comparing different models.

…across all experiments.
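To make the idea concrete, here is a bare-bones, stdlib-only sketch of the kind of record such a tracking system stores per run (all names and values here are made up; tools like DVC provide a far more complete version of this):

```python
import json
import time
import uuid
from pathlib import Path

def log_experiment(params, metrics, log_dir="experiments"):
    """Append one run's configuration and performance to a JSON-lines log."""
    Path(log_dir).mkdir(exist_ok=True)
    record = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "params": params,    # model configuration -> reproducibility
        "metrics": metrics,  # model performance  -> comparison across runs
    }
    with open(Path(log_dir) / "runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_experiment({"lr": 0.01, "max_depth": 6}, {"accuracy": 0.93})
print(run["params"]["lr"])  # → 0.01
```

A real tracking system additionally hashes the pipeline's inputs so it knows which steps can be skipped on a rerun, which is exactly the gap DVC fills.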

What’s more, consider that our ML pipeline has three steps:

If we only made some changes in model training (step 3), say, we changed a hyperparameter, does it make any sense to rerun the first two steps?

No, right?

Yet, typically, most ML pipelines rerun the entire pipeline, wasting compute resources and time.

Of course, we may set some manual flags to avoid this.

But being manual, it will always be prone to mistakes.

To avoid this hassle and unnecessary friction, an ideal tracking system must be aware of:

- All changes made to an ML pipeline.
- The steps it can avoid rerunning.
- The only steps it must execute to generate the final results.

While the motivation is quite clear, this is a critical skill that most people ignore, and they continue to leverage highly inefficient and manual tracking systems — Sheets, Docs, etc.

To help you develop that critical skill, I'm excited to bring you a special guest post by Bex Tuychiev.

Bex is a Kaggle Master, he’s among the top 10 AI/ML writers on Medium, and I am a big fan of his writing.

In this machine learning deep dive, he will provide a detailed guide on where we left last week — data version control with DVC.

Make sure you have read that article before proceeding ahead:

More specifically, this article will expand on further highly useful features of DVC for machine learning projects.

The article has been divided into two parts, and by the end of this article, you will learn:

- How to efficiently track and log your ML experiments?
- How to build efficient ML pipelines?

Over to Bex!

Keeping track of machine learning experiments is like keeping FIVE dogs in a bathtub.

Without help, at least FOUR of them are bound to slip out of your hands and ruin everything.

A total disaster is what’s going to happen if you don’t have a proper experiment management system.

First, you’ll probably end up with a complete mess of code, with no idea which version of the model is the most recent or the best performing.

You’ll constantly be overwriting and losing important code, and it will be almost impossible to reproduce your results or track your progress.

On top of that, you’ll have no way of keeping track of hyperparameters, metrics, or any other important details of your experiments (unless you want to write them down). You’ll be flying blind.

**In all seriousness, a proper experiment management system is crucial for any machine learning project.**

It allows you to track and compare your experiments, reproduce results, and make informed decisions about the direction of your project.

Without it, you’re just shooting in the dark and hoping for the best.

By finishing this tutorial, you will be able to track your machine-learning experiments **by adding a single line of code to your training script**.

In the end, you will have a table of experiments, which you can sort by any metric or parameter to find the best model for your use case.

Let's begin!

In a recent deep dive into model deployment, we discussed the importance of version-controlling deployments in machine learning (ML) projects:

More specifically, we looked at techniques to version control:

- Our deployment code, and
- Our deployed model.

Moving on, we also looked at various advantages of version-controlling model deployments.

Let’s recap those.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

**Another advantage of version control is effective collaboration.**

For instance, someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

**Lastly, version control also helps in the reproducibility of an experiment.**

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many programmers may use.

**HOWEVER!**

Let me ask you something:

Purely from a reproducibility perspective, do you think model and code versioning are sufficient?

In other words, are these the only requirements to ensure model reproducibility?

See, when we want to reproduce a model:

- First, we need the exact version of the code that was used to train the model.
    - We have access to the code through code versioning.
- Next, we need the exact configuration the model was trained with:
    - This may include the random seed used in the model, the learning rate, the optimizer, etc.
    - Typically, configurations are a part of the code, so we know the configuration as well through code versioning.
- Finally, we need the trained model to compare its performance with the reproduced model.
    - We have access to the model as well through model versioning.

But how would we train the model without having access to the **exact dataset** that was originally used?

In fact, across model updates, our data can largely vary as well.

Thus, it becomes important to track the exact version of the dataset that was used in a specific stage of model development and deployment.

This is precisely what **data version control** is all about.

The motivation for maintaining a data version control system is quite intuitive and straightforward.

Typically, all real-world machine learning models are trained using large datasets.

Before training a model and even during its development, many transformations are regularly applied to the dataset.

This may include:

- Preprocessing
- Transformation
- Feature engineering
- Tokenization, and many more.

As the number of model updates (or iterations) increases, it can get quite difficult to track which specific version of the dataset was used to train the machine learning model.

If that is clear, then the motivation for having a data version control system in place becomes quite intuitive and simple, as it addresses many data-specific challenges:

With a data version control system, we can precisely reproduce the same training dataset for any given version of our machine learning model.

By tracking and linking each dataset version to a specific model version, we can easily recreate the exact conditions under which a model was trained, making it possible to replicate results.
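One simple way to realize this linking is content-addressing: hash the dataset file and store the digest alongside the model version (this is, roughly, what data version control tools do under the hood). A minimal stdlib sketch with a made-up file:

```python
import hashlib
from pathlib import Path

def dataset_version(path):
    """Hash a dataset file; the digest uniquely identifies this data version."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

Path("train.csv").write_text("x,y\n1,2\n")
v1 = dataset_version("train.csv")

Path("train.csv").write_text("x,y\n1,2\n3,4\n")   # the data changed
v2 = dataset_version("train.csv")

print(v1 != v2)  # → True: any change to the data yields a new version id
```

Recording the digest next to each trained model is enough to tell, later, exactly which data that model was trained on.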

Like codebase traceability, data version control enables traceability by documenting the history of dataset changes.

It offers a clear lineage of how the dataset has evolved over time, including information about who made changes, when those changes occurred, and the reasons behind those modifications.

This traceability is crucial for understanding the data's quality and history, ensuring transparency and accountability in our ML pipeline.

In collaborative machine learning projects, team members need to work on the same data, apply transformations, and access shared dataset versions.

Data version control simplifies data sharing and collaboration by providing a centralized repository for datasets.

Team members can easily access and sync with the latest data versions, ensuring that everyone is on the same page.

As we would see ahead in the practical demo, data version control systems help in optimizing storage and bandwidth usage by employing techniques such as:

- data deduplication
- data caching

This ensures that we do not waste resources on storing redundant copies of large datasets, making the storage and transfer of data more efficient.

It’s common for datasets to undergo quality checks and transformations during their preparation.

Data version control can help identify issues or discrepancies in the dataset as it evolves over time.

By comparing different dataset versions, we can spot unexpected changes and revert to previous versions if necessary.

Many industries and organizations have stringent data governance and compliance requirements.

Data version control supports these needs by providing a well-documented history of data changes, which can be crucial for audits and regulatory compliance.

In situations where a newer model version performs worse or encounters unexpected issues in production, having access to previous dataset versions allows for easy rollback to a more reliable dataset.

This can be a lifesaver in many situations.

**Here, you might be wondering why data rollback is of any relevance to us when it’s the model that must be rolled back.**

You are right.

But there are situations where data rollback is useful.

For instance, assume that your deployed machine learning model is a k-nearest neighbor (kNN) model.

**The thing is that we never train a kNN.**

In fact, there are no explicitly trained weights in the case of a kNN. Instead, there’s only the training dataset that is used for inference purposes.

The model is effectively the training data, and predictions are made based on the similarity of new data points to the stored instances.

If anything goes wrong with this dataset during production, having a data version control system in place can help us quickly roll back to a previous reliable dataset.

What could go wrong, you may wonder?

- Maybe the data engineering team has stopped collecting a specific feature due to a compliance issue.
- Maybe there were some human errors, like accidental data deletions, rendering the kNN unusable.
- Maybe there was data corruption — data files or records became corrupted due to a hardware failure, network issues, or software bugs — again rendering the kNN unusable.

In any of these cases, the data version control system lets us quickly roll back to a previous reliable dataset.

In the meantime, we can work in the development environment to investigate the issue and decide on the next steps.

Now that we have understood the motivation for having a data version control system, let’s look at some considerations for a **data version control** system.

We know that Git can manage the versioning of any type of file in a Git repository. These can be code, models, datasets, config files, etc.

These can be hosted on remote repositories like GitHub with a few commands.

So for someone using GitHub to host codebases, they might be tempted to extend the same to version datasets:

- They may manage different versions of the dataset with Git locally.
- They may host the repository on GitHub (or other services) for collaboration.

The rationale behind this idea could be that Git can version-control data as elegantly as it version-controls codebases.

Sounds like a fair thing to do, right?

Well, it’s not!

Recall our objective again.

In the previous section, we discussed using data version control for production systems.

As production systems are much more complex and involve collaboration, the codebase is typically hosted in a remote repository, like on GitHub.

But the problem is that GitHub repositories (and other similar hosting tools like GitLab) always have an upper limit on the file size we can push to these remote repositories.

In other words, GitHub is designed for lightweight code scripts. It is not well-suited for version controlling large files: GitHub blocks pushes of files larger than 100 MB.

Typically, in machine learning projects, the dataset size can be in the order of GBs. Thus, it is impractical to execute data version control with GitHub.

In fact, we can also verify this experimentally.

Consider we have the following project directory:

As shown above, the `data.csv` file takes about 200 MB of space.

Let’s create a local Git repository first before pushing the files to a remote repository hosted on GitHub.

To create a local Git repository:

- First, we shall initialize a git repository with `git init`.
- Next, we will add the files to the staging area with `git add`.
- Finally, we will use `git commit` to commit to the local repo.

This is demonstrated below:
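Concretely, the three steps map to the following commands (file names match the project directory shown earlier):

```shell
# Initialize a local repository in the project directory
git init

# Stage the dataset and the script
git add data.csv project.py

# Commit the staged files to the local repository
git commit -m "Add dataset and training script"

# Review the commit history
git log --oneline
```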

👉

This is not shown here but we will do the same for the project.py file as well.

All good so far. The changes have been committed successfully to the local git repo.

We can verify this by reviewing the commit history of the local git repo with the `git log` command.

Next, let’s try to push these changes to a remote GitHub repository. To do this, we have created a new repository on GitHub: `dvc_project`.

Before pushing the files to the remote repository, let’s also verify whether the local git repository has any uncommitted changes using the `git status` command:

💡

The `git status` command is used to check the status of your git repository. It shows the state of your working directory and helps you see all the files that are untracked by Git, staged, or unstaged.

The output says that the working tree is clean.

Thus, we can push these changes to the remote GitHub repository created above as follows:

As discussed earlier, GitHub repositories always have an upper limit on the file size we can push to these remote repositories. This is precisely what the above error message says.

For this reason, we can only use typical GitHub repositories for data version control as long as every file in the dataset stays below that size limit, which is rarely the case.

Thus, we need a better version control system, especially for large files, that does not have the above limitations.

💡

Before we proceed, a critical point to note here is that as long as we are **only working locally** and wish to maintain data version control in our personal machine learning projects, using the usual Git-based functionalities is not a big problem: you can use Git as long as you keep everything on your local computer. However, versioning large files is often time-consuming with Git, so it is recommended to use tools that are specifically built for this purpose.

An ideal data version control system must fulfill the following requirements:

**It should allow us to track all data changes like Git does with files.**

- As soon as we make any change (adding, deleting, or altering files) in a git-initialized repository, git can always identify those changes.
- The same should be true for our data version control system.

**It should not be limited to tracking datasets.**

- As we discussed above, GitHub sets an upper limit on the file size we can push to these remote repositories.
- But large files are not necessarily limited to datasets. In fact, models can also be large, and pushing model pickles to GitHub repositories can be difficult.

**It must have support for branching and committing.**

- Like Git, this data version control system must provide support for creating branches and commits.

**Its syntax must be similar to Git.**

- Of course, this is optional but good to have.
- Having a data version control system with a syntax similar to Git simplifies its learning curve.

**It must be compatible with Git.**

- In any data-driven project, data, code, and models always work together.
- After adopting the data version control system, it must not happen that we track code with Git and then manage models/datasets with an entirely incompatible tool.
- They must integrate seamlessly to avoid any unnecessary friction.

**It must have collaborative functionalities like Git.**

- Just as Git makes it extremely simple for teams to collaborate by sharing their code, the data version control system must also promote collaboration.

So what can we do here?

Random forest is a pretty powerful and robust model, which is a combination of many different decision trees.

What makes them so powerful over a traditional decision tree model is Bagging:

Anyone who has ever heard of Random Forest has surely heard of Bagging and **how** it works.

This is because, in my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

👉

The code for this article and its practice exercise notebook has been provided towards the end of the article.

Let’s begin!

Decision trees are popular for their interpretability and simplicity.

Yet, unknown to many, they are pretty infamous when it comes to overfitting any data they are given.

This happens because a standard decision tree algorithm greedily selects the best split at each node, making its nodes more and more pure as we traverse down the tree.

Unless we restrict its growth, nothing can stop a decision tree from 100% overfitting the training dataset.

For instance, consider that we have the following dummy data, and we **intentionally** want to **100% overfit** it with, say, a linear regression model.

This task will demand some serious effort by the engineer.

In other words, we can’t just run `linear_model.fit(X, y)` in this case to directly overfit the dataset.

Instead, as mentioned above, this will require some serious feature engineering effort to entirely overfit the given dataset.

For instance, to intentionally overfit this dummy dataset, we would have to explicitly create relevant features, which, in this case, would mostly be higher-degree polynomial features.

This is shown below:
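As a minimal sketch of this idea (the dummy data and degrees are illustrative), we can watch the training fit improve as we hand linear regression higher-degree polynomial features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Dummy 1-D dataset: a noisy sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

def train_r2(degree):
    """Fit a polynomial regression of the given degree; return training R^2."""
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    return LinearRegression().fit(X_poly, y).score(X_poly, y)

print(train_r2(1), train_r2(15))  # training fit climbs toward 1.0 with degree
```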

As shown above, as we increase the degree of our feature $x$ in our polynomial regression, the model starts to overfit the dataset more and more.

With a polynomial degree of $40$, the model entirely overfits the dataset.

The point is that overfitting this dataset (on any dataset, for that matter) with linear regression typically demands some engineering effort.

While the above dataset was easy to overfit, a complex dataset with all sorts of feature types may require serious effort to intentionally overfit the data.

**However**,** this is NEVER the case with a decision tree model.**

In fact, overfitting **any dataset** with a decision tree demands no effort from the engineer.

In other words, we can simply run `dtree_model.fit(X, y)` to overfit any dataset, regression or classification.

This happens because a standard decision tree always continues to add new levels to its tree until all leaf nodes are pure.

As a result, it always $100\%$ overfits the dataset by default, as shown below:
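We can confirm this behavior in a few lines (the dataset here is a hypothetical noisy sample):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy dummy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=100)

# No growth restrictions: the tree splits until every leaf is pure
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
print(tree.score(X, y))  # → 1.0, i.e., a 100% fit on the training data
```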

The same problem is observed in classification datasets as well.

For instance, consider the following dummy binary classification dataset.

It’s clear that there is some serious overlap between the two classes.

Yet, a decision tree does not care about that.

The model will still meticulously create its decision boundary such that it classifies the dataset with 100% accuracy.

This is depicted below:
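A minimal reproduction of the same effect on classification data (class overlap is simulated here via label noise):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Binary data with deliberately overlapping classes (20% flipped labels)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)

# An unrestricted tree still carves a boundary around every noisy point
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # → 1.0 despite the heavy class overlap
```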

It is important to address this problem.

Of course, there are many ways to prevent this, such as pruning and ensembling.

👉

The main focus of this article is ensembling, specifically bagging, so we won’t get into much detail about pruning.

Pruning is commonly used in tree-based models, where it involves removing branches (or nodes) to simplify the model.

For instance, we can intentionally restrict the decision tree from growing after a certain depth. In sklearn’s implementation, we can do this by specifying the `max_depth` parameter.

Pruning is also possible by specifying the minimum number of samples required to split an internal node.

Another pruning technique is called the **cost-complexity-pruning (CCP)**.

CCP considers a combination of two factors for pruning a decision tree:

- Cost: the number of misclassifications
- Complexity: the number of nodes

Of course, dropping nodes will result in a drop in the model’s accuracy.

Thus, in the case of decision trees, the core idea is to iteratively drop sub-trees, which, after removal, leads to:

- a minimal increase in classification cost
- a maximum reduction of complexity (or nodes)

This is depicted below:

In the image above, both sub-trees result in the same increase in cost. However, it makes more sense to remove the sub-tree with more nodes to reduce computational complexity.

In sklearn, we can control cost-complexity-pruning using the `ccp_alpha` parameter:

- a large value of `ccp_alpha` → results in underfitting
- a small value of `ccp_alpha` → results in overfitting

The objective is to determine the optimal value of `ccp_alpha`, which gives a better model.

The effectiveness of cost-complexity-pruning is evident from the image below:

- Training the decision tree without any cost-complexity-pruning results in a complex decision region plot, and the model exhibits 100% accuracy.
- However, by tuning the `ccp_alpha` parameter, we prevented overfitting while improving the test set accuracy.
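One way to tune it (a sketch; in practice you would cross-validate rather than peek at a single test split) is to scan the candidate alphas suggested by sklearn’s `cost_complexity_pruning_path`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Score a pruned tree for each candidate alpha on held-out data
scores = {alpha: DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
                 .fit(X_tr, y_tr).score(X_te, y_te)
          for alpha in path.ccp_alphas}

best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```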

Another widely used technique to prevent overfitting is ensemble learning.

In a gist, an ensemble combines multiple models to build a more powerful model.

Whenever I wish to intuitively illustrate their immense power, I use the following image:

They are fundamentally built on the idea that by aggregating the predictions of multiple models, the weaknesses of individual models can be mitigated. Combining models is expected to provide better overall performance.

Ensembles are primarily built using two different strategies:

- Bagging
- Boosting

Here’s how it works:

- Bagging creates different subsets of data with replacement (this is called bootstrapping).
- Next, we train one model per subset.
- Finally, we aggregate all predictions to get the final prediction.
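These three steps can be sketched from scratch in a few lines (a regressor version on a dummy noisy sine; the number of models is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)

# 1) Bootstrap: draw row indices WITH replacement, one subset per model
# 2) Train one unrestricted tree per subset
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3) Aggregate: average the individual predictions
def bagged_predict(X_new):
    return np.mean([t.predict(X_new) for t in trees], axis=0)

print(bagged_predict(X[:5]))
```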

Some common models that leverage Bagging are:

- Random Forests
- Extra Trees

Here’s how it works:

- Boosting is an iterative training process.
- The subsequent model puts more focus on misclassified samples from the previous model
- The final prediction is a weighted combination of all predictions

Some common models that leverage Boosting are:

- XGBoost,
- AdaBoost, etc.

Overall, ensemble models significantly boost the predictive performance compared to using a single model. They tend to be more robust, generalize better to unseen data, and are less prone to overfitting.

As mentioned above, the focus of this article is specifically **Bagging**.

In my experience, there are plenty of resources that neatly describe:

- How Bagging algorithmically works in random forests.
- Experimental demo on how Bagging reduces the overall variance (or overfitting).

For instance, we can indeed verify variance reduction ourselves experimentally.

The following diagram shows the decision region plot obtained from a decision tree and random forest model:

It’s pretty clear that a random forest does not exhibit as high variance (overfitting) as the decision tree model does.
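This comparison is easy to reproduce (synthetic two-moons data here; the exact numbers will vary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# The lone tree memorizes its training data; the forest generalizes better
print("tree:  ", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```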

Typically, these resources explain the idea of Bagging as follows:

Instead of training one decision tree, train plenty of them, each on a different subset of the dataset generated with replacement. Once trained, average the predictions of all individual decision tree models to obtain the final prediction. This reduces the overall variance and increases the model’s generalization.

However, these resources often struggle to provide an intuition on:

- **Why** Bagging is so effective.
- **Why** do we sample rows from the training dataset **with replacement**.
- The mathematical demonstration that verifies variance reduction.

Thus, in this article, let me address all of these above questions and provide you with a clear and intuitive reasoning on:

- Why bagging makes the random forest algorithm so effective at variance reduction.
- Why does bagging involve sampling with replacement?
- How do we prove variance reduction mathematically?

Towards the end, we shall also build an intuition towards the Extra trees algorithm and how it further contributes towards the variance reduction step.

Once we understand the objective bagging tries to solve, we shall also formulate new strategies to build our own bagging algorithms.

Let’s begin!

As shown in an earlier diagram, the core idea in a random forest model is to train multiple decision tree models, each on a different sample of the training dataset.

During inference, we take the average of all predictions to get the final prediction:

And as we saw earlier, training multiple decision trees reduces the model's overall variance.

But why?

**Let’s dive into the mathematics that will explain this.**
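Here is the standard identity we will build toward (a textbook result, stated up front: $\sigma^2$ is the variance of a single tree's prediction and $\rho$ the pairwise correlation between trees):

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2$$

As the number of trees $B$ grows, the second term vanishes, so the averaged model's variance falls toward $\rho\sigma^2$. This also hints at why the sampling scheme matters: bootstrapping makes the trees less correlated, shrinking $\rho$ and hence the floor itself.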

The scikit-learn project has possibly been one of the most significant contributions to the data science and machine learning community for building traditional machine learning (ML) models.

Personally speaking, it’s hard to imagine a world without sklearn.

However, things get pretty concerning if we intend to deploy sklearn-driven models in real-world systems.

Let’s understand why.

Scikit-learn models are primarily built on top of NumPy, which, of course, is a fantastic and high-utility library for numerical computations in Python.

Yet, contrary to common belief, NumPy isn’t as optimized as one may hope to have in real-world ML systems.

One substantial reason for this is that **NumPy can only run on a single core of a CPU**.

This provides massive room for improvement as there is no parallelization support in NumPy (yet), and it naturally becomes a **big concern** for data teams to let NumPy drive their production systems.

Traditional ML models do perform well on tabular datasets. But as we discussed in a recent blog on model compression: *“Typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance. Instead, we must consider several operational metrics that are not ML-related.”*

Another major limitation is that scikit-learn models cannot natively run on Graphics Processing Units (GPUs).

Having GPU support in deployment matters because real-world systems often demand lightning-fast predictions and processing.

However, as discussed above, sklearn models are primarily driven by NumPy, which, disappointingly, can only run on a single core of a CPU. In this context, it is unlikely to have GPU support anytime soon.

In fact, this is also mentioned on Sklearn’s FAQ page:

**Question:** *Will you add GPU support?*

**Answer:** *No, or at least not in the near future. The main reason is that GPU support will introduce many software dependencies and introduce platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.*

Further, they mention that “*Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed can often be achieved by a careful choice of algorithms.*”

**I don’t entirely agree with this specific statement.**

Consider the enterprise space. Here, the data is primarily tabular. Classical ML techniques such as linear models and tree-based ensemble methods are frequently used to model the tabular data.

In fact, when you have tons of data to model, there’s absolutely no reason to avoid experimenting with traditional ML models first.

Yet, in the current landscape, one is often compelled to train and deploy deep learning-based models just because they offer optimized matrix operations using **tensors**.

We see a clear gap here.

Thus, in this article, let’s learn a couple of techniques today:

- How do we run traditional ML models on large datasets?
- How do we integrate GPU support with traditional ML models in deployment systems?
- While there is no direct way to do this, we must (somehow) compile our machine-learning model to tensor operations, which can be loaded on a GPU for acceleration. We’ll discuss this in the article shortly.

But before that, we must understand a few things.

More specifically:

- What are tensors?
- How are tensors different from a traditional NumPy array?
- Why are tensor computations faster than NumPy operations, and why are tensor operations desired?

Let’s begin!

Many often interpret tensors as a complicated and advanced concept in deep learning.

However, it isn’t.

The only thing that is ever there to understand about Tensors is that, like any NumPy array, Tensors are just another data structure to store multidimensional data.

- When we use NumPy to store numerical data, we create a **NumPy array** — NumPy’s built-in data structure.
- When we use PyTorch (for instance) to store numerical data, we create a **Tensor** — PyTorch’s built-in data structure.

That’s it.

**Tensor, like NumPy array, is just another data structure.**
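To make this concrete (assuming PyTorch is installed), both structures hold the same multidimensional data and behave almost identically:

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])      # NumPy's data structure
ten = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # PyTorch's data structure

# Same data, same shape, same element-wise semantics
print(arr.shape, tuple(ten.shape))             # (2, 2) (2, 2)
print((arr * 2).sum(), float((ten * 2).sum())) # 20.0 20.0

# Converting between the two is straightforward
ten_from_arr = torch.from_numpy(arr)           # shares memory on CPU
arr_from_ten = ten.numpy()
```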

Now, an obvious question at this point is:

Why create another data structure when NumPy arrays do the exact same thing of storing multidimensional data, and they are very well integrated with other scientific Python libraries?

There are multiple reasons why PyTorch decided to develop a new data structure.

NumPy undoubtedly offers:

- extremely fast, and
- optimized operations.

This happens through its vectorized operations.

Simply put, vectorization offers run-time optimization:

- when dealing with a batch of data together…
- …by avoiding native Python for-loops (which are slow).
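A quick illustration of the gap between a native Python loop and the vectorized equivalent (exact timings depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total_loop = 0.0
for v in x:                      # native Python for-loop: one element at a time
    total_loop += v * v
t1 = time.perf_counter()

total_vec = float(np.dot(x, x))  # vectorized: one call over the whole batch
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.3f}s")
```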

But as discussed earlier in this article, NumPy **DOES NOT support parallelism.**

Thus, even though its operations are vectorized, every operation is executed in a single core of the processing unit.

This provides further scope for run-time improvement.

Of course, there are open-source libraries like Numexpr that address this limitation by providing a fast evaluator for NumPy expressions using:

**Multi-threading:**

- It is a parallel computing technique that allows a program to execute multiple threads (smaller units of a process) concurrently.
- In the context of NumPy and libraries like Numexpr, multi-threading accelerates mathematical and numerical operations by dividing the computation across multiple CPU cores.
- This approach is particularly effective when you have a multi-core CPU, as it leverages the available cores for parallelism, leading to faster computation.

**Just-in-time (JIT) compilation:**

- JIT compilation is a technique used to improve the run-time performance of code by compiling it at run-time, just before execution.
- In the context of Numexpr (and similar libraries), JIT compilation involves taking a NumPy expression or mathematical operation and dynamically generating machine code specific to the operation.
- As a result, JIT-compiled code can run much faster than equivalent pure Python code because it is optimized for the specific operation and can make use of low-level hardware features.

The speedup offered by Numexpr is evident from the image below.

According to Numexpr’s documentation, depending upon the complexity of the expression, the speed-ups can range from 0.95x to 20x.

Nonetheless, the biggest problem is that Numexpr can only speed up element-wise operations on NumPy arrays.

This includes:

- Element-wise sum/multiplication, etc.
- Element-wise transformations like `sin`, `log`, etc.
- and more.

**But Numexpr has no parallelization support for matrix multiplications, which, as you may already know, are the backbone of deep learning models.**

This problem gets resolved in PyTorch tensors as they offer parallelized operations.

**An important point to note:**

**GPU parallelization:**

- If you're working with PyTorch tensors on a GPU (using CUDA), the matrix multiplication operation is highly parallelized across the numerous cores of the GPU.
- Modern GPUs consist of thousands of cores designed for parallel computation.
- When you perform a matrix multiplication on a GPU, these cores work together to compute the result much faster than a CPU could.

**CPU parallelization:**

- The extent of parallelization on a CPU may depend on the CPU’s architecture.
- All CPUs these days have multiple cores, and PyTorch is optimized to utilize these cores efficiently for matrix operations.
- While it may not be as parallel as a GPU, you can still expect significant speed improvements over performing the operation in pure Python.

We can also verify experimentally:

- On the left, we create two random NumPy arrays and perform matrix multiplication using the `np.matmul()` method.
- On the right, we create two random PyTorch tensors and perform matrix multiplication using the `torch.matmul()` method.

As depicted above, **PyTorch is over three times faster than NumPy**, which is a massive speed-up.

This proves that PyTorch provides highly optimized vector operations, which the neural network can benefit from, **not only during forward pass but backpropagation as well**.
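A minimal version of that experiment is below; note that the measured gap will vary with your BLAS build, thread count, and hardware:

```python
import time
import numpy as np
import torch

n = 1024
a_np = np.random.rand(n, n)
b_np = np.random.rand(n, n)
a_t = torch.from_numpy(a_np)   # zero-copy view of the same data
b_t = torch.from_numpy(b_np)

t0 = time.perf_counter(); c_np = np.matmul(a_np, b_np); t1 = time.perf_counter()
t2 = time.perf_counter(); c_t = torch.matmul(a_t, b_t); t3 = time.perf_counter()

print(f"NumPy: {t1 - t0:.4f}s, PyTorch: {t3 - t2:.4f}s")
```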

In fact, here’s another reason why tensor operations in PyTorch are faster.

See, as we all know, **NumPy is a general-purpose computing framework** that is designed to handle a wide range of numerical computations across various domains, not limited to just deep learning or machine learning.

In fact, NumPy is not just used by data science and machine learning practitioners, but it is also widely used in various scientific and engineering fields for tasks such as signal processing, image analysis, and simulations in physics, chemistry, and biology.

It’s so popular in Biological use cases that a bunch of folks extended the NumPy package to create BioNumPy:

While its versatility is a key strength, it also means that NumPy’s internal optimizations are geared toward a broad spectrum of use cases.

On the other hand, **PyTorch is purposely built for deep learning and tensor operations**.

This specialized focus allows PyTorch to implement highly tuned and domain-specific optimizations for tensor computations, including matrix multiplications, convolutions, and more.

These optimizations are finely tuned to the needs of deep learning, where large-scale matrix operations are fundamental.

Being niched down to a specific set of users allowed PyTorch developers to optimize tensor operations, including matrix multiplication to a specific application — deep learning.

If you want another motivating example, we discussed this in the newsletter here:

In a gist, the core idea is that the more specific we get, the better we can do compared to a generalized solution.

Deep learning is all about a series of matrix operations applied layer after layer to generate the final output.

For instance, consider the following neural network for regression:

- First, the input received at the input layer $(x_1, x_2, \cdots, x_m)$ is transformed by a set of weights $(W_A)$ and an activation function to get the output of the hidden layer.
- Next, the output of the hidden layer is further transformed by a set of weights $(W_B)$ to get the final output (t).

If we were to use NumPy to represent input, weights, and layer outputs, it would be impossible to tell how a specific array was computed.

For instance, consider the two NumPy arrays `arr1` and `arr2` below:

The NumPy array `arr3` holds no information about how it was computed.

In other words, as long as we don’t manually dig into the code, we can never tell:

- What were the operands?
- What was the operator?

But why do we even care about that information?

See, as long as we are only doing a forward pass in a neural network, we don’t care which specific operation and which arrays generated a particular layer output. We only care about the output in that case.

But that’s not how neural networks are trained, is it?

To train a neural network, we must run backpropagation.

To run backpropagation, we must compute gradients to update the weights.

And to compute gradients of a layer’s weights, we must know the specific arrays that were involved in that computation.

For instance, consider the above neural network again:

To update the weights $W_B$, we must compute the gradient $\Large \frac{\delta L}{\delta W_B}$.

The above gradient depends on the loss value $L$, which in turn depends on $\hat y$.

Thus, we must know the specific vectors that were involved in the computation of $\hat y$.

While this is clear from the above network:

- What if we add another layer?
- What if we change the activation function?
- What if we add more neurons to the layer?
- What if we were to compute the gradient of weight in an earlier layer?

All this can get pretty tedious to manage manually.

However, if (somehow) we can keep track of how each tensor was computed, what operands were involved, and what the operator was, we can simplify gradient computation.

A computational graph helps us achieve this.

Simply put, a computational graph is a directed acyclic graph representing the sequence of mathematical operations that led to the creation of a particular tensor.

💡

A directed acyclic graph is a directed graph with no directed cycles.

During network training, PyTorch forms this computational graph during the forward pass.

For instance, the computational graph for a dummy neural network is shown below:

💡

Here, we have represented the operation as **matmul** for simplicity. In reality, however, PyTorch stores the backward function of that operation in its computational graph.

- First, we perform a matrix multiplication between the input $X$ and the weights $W_A$ to get the output activations $Z_B$ (we are ignoring any activation functions for now).
- Next, we perform a matrix multiplication between the output activations $Z_B$ and the weights $W_B$ to get the network output $Z_C$.

During backpropagation, PyTorch starts scanning the computational graph backward, i.e., from the output node, iteratively computes the gradients, and updates all the weights.

The program that performs all the gradient computation is called PyTorch Autograd.
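Here is Autograd in action on the toy two-layer network above (shapes are illustrative; activations are omitted, as in the text):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                        # a tiny batch of inputs
W_A = torch.randn(3, 5, requires_grad=True)  # hidden-layer weights
W_B = torch.randn(5, 1, requires_grad=True)  # output-layer weights

Z_B = X @ W_A          # the forward pass builds the computational graph...
Z_C = Z_B @ W_B        # ...each tensor remembers the op that produced it
loss = Z_C.pow(2).mean()

loss.backward()        # walk the graph backward, filling in .grad
print(W_A.grad.shape, W_B.grad.shape)
```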

Large deep learning models demand plenty of computational resources for speeding up model training.

However, NumPy operations are primarily designed to run on the Central Processing Unit (CPU), which is the general-purpose processor in most computers.

While CPUs are versatile and suitable for many tasks, they do not provide the speed and parallel processing capabilities needed for large-scale numerical computations, especially in the context of modern deep learning and scientific computing.

On a side note, if you genuinely want to run NumPy-like computation on a GPU, CuPy is an open-source NumPy alternative that you may try.

It’s a NumPy-compatible array library for GPU-accelerated computing.

The syntax of CuPy is quite compatible with NumPy. To use the GPU, you typically just need to replace `import numpy as np` with `import cupy as cp` (and the `np.` calls with `cp.`) in your code.

Nonetheless, the issue of not being able to track how each array was computed still exists with CuPy.

Thus, even if we wanted to, we could not use CuPy as an alternative to NumPy.

In fact, CuPy, like NumPy, is also a general-purpose scientific computation library. So any deep learning-specific optimizations are still not up to the mark.

These limitations prompted PyTorch developers to create a new data structure, which addressed these limitations.

This also suggests that by somehow compiling machine learning models to tensor computations, we can leverage immense inference speedups.

Before getting into those details, let’s understand how we can train sklearn models on large datasets on a CPU.

So far, we have spent plenty of time understanding the motivation for building traditional ML models on large datasets.

As sklearn can only utilize the CPU, using it for large datasets is still challenging.

Yet, there’s a way.

We know that sklearn provides a standard API across each of its machine learning model implementations.

- Train the model using `model.fit()`.
- Predict the output using `model.predict()`.
- Compute the accuracy using `model.score()`.
- and more.

However, the problem with training a model this way is that the sklearn API expects the entire training data at once.

This means that the entire dataset **must be** available in memory to train the model.

But what if the dataset itself is too large to load in memory? These are called out-of-memory datasets.

In fact, even if we can somehow barely load the dataset in memory, it might still be difficult to train the model because training itself requires additional computations, which, of course, consume memory.

Thus, there’s a high possibility that the program (or Jupyter kernel) may crash.

Nonetheless, there’s a solution to this problem.

In situations where it’s not possible to load the entire data into the memory at once, we can load the data in chunks and fit the training model for each chunk of data.

This is also called incremental learning, and sklearn conveniently provides the flexibility to do so.

More specifically, sklearn implements the `partial_fit()` API for various algorithms, which offers incremental learning.

As the name suggests, the model can learn incrementally from a mini-batch of instances. This sidesteps memory constraints, as only a few instances are loaded in memory at once.

What’s more, by loading and training on a few instances at a time, we can possibly speed up the training of sklearn models.

Why?

Usually, when we use the `model.fit(X, y)` method to train a model in sklearn, the training process is vectorized but on the entire dataset.

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Let’s see this in action!

First, let’s create a dummy classification dataset with:

- 20 Million training instances
- 5 features
- 2 classes

We will use the `make_classification()` method from sklearn to do so:
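A sketch of the dataset creation (scaled down here so it runs quickly; the article's actual run uses 20 million rows, and the column names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Article settings use n_samples=20_000_000; scaled down here for illustration.
X, y = make_classification(
    n_samples=100_000,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=2,
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)  # filename referenced later in the article
```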

After creating a Pandas DataFrame and exporting it to a CSV, the dataset occupies roughly 4 GBs of local storage space:

We train an `SGDClassifier` model using sklearn on the entire dataset as follows:
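A sketch of the full-dataset training (data is generated inline and scaled down so the snippet is self-contained; `model_full` is the name used in the later comparison):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small inline dataset; the article trains on the full 20M-row CSV instead.
X, y = make_classification(n_samples=10_000, n_features=5, random_state=42)

model_full = SGDClassifier(random_state=42)
model_full.fit(X, y)           # the entire dataset must be in memory here

print(model_full.score(X, y))  # training accuracy
```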

The above training takes about $251$ seconds.

Next, let’s train the same model using the `partial_fit()` API of sklearn.

Here’s what we shall do:

- Load data from the CSV file `large_dataset.csv` in chunks.
    - We can do this by specifying the `chunksize` parameter in the `pd.read_csv()` method.
    - Say `chunksize=400000`; this would mean that Pandas will only load 400,000 rows at a time in memory.
- After loading a specific chunk, we will invoke the `partial_fit()` API on the `SGDClassifier` model.

This is implemented below:
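A self-contained sketch of the chunked approach (a small stand-in CSV is generated inline; the article uses `chunksize=400000` on the 4 GB file):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Create a small stand-in for large_dataset.csv so the sketch runs end to end.
X, y = make_classification(n_samples=50_000, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["label"] = y
df.to_csv("large_dataset.csv", index=False)

model_chunk = SGDClassifier(random_state=42)
all_classes = np.array([0, 1])  # partial_fit needs the full class list up front

# chunksize=400_000 in the article; smaller here to match the toy dataset.
for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    X_chunk = chunk.drop(columns=["label"]).to_numpy()
    y_chunk = chunk["label"].to_numpy()
    model_chunk.partial_fit(X_chunk, y_chunk, classes=all_classes)

print(model_chunk.score(X, y))
```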

**The training time is reduced by ~8 times**, which is massive.

💡

In this case, the `classes` parameter is used to specify all the classes in the training dataset. When using the `partial_fit()` API, a mini-batch may not have instances of all classes (especially the first mini-batch). Thus, the model will be unable to cope with new/unseen classes in subsequent mini-batches. Therefore, we must pass a list of all possible classes in the `classes` parameter.

This validates what we discussed earlier:

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point. Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Of course, we must also compare the model coefficients and prediction accuracy of the two models.

The following visual depicts the comparison between `model_full` and `model_chunk`:

Both models have similar coefficients and similar performance.

Having said that, it is also worth noting that not all sklearn estimators implement the `partial_fit()` API.

Here's the list of models that do:

Once we have trained our sklearn model (either on a small dataset or large), we may want to deploy it.

However, as discussed earlier, sklearn models are backed by NumPy computations, so they **can only run on a single core of a CPU**.

Thus, in a deployment scenario, this can lead to suboptimal run-time performance.

Nonetheless, it is possible to compile many sklearn models to tensor operations, which can be loaded on a GPU to gain immense speedups.

Let’s understand how.

In recent deep dives, we’ve primarily focused on cultivating skills that can help us develop large machine learning (ML) projects.

For instance, in the most recent deep dive on “**Model Compression**”, we learned many techniques to drastically reduce the size of a model to make it more production-friendly.

In the above article, we saw how these techniques allow us to reduce both the latency and size of the original model, which directly helps in:

- Lowering computation costs.
- Reducing model footprint.
- Improving user experience due to low latency…

**…all of which are critical metrics for businesses.**

However, learning about model compression techniques isn’t sufficient.

In most cases, we would only proceed with model compression when the model is intended to serve an end-user.

And that is only possible when we know how to deploy and manage machine learning in production.

Thus, after learning about model compression techniques, we are set to learn the next critical skill — **deployment**.

In my opinion, many think about deployment as just “deployment” — host the model somewhere, obtain an API endpoint, integrate it into the application, and you are done!

**But that is almost NEVER the case.**

This is because, in reality, plenty of things must be done post-deployment to ensure the model’s reliability and performance.

What are they? Let’s understand!

Deployment is a pivotal stage in the ML project lifecycle. It’s that stage where a real user will rely on your model’s predictions.

Yet, it’s important to recognize that deployment is not the final destination.

After deploying a model, several critical considerations must be addressed to ensure its reliability and performance.

Let’s understand them.

Version control is critical to all development processes. It allows developers to track software changes (code, configurations, data, etc.) over time.

In the context of data teams, version control can be especially crucial when deploying models.

For instance, with version control, one can precisely identify what changed, when it changed, and who changed it — which is crucial information when trying to diagnose and fix issues that arise during the deployment process or if models start underperforming post-deployment.

This goes back to what we discussed in a recent deep dive — “*Machine learning deserves the rigor of any software engineering field.*”

If the model starts underperforming, git-based functionality allows us to quickly roll back to previous versions of the model.

There are many other benefits too.

Effective collaboration becomes increasingly important as data science projects get bigger and bigger.

Someone in the team might be working on identifying better features for the model, and someone else might be responsible for fine-tuning hyperparameters or optimizing the deployment infrastructure.

And it is well known that with version control, teams can work on the same codebase/data and improve the same models without interfering with each other’s work.

Moreover, one can easily track changes, review each other’s work, and resolve conflicts (if any).

Reproducibility is one of the critical aspects of building reliable machine learning.

Imagine this: something that works on one system but does not work on another reflects bad reproducibility practices.

Why is it important, you may wonder?

It ensures that results can be replicated and validated by others, which improves the overall credibility of our work.

Version control allows us to track the exact code version and configurations used to produce a particular result, making it easier to reproduce results in the future.

This becomes especially useful for open-source data projects that many may use.

CI/CD enables teams to build, test, and deploy code quickly and efficiently.

In machine learning, Continuous Integration (CI) may involve automatically building and testing changes to ML models as soon as they are committed to a code repository.

In Continuous Deployment (CD), the objective is to release model changes once they have passed testing.

Consequently, it should seamlessly update the changes to production, making the latest version of the model available to end users.

Model logging is another crucial aspect of post-deployment ML operations.

As the name suggests, logging involves capturing and storing relevant information about model performance, resource utilization, predictions, input data, latency, etc.

There are various reasons why model logging is important and why it’s something that should NEVER be overlooked.

To understand better, imagine you have already deployed a model, and it is serving end-users.

Once deployed, it is very unlikely that nothing will go wrong in production, **especially on the data front**!

Let’s understand in detail.

**Concept drift** happens when the statistical properties of the target variable or the input features of the model change over time.

In simpler terms, the relationship between your model's inputs and outputs evolves, making your model less accurate over time if not addressed.

Concept drift can occur due to various reasons, such as:

- Changes in user behavior
- Shifts in the data source
- Alterations in the underlying data-generating process.

For instance, imagine you are building a spam email classifier. You train the model on a dataset collected over several months.

Initially, the model performs well and accurately classifies spam and non-spam emails.

However, over time, email spamming techniques evolve.

New types of spam emails emerge with different keywords, structures, and techniques.

This change in the underlying concept of “spam” represents concept drift.

That is why it is important to have periodic retraining or continuous training strategies in place.

If your model isn't regularly retrained with up-to-date data, it may start misclassifying the new types of spam emails, leading to decreased performance.

💡

The term ‘Covariates’ refers to the features of your model.

**Covariate shift** is a specific type of distribution shift that occurs when the distribution of the input features (covariates) in your data changes over time, but the true relationship between the target variable and the input remains the same.

In other words, the true (or natural) relationships between the input features and the target variable stay constant, but the distribution of the input features shifts.

For instance, consider this is the true relationship, which is non-linear:

Based on the observed training data, we ended up learning a linear relationship:

However, at the time of inference post-deployment, the distribution of input samples was different from that of the observed distribution:

It leads to poor model performance because the model was trained on one distribution of the data, but now, it is being tested or deployed on a different distribution.

Methods for addressing covariate shifts include reweighting the training data or using domain adaptation techniques to align the source and target distributions.
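A toy sketch of the reweighting idea (the 1-D Gaussians standing in for the training and deployment input distributions are hypothetical; in practice the density ratio must be estimated from data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=1_000)  # training inputs ~ p(x) = N(0, 1)

# Suppose deployment inputs follow q(x) = N(1, 1). Weight each training
# point by the density ratio q(x) / p(x) to emphasize the shifted region.
weights = norm.pdf(x_train, loc=1.0) / norm.pdf(x_train, loc=0.0)

# These weights can then be passed to most sklearn estimators, e.g.:
# model.fit(X, y, sample_weight=weights)
print(weights.mean())  # close to 1 in expectation
```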

💡

To an extent, batch normalization is a remedy for covariate shifts in neural networks.

For instance, suppose you are building a weather forecasting model. You train the model using historical weather data from a specific region, and the training data includes features like temperature, humidity, and wind speed.

However, when you deploy the model to a different region with a distinct climate, the distribution of these features can shift significantly.

For instance, temperature ranges and humidity levels in the new region might be quite different from those in the training data. This covariate shift can cause your model to make inaccurate predictions in the new environment.

When building statistical models, we typically assume that the samples are identically distributed.

**Non-stationarity** refers to the situation where the **probability distribution** of the samples evolves over time in a non-systematic or unpredictable manner.

This can encompass various aspects, including changes in data distributions, trends, seasonality, or other patterns.

**Non-stationarity can be challenging for machine learning models, as they are typically trained while assuming that the data distribution remains constant.**

For instance, assume you are building some wealth predictor. A raw currency amount will typically not serve as a good feature because currency values are affected by inflation.

Models deployed in non-stationary environments may need regular updates or adaptive learning strategies to cope with changing data patterns.

**Unrepresentative training data** is a situation where the data used to train a machine learning model does not adequately represent the real-world conditions or the diversity of scenarios that the model will encounter in production.

When training data is not representative, the model may perform well on the training data but poorly on new, unseen data.

This issue can lead to bias and poor generalization.

For instance, suppose you are building a speech recognition system for a voice assistant.

You collect training data primarily from young adults with clear accents and no speech impairments.

However, in real-world usage, the voice assistant will be used by people of all ages who speak various languages and have different speech patterns.

If your training data is unrepresentative and biased towards a specific demographic, the model may struggle to understand and accurately transcribe speech from a more diverse user base, leading to poor performance in production.

The above four problems that we typically face in production systems highlight the importance of model logging.

As discussed above, addressing these issues often involves continuous monitoring of model performance in the deployment environment, collecting and labeling new data when necessary, and retraining the model to adapt to changing conditions.

Traditionally, in industry, advanced techniques like adaptive learning are employed to update the model with the new data and potentially mitigate the impact of concept drift, covariate shift, non-stationarity, and unrepresentative training data.

👉

Below, we shall discuss a bit about Adaptive learning (mostly conceptual ideas), but we will do a full practical article on this soon.

Traditionally, ML models are trained on some gathered fixed/static dataset and then used to make predictions on unseen data.

While this is how machine learning has been typically (and successfully) approached so far, it becomes infeasible to train a new model from scratch every time we get some new data.

This makes intuitive sense as well.

Adaptive learning is a remedy to this problem.

Adaptive models can adapt and improve their performance as they are exposed to more data, leading to better accuracy and utility.

In situations where the data distribution is constantly changing, adaptive models can adapt and continue to perform well, while non-adaptive models may struggle.

A major advantage of adaptive learning is that the model isn’t trained from scratch: if the previously trained model seeds the next retraining iteration, the additional computational cost of every update is small.

In the realm of solving real-life problems by deploying machine learning models, it is inevitable that, with time, data distribution will change.

As a result, the models trained on old data will likely provide little value going forward.

Adaptive models are a great solution in such situations, as they consistently adapt to the incoming data streams.

In almost all ML use cases, the algorithm is never coded from scratch.

Instead, one uses open-source implementations offered by libraries like PyTorch, Sklearn, and many more.

To ensure reproducibility in production, the production environment should be consistent with the environment in which it was trained.

This involves installing similar versions of libraries used, software dependencies, OS configurations, and many more.

Of course, achieving this consistency need not be a painstaking process; at its core, all you should do is maintain an environment configuration.

Yet, it does require careful environment configuration and management.

This involves documenting and tracking the versions of all software components, libraries, and dependencies used during model development and deployment.

To address these consistency challenges, organizations often use containerization technologies like Docker.

Containers encapsulate the entire environment, including software dependencies, libraries, and configurations, ensuring that the same environment is replicated in both the development and production stages.

ML engineers may not have experience with deployment. They may not have the necessary expertise in areas such as software engineering, MLOps, and infrastructure management.

This can make it difficult for them to effectively deploy and scale models in production environments.

In such cases, organizations hire specialized talents.

However, engineers hired specifically for deployment may not have an in-depth understanding of ML algorithms and techniques.

This makes it difficult for them to understand the code and make the necessary optimizations, leading to issues with scaling, performance, and reliability, which can ultimately impact the effectiveness of the model in production.

The above pain points, along with the data challenges we discussed above, highlight the **necessity for a data scientist to have the necessary deployment expertise**.

Traditional hosting services like Google Cloud, AWS, and Heroku have been go-to options for deploying machine learning models.

However, the process can be challenging and time-consuming, requiring specialized expertise in infrastructure and DevOps.

For data scientists without these skills, deploying models to production can be a significant pain point.

There are several challenges associated with traditional hosting services.

First, data scientists often switch between different tools and environments to manage deployments. This means leaving the comfort of their Jupyter notebooks, where they spend most of their time developing and refining models.

The process can be jarring, and the need to learn new tools and interfaces can slow down productivity.

Second, deploying machine learning models to production environments demands plenty of configuration and management of infrastructure resources, including servers, networking, and security.

This is a specialized area that many data scientists may need to become more familiar with, and it can take a lot of time and effort to get right.

The above pain points highlight a need for a simple and elegant way to deploy machine learning models that doesn’t require specialized expertise and can be done entirely from a Jupyter Notebook.

Modelbit is a deployment service that specifically addresses all these challenges, allowing data scientists to deploy models with just a single command from their notebooks.

With Modelbit, there’s no need to worry about infrastructure, security, or server management — the service takes care of everything, allowing data scientists to focus on what they are supposed to do — building and improving models.

Thus, in this article, let’s understand how to use Modelbit for machine learning model deployment.

Let’s begin!

🗒️

The core objective behind model deployment is to obtain an API endpoint for our deployed model, which can be later used for inference purposes:

Modelbit lets us seamlessly deploy ML models directly from our Python notebooks (or Git, as we would see ahead in this article) and obtain a REST API.

Since Modelbit is a relatively new service, let’s understand the general workflow to generate an API endpoint when deploying a model with Modelbit.

The image below depicts the steps involved in deploying models with Modelbit:

- Step 1) We connect the Jupyter kernel to Modelbit.
- Step 2) Next, we train the ML model.
- Step 3) We define the inference function. Simply put, this function contains the code that will be executed at inference. Thus, it will be responsible for returning the prediction.
- Step 4) [OPTIONAL] Here, we specify the version of Python and other open-source libraries we used while training the model.
- Step 5) Lastly, we send it for deployment.

Once done, Modelbit returns the API endpoint, which we can integrate into any of the applications and serve end-users with.

Let’s implement this!

Training machine learning (ML) models is frequently driven by a relentless pursuit of achieving higher and higher accuracies.

Many create increasingly complex deep learning models, which, without a doubt, do incredibly well “performance-wise.”

However, the complexity severely impacts their real-world utility.

**For years, the primary objective in model development has been to achieve the best performance metrics.**

👉

This, unfortunately, is also a practice that many leaderboard-based competitions promote. Nothing wrong with that, but in my opinion, it overshadows the importance of focusing on the real-world applicability of the solution.

However, it is important to note that when it comes to deploying these models in production (or user-facing) systems, the focus shifts from raw accuracy to considerations such as efficiency, speed, and resource consumption.

Thus, typically, when we deploy any model to production, the specific model that gets shipped to production is NOT solely determined based on performance.

Instead, we must consider several operational metrics that are not ML-related.

**What are they? Let’s understand!**

When a model is deployed into production, certain requirements must be met.

Typically, these “requirements” are not considered during the prototyping phase of the model.

For instance, it is fair to assume that a user-facing model may have to handle plenty of requests from a product/service the model is integrated with.

And, of course, we can never ask users to wait for, say, a minute for the model to run and generate predictions.

Thus, along with “model performance,” we would want to optimize for several other operational metrics:

It’s the time it takes for a model to process a single input and generate a prediction.

It measures the delay between sending a request to the model and receiving the response.

Striving for a low inference latency is crucial for all real-time or interactive applications, as users expect a quick response.

High latency, as you may have guessed, will lead to a poor user experience and will not be suitable for many applications like:

- Chatbots
- Real-time speech-to-text transcription
- Gaming, and many more.
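A rough sketch of how per-request latency can be measured (the `dummy_model` stand-in is hypothetical; in practice you would time `model.predict()` calls):

```python
import time
from statistics import median

def dummy_model(x):  # hypothetical stand-in for model.predict()
    return sum(x) / len(x)

latencies_ms = []
for _ in range(1_000):
    start = time.perf_counter()
    dummy_model([1.0, 2.0, 3.0])
    latencies_ms.append((time.perf_counter() - start) * 1_000)

# Report the median (robust to outliers); p95/p99 are also common in production.
print(f"median latency: {median(latencies_ms):.4f} ms")
```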

Throughput is the number of inference requests a model can handle in a given time period.

It estimates the model’s ability to process multiple requests simultaneously.

Yet again, as you may have guessed, high throughput is essential for applications with a high volume of incoming requests.

These include e-commerce websites, recommendation systems, social media platforms, etc. High throughput ensures that the model can serve many users concurrently without significant delays.

This refers to the amount of memory a model occupies when loaded for inference purposes.

It quantifies the memory footprint required to store all the parameters, configurations, and related data necessary for the model to make predictions or generate real-time outputs.
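A back-of-the-envelope sketch of this footprint is parameter count × bytes per parameter (the 110M figure is BERT-base’s approximate size, used purely as an example):

```python
n_params = 110_000_000  # e.g., BERT-base has roughly 110M parameters
bytes_per_param = 4     # float32

size_mb = n_params * bytes_per_param / 1024**2
print(f"~{size_mb:.0f} MB just for the weights")  # ~420 MB
```

The real footprint is larger still, since activations, buffers, and framework overhead also occupy memory at inference time.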

The significance of model size becomes particularly apparent when deploying models in resource-constrained environments.

Many production environments, such as mobile devices, edge devices, or IoT devices, have limited memory capacity.

It is obvious to guess that in such cases, the model’s size will directly impact whether it can be deployed at all.

Large models may not fit within the available memory, making them impractical for these resource-constrained settings.

This is a famous story.

In 2006, Netflix launched the “Netflix Prize,” a machine learning competition that encouraged ML engineers to build the best algorithm to predict user ratings for films.

**The grand prize was USD 1,000,000.**

After the competition concluded, Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company's recommendation engine by **10 percent**.

That’s a lot!

Yet, Netflix never used that solution because it was overly complex.

Here’s what Netflix said:

The increase in accuracy of the winning improvements did not seem to justify the engineering effort needed to bring them into a production environment.

The complexity and resource demands of the developed model made it impractical for real-world deployment. Netflix faced several challenges:

- **Scalability:** The model was not easily scalable to handle the vast number of users and movies on the Netflix platform. It would have required significant computational resources to make real-time recommendations for the millions of users they had.
- **Maintenance:** Managing and updating such a complex model in a production environment would have been a logistical nightmare. Frequent updates and changes to the model would be challenging to implement and maintain.
- **Latency:** The ensemble model's inference latency was far from ideal for a streaming service. Users expect near-instantaneous recommendations, but the complexity of the model made achieving low latency difficult.

You can read more about this story here: Netflix Prize story.

Consequently, Netflix never integrated the winning solution into its production recommendation system. Instead, they continued to use a simplified version of their existing algorithm, which was more practical for real-time recommendations.

This real-life instance from the Netflix Prize was a reminder that we must strive for a delicate balance between model complexity and practical utility.

**While highly complex models may excel in research and competition settings, they may not be suitable for real-world deployment due to scalability, maintenance, and latency concerns.**

In practice, simpler and more efficient models often are a better choice for delivering a seamless user experience in production environments.

Let me ask you this. Which of the following two models would you prefer to integrate into a user-facing product?

I strongly prefer Model B.

If you understand this, you resonate with the idea of keeping things simple in production.

Fortunately, there are various techniques that can help us reduce the size of the model, thereby increasing the speed of model inference.

These techniques are called **Model Compression** methods.

Using these techniques, you can reduce both the latency and size of the original model.

As the name suggests, model compression is a set of techniques used to reduce the size and computational complexity of a model while preserving or even improving its performance.

They aim to make the model smaller — that is why the name “**model compression**.”

Typically, it is expected that a smaller model will:

- Have a lower inference latency as smaller models can deliver quicker predictions, making them well-suited for real-time or low-latency applications.
- Be easy to scale due to their reduced computational demands.
- Have a smaller memory footprint.

In this article, we’ll look at four techniques that help us achieve this:

- Knowledge Distillation
- Pruning
- Low-rank Factorization
- Quantization

As we will see shortly, these techniques attempt to strike a balance between model size and accuracy, making it relatively easier to deploy models in user-facing products.

👉

The Jupyter notebook of this entire article has been provided at the bottom of the article.

Let’s understand them one by one!

This is one of the most common, effective, and reliable techniques to reduce model size, and one of my personal favorites.

Essentially, knowledge distillation involves training a smaller, simpler model (referred to as the “student” model) to mimic the behavior of a larger, more complex model (known as the “teacher” model).

The term can be broken down as follows:

**Knowledge:** Refers to the understanding, insights, or information that a machine learning model has acquired during training. This “knowledge” can be typically represented by the model’s parameters, learned patterns, and its ability to make predictions.

**Distillation:** In this context, distillation means transferring or condensing knowledge from one model to another. It involves training the student model to mimic the behavior of the teacher model, effectively transferring the teacher's knowledge.

This is a two-step process:

- Train the large model as you typically would. This is called the “teacher” model.
- Train a smaller model, which is intended to mimic the behavior of the larger model. This is also called the “student” model.

The primary objective of knowledge distillation is to transfer the knowledge, or the learned insights, from the teacher to the student model.

This allows the student model to achieve comparable performance with fewer parameters and reduced computational complexity.

The technique makes intuitive sense as well.

Of course, as in a real-world teacher-student scenario in an academic setting, the student model may never perform quite as well as the teacher model.

But with consistent training, we can create a smaller model that is **almost** as good as the larger one.

This goes back to the objective we discussed above:

Strike a balance between model size and accuracy, such that it is relatively easier to deploy models in user-facing products.

A classic example of a model developed in this way is DistilBERT. It is a student model of BERT.

We also discussed this in the newsletter here:

DistilBERT is approximately $40\%$ smaller than BERT, which is a massive difference in size.

Still, it retains approximately $97\%$ of the natural language understanding (NLU) capabilities of BERT.

**What’s more, DistilBERT is roughly 60% faster in inference.**

This is something I have personally experienced and verified in one of my research studies on Transformer models:

As shown above, on one of the studied datasets (SensEval-2), BERT achieved the best accuracy of $76.81$. With DistilBERT, it was $75.64$.

On another task (SensEval-3), BERT achieved the best accuracy of $80.96$. With DistilBERT, it was $80.23$.

Of course, DistilBERT isn’t as good as BERT. Yet, the performance difference is small.

Given the run-time performance benefits, it makes more sense to proceed with DistilBERT instead of BERT in a production environment.

💡

If you are interested in learning more about my research study, you can read it here: **A Comparative Study of Transformers on Word Sense Disambiguation**.

One of the biggest downsides of knowledge distillation is that one must still train a larger teacher model first to train the student model.

However, in a resource-constrained environment, it may not be feasible to train a large teacher model.

Assuming we are not resource-constrained at least in the development environment, one of the most common techniques for Knowledge Distillation is **Response-based Knowledge Distillation**.

As the name suggests, in **response-based knowledge distillation**, the focus is on matching the output responses (predictions) of the teacher model and the student model.

In a classification use case, this technique transfers the **probability distributions** of class predictions from the teacher to the student.

It involves training the student to produce predictions that are not only accurate but also mimic the soft predictions (probability scores) of the teacher model.

As we are trying to mimic the **probability distribution of the class predictions of the teacher model**, one ideal candidate for the loss function is KL divergence.

We discussed this in detail in one of the previous articles on t-SNE.

Yet, here’s a quick recap:

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:
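In symbols:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$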

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P \| Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.
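As a quick sanity check, here is a plain-Python sketch of discrete KL divergence (the helper `kl_divergence` is illustrative, not from the article):

```python
import math

# Discrete KL divergence D_KL(P || Q) for two probability lists.
# Terms with P(x) = 0 contribute nothing, so they are skipped.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # > 0: information is lost approximating P with Q
```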

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,
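Substituting $Q(x) = P(x)$:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{P(x)} = \sum_x P(x) \log 1$$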

Simplifying, we get:
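Since $\log 1 = 0$:

$$D_{KL}(P \| Q) = 0$$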

This is precisely what we intend to achieve in response-based knowledge distillation.

Simply put, we want the probability distribution of the class predictions of the student model to be identical to the probability distribution of the class predictions of the teacher model.

- First, we can train the teacher model as we typically would.
- Next, we can instruct the student model to mimic the probability distribution of the class predictions of the teacher model.

Let’s see how we can practically use response-based knowledge distillation using PyTorch.

More specifically, we shall train a slightly complex neural network on the MNIST dataset. Then, we will build a simpler neural network using the response-based knowledge distillation technique.

First, we import the required packages from PyTorch:
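A minimal set of imports for this walkthrough could look like this (`torchvision` would additionally be needed for the MNIST dataloaders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```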

Next, we load the MNIST dataset (train and test) and create their respective PyTorch dataloaders.

Now, we shall define a simple CNN-based neural network architecture. This is demonstrated below:
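Here is one possible sketch of such a CNN-based teacher for 28×28 MNIST inputs (the layer sizes and the name `TeacherNet` are assumptions, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class TeacherNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14x14 -> 7x7
        )
        self.fc = nn.Linear(32 * 7 * 7, 10)        # 10 MNIST classes

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x.flatten(1))               # logits, shape (N, 10)
```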

Moving on, we shall initialize the teacher model and define the loss function to train it, the `CrossEntropyLoss`.

Now, we will train the teacher model.

With this, we are done with the Teacher model.

Next, we must train the Student Model.

We defined the Teacher model as a CNN-based neural network architecture. Let’s define the Student model as a simple feed-forward neural network without any CNN layers:
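A possible sketch of the student network together with a response-based distillation loss (the names `StudentNet` and `distillation_loss` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain feed-forward network: no convolutional layers.
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x)

def distillation_loss(student_logits, teacher_logits):
    # Convert both outputs to (log-)probabilities with softmax,
    # then compute the KL divergence between the two distributions.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```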

The above method accepts two parameters:

- The output of the student model (`student_logits`).
- The output of the teacher model (`teacher_logits`).

We convert both outputs to probabilities using the softmax function.

Finally, we find the KL divergence between them and return it.

Moving on, we shall initialize the student model and define the optimizer as we did before.

Finally, it’s time to train the Student model.
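The training loop could then look roughly like this; a synthetic batch stands in for the MNIST dataloader, and the tiny teacher/student networks here are stand-ins for the ones defined earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

images = torch.randn(64, 1, 28, 28)     # stand-in batch
labels = torch.randint(0, 10, (64,))

for step in range(50):
    optimizer.zero_grad()
    with torch.no_grad():
        teacher_logits = teacher(images)  # teacher is frozen
    student_logits = student(images)
    # Blend the hard-label loss with the distillation (KL) loss.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    loss = ce + kd
    loss.backward()
    optimizer.step()
```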

To recap, the teacher model was a CNN-based neural network architecture. The student model, however, was a simple feed-forward neural network.

The following visual compares the performance of the teacher and the student model:

Before we understand what OOP is, let’s look at the backstory: what led us to formulate “OOP”?

Why did programmers think it was essential to have such a thing? What were the pain points in the traditional way of programming?

Traditional programming (or procedural programming) focuses on breaking a problem into steps or procedures executed one after the other.

While this approach was suitable for small-scale projects or experimentation, it eventually became problematic when projects transformed into large, complex systems.

In other words, it became difficult to manage, understand, and maintain the code as it grew in size and complexity.

One major challenge with traditional programming was the lack of structure and organization.

Before OOP, programs were often written in a linear fashion. This made it difficult to manage large amounts of code and understand the relationships between different parts of the program.

As a result...👇

...traditional programming resulted in code that was harder to maintain and debug. Consequently, it was prone to more bugs and errors.

This resulted in longer development cycles and reduced software quality.

The list of challenges with traditional programming is endless.

To overcome these challenges, some computer scientists envisioned a new way of programming that would allow developers to create more flexible and adaptable software programs.

The thought process behind the new programming paradigm was inspired by real-world entities. They have certain properties and behaviors, and can interact with one another.

Thus, scientists believed that programming could be modeled in a similar fashion.

And this is how OOP was born in the late 70s.

As the name suggests, Object-Oriented Programming (OOP) is a programming paradigm/technique based on the concepts of “objects.” That is why the name “object-oriented.”

This is in contrast to traditional programming where methods are executed in sequence.

The core idea of OOP is to model real-world objects and their interactions in a program.

Thus, each object is treated as a separate entity with different values for the properties.

For example, you can model a `car` object with properties such as `color`, `speed`, and `model`, and methods such as `start`, `stop`, and `accelerate`.

In response to the challenges posed by traditional programming, OOP was designed as a new programming paradigm.

Its primary aim was to improve code organization, reusability, maintainability, and scalability.

OOP introduced objects and classes (discussed below), which made it easier to organize code into manageable and reusable components.

With this organization, it was easier to understand the structure of a program, shorten the development lifecycle, and eliminate bugs and errors.

In addition to this, OOP also introduced some crucial concepts, such as inheritance. This provided a way to create relationships between classes.

Moreover, OOP allowed programmers to easily extend existing components and build complex systems. This drastically reduced the amount of code needed to be written again from scratch.

Next, let’s discuss some basic terminologies around classes and how they are defined in Python.

OOP is defined around some core terminologies and concepts. Let’s understand them below:

A class is a blueprint (or template) for instantiating objects. Recall the `Car` example discussed above. In that context, we can call `Car` a class.

Objects created from the same class will have the same properties and behaviors. However, the values for the properties may differ.

For instance, all car objects created from the `Car` class will have the same property names, such as `cost`, `color`, `speed`, `model`, etc. However, the values for these properties may differ, as shown below:

In Python, we define a class using the `class` keyword, followed by its name:
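For example:

```python
# Defining a (for now empty) class with the `class` keyword:
class Car:
    pass
```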

Inside a class, we define the attributes and methods that its object will have.

The variables defined within a class that store information about an object are called attributes. They can also be thought of as an object’s properties.

Attributes are of two types:

**2.1) Instance-level attributes**

As the name suggests, these attributes are unique to each instance of a class. In other words, every time we create an object, each object gets its copy of instance-level attributes.

For example, in the `Car` example above, the variables `cost`, `color`, `speed`, and `model` are instance-level attributes.

Also, these attributes usually have different values for each instance.

In Python, we assign the instance-level attributes in the `__init__` method of a class. As the name suggests, “init” lets us initialize an object.

As `__init__` is like any other Python function, we can also pass a bunch of parameters to it.

The first parameter of any method in a Python class is always the `self` keyword. It is a special variable that serves as a reference to the calling object. The `self` parameter is followed by the other parameters, as shown below:
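For example, using the attributes from the `Car` example:

```python
class Car:
    def __init__(self, color, cost, speed, model):
        # `self` refers to the object being created; the remaining
        # parameters are assigned to instance-level attributes.
        self.color = color
        self.cost = cost
        self.speed = speed
        self.model = model
```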

The parameters specified in the `__init__` method are assigned to the corresponding instance-level attributes.

**2.2) Class-level attributes**

In contrast to the instance-level attributes, which are unique to each object, these attributes are shared by all the instances of a class.

In other words, they are associated with the class itself, but not with any specific instance of the class. Therefore, they are defined outside the `__init__` method.

For example, you can have a class-level attribute for the `number_of_wheels` of a car. This attribute would have the same value for all objects created from the car class, regardless of the `color`, `cost`, or any other instance-level attributes.

In Python, class-level attributes are defined as follows:
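For example:

```python
class Car:
    # Class-level attribute: shared by all Car objects.
    number_of_wheels = 4

    def __init__(self, color):
        # Instance-level attribute: unique to each object.
        self.color = color
```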

To access the class-level attributes, we can do the following:
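For example (re-declaring a minimal `Car` here so the snippet is self-contained):

```python
class Car:
    number_of_wheels = 4  # class-level attribute

# Access through the class itself:
print(Car.number_of_wheels)    # 4
# ...or through any instance:
print(Car().number_of_wheels)  # 4
```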

The functions defined within the scope of a class are called methods.

They operate on the attributes of an object and are typically used to manipulate or retrieve the values stored in an object’s attributes.

An independent definition created anywhere in the program is called a “function”. However, a function is called a “method” when defined inside a class.

Like the `__init__` method, class methods are also defined within the scope of the class. Also, their first parameter is always the `self` keyword.

For instance, let’s define a `change_speed` method.
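For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def change_speed(self, new_speed):
        # Assign the new value to the instance-level attribute `speed`.
        self.speed = new_speed
```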

This method accepts the `new_speed` and assigns it to the instance-level attribute `speed`. As shown above, we reference any instance-level attribute through the `self` keyword using dot notation.

Note that a class method may or may not receive any parameters other than `self`. For instance, if we were to define a `stop_car` method, it is not necessary to pass the new speed as a parameter.
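For example:

```python
class Car:
    def __init__(self, speed):
        self.speed = speed

    def stop_car(self):
        # No parameter besides `self` is needed here.
        self.speed = 0
```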

Whenever we create an instance of a class, it is called an object.

Considering the `Car` class defined above, we can create a new object as follows:
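For example (using a minimal `Car` with `color` and `speed`):

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

# Instantiating the class; this invokes __init__.
my_car = Car("red", 0)
```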

The above statement invokes the `__init__` method defined in the class. As a result, the arguments get assigned to the respective instance-level attributes defined in the class.

Also, the object can access the attributes and methods defined in the class. These can be accessed using the dot notation, as shown below:
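For example:

```python
class Car:
    def __init__(self, color, speed):
        self.color = color
        self.speed = speed

    def change_speed(self, new_speed):
        self.speed = new_speed

my_car = Car("red", 0)
print(my_car.color)       # access an attribute: prints "red"
my_car.change_speed(20)   # call a method; `self` is passed implicitly
print(my_car.speed)       # prints 20
```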

When we call a class method, we don’t pass any value for the `self` parameter. For instance, even though the definition of `change_speed()` has two parameters, `self` and `new_speed`, the `self` parameter is automatically supplied by Python from the calling object.

Therefore, while calling the method, we only passed one value (`20`), which corresponds to the `new_speed`.

Whenever we define a new class, it creates a new datatype. For instance, if we use Python’s `type()` function and pass it the object `my_car`, we get the following:
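For example:

```python
class Car:
    pass

my_car = Car()
print(type(my_car))   # e.g. <class '__main__.Car'>
```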

This indicates that the `my_car` object is of type `Car`.

Thus, in addition to making your code more organized, readable and manageable, classes also offer a mechanism to create a user-defined datatype.

Next, let’s look at some of the fundamental magic methods in Python OOP and how they are used.

In Python, magic methods are methods that have double underscores at the beginning and end of their names, such as `__init__`, which we discussed above.

Do you know one of the biggest hurdles data science and machine learning teams face?

It is transitioning their data-driven pipeline from Jupyter Notebooks to an executable, reproducible, error-free, and organized pipeline.

And this is not something data scientists are particularly fond of doing.

Yet, this is an immensely critical skill that many overlook.

Machine learning deserves the rigor of any software engineering field. Training codes should always be reusable, modular, scalable, testable, maintainable, and well-documented.

To help you develop that critical skill, I'm excited to bring you a special guest post by Damien Benveniste. He is the author of The AiEdge newsletter and was a Machine Learning Tech Lead at Meta.

In today’s machine learning deep dive, he will provide a detailed guide on structuring code for machine learning development, one of the most critical yet overlooked skills by many data scientists.

I personally learned a lot from this one and I am sure you will learn a lot too.

Let’s begin!

I have always believed that machine learning deserves the rigor of any software engineering field. Training codes should be reusable, modular, scalable, testable, maintainable, and well-documented.

Today, I want to show you my template to develop quality code for machine learning development.

More specifically, we will look at:

- What does coding mean?
- Designing:
- System design
- Deployment process
- Class diagram

- The code structure:
- Directory structure
- Setting up the virtual environment
- The code skeleton
- The applications
- Implementing the training pipeline
- Saving the model binary

- Improving the code readability:
- Docstrings
- Type hinting

- Packaging the project

I often see many Data Scientists or Machine Learning Engineers developing in Jupyter notebooks, copy-pasting their code from one place to another, which gives me nightmares!

When running ML experiments, Jupyter is prone to human errors as different cells can be run in different orders. Yet, ideally, you should be able to capture all the configurations of an experiment to ensure reproducibility.

No doubt, Jupyter can be used to call a training package or an API and manually orchestrate experiments, but fully developing in Jupyter is an extremely risky practice.

For instance, when training a model, you should ensure the data is passed through the exact feature processing pipelines at serving (inference) time. This means using the same classes, methods, and identical versions of packages and hardware (GPU vs. CPU).

Personally, I prefer prototyping in Jupyter but developing in Pycharm or VSCode.

When programming, focus on the following aspects:

**Reusability:**
- It is the capacity to reuse code in another context or project without significant modifications.
- Code reusability can be achieved in several ways, such as through libraries, frameworks, modules, and object-oriented programming techniques.
- In addition, good documentation and clear code organization also facilitate code reuse by making it easier for other developers to understand and use the code.

**Modularity:**
- It is the practice of breaking down a software system into smaller, independent modules or components that can be developed, tested, and maintained separately.

**Scalability:**
- It refers to the ability of a software development codebase to accommodate the growth and evolution of a software system over time. In other words, it refers to the ability of the codebase to adapt to changing requirements, features, and functionalities while maintaining its overall structure, quality, and performance.
- To achieve codebase scalability, it is important to establish clear coding standards and practices from the outset, such as using version control, code review, and continuous integration and deployment.
- In addition, it is important to prioritize code maintainability and readability, as well as the use of well-documented code and clear naming conventions.

**Testability:**
- It refers to the ease with which software code can be tested to ensure that it meets the requirements and specifications of the software system.
- It can be achieved by designing code with testing in mind rather than treating testing as an afterthought. This can involve writing code that is modular, well-organized, and easy to understand and maintain, as well as using tools and techniques that support automated testing and continuous integration.

**Maintainability:**
- It refers to the ease with which software code can be modified, updated, and extended over time.

**Documentation:**
- It provides a means for developers, users, and other stakeholders to understand how the software system works, its features, and how to interact with it.

In Machine Learning, like any engineering domain, no line of code should be written until a proper design is established.

Having a design means that we can translate a business problem into a machine learning solution, **provided ML is indeed the right solution to the problem!**

For simplicity, let’s assume we want to build a mobile application where a user needs machine learning predictions displayed on the screen — personalized product recommendations, for instance.

The process workflow may appear as follows:

- The mobile application requests personalized predictions from the backend server.
- The backend server fetches predictions from a database.
- We figured that daily batch predictions were the most appropriate setup for now, and the machine learning service updates the predictions daily.

This process is depicted in the image below:

Before we can understand how to develop our model, we need to understand how we will deploy it. Let’s assume that, for our purposes, an inference application will be containerized in a Docker container.

The container can be deployed in a container registry such as AWS ECR (Amazon Elastic Container Registry) or Docker Hub. We can have an orchestration system such as Airflow that spins up the inference service, pulls the container from the registry, and runs the inference application.

Now that we know what we need to build and how it will be deployed, how we need to structure our codebase is becoming much clearer.

More specifically, we shall build two applications:

- An inference application.
- A training application.

To minimize potential human errors, it is imperative that the modules used at training time are the same as the ones used at inference time.

Let’s look at the following class diagram:

**The application layer:**
- This part of the code captures the application’s logic. Think about these modules as “buttons” that start the inference or training processes.
- We will have a `run()` function for each of those applications that will serve as handles for the Docker image to start those individual processes.

**The data layer:**
- This is the abstraction layer that moves data in and out of the applications. I am calling it the “data” layer, but I am including anything that needs to go into the outside world, like the data, the model binaries, the data transformer, the training metadata, etc.
- In this batch use case, we just need a function that brings the data into the applications, `get_data()`, and another that puts predictions back into the database, `put_data()`.
- The `DataConnector` class moves data around.
- The `ObjectConnector` is responsible for transferring model binaries and data transformation pipelines using `get_object()` and `put_object()`.

**The machine learning layer:** This is the module where all the different machine learning components will live. The three components of model training are:

- *Learning the parameters of the model*: the `Model` class takes care of that with its `fit()` method. For inference, we use the `predict()` method.
- *Learning the feature transformations*: we may need to normalize features, perform Box-Cox transformations, one-hot encode, etc. The `DataProcessor` takes care of that with its `fit()` and `transform()` methods.
- *Learning the hyperparameters of the model and data pipeline*: the `CrossValidator` handles this task with its `fit()` function.

The `TrainingPipeline` handles the logic between the different components.

Now that we have a class diagram, we must map it into actual code. Let’s call the project `machine_learning_service`.

Of course, there are many ways to do it, but we will organize the project as follows:

- **The “docs” folder:** for the documents.
- **The “src” folder:** for the source code (or the actual codebase).
- **The “tests” folder:** for the unit tests.

Going ahead, we assume we will Dockerize this project at some point. Thus, controlling the Python version and packages we use locally is crucial.

To do that, we will create a virtual environment called `env` using `venv`, which ships with the Python standard library and lets us create virtual environments effortlessly.

First, within the project folder, we run the following command to create a virtual environment:
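With the standard library’s `venv` module, that command is:

```shell
python3 -m venv env
```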

Next, we activate it as follows:
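The activation step (the creation command is repeated here so the snippet is self-contained; re-running it on an existing `env` is harmless):

```shell
python3 -m venv env          # skip if already created
source env/bin/activate
```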

Once done, we should see the following directory structure:

In the current directory, let’s check which Python binary is being picked up, so that we are sure to use the virtual environment’s binaries. We do this using the `which` command, as demonstrated below:
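For example:

```shell
which python3   # inside the activated env, this points into env/bin
```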

Next, let’s make sure that the Python version is Python 3:
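For example (inside the activated environment, plain `python` resolves to the same binary):

```shell
python3 --version   # expect: Python 3.x
```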

Okay, we are good to go!

Within the source folder, let’s create the different modules we have in the class diagram:
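One possible layout, with assumed file names mapping one module per class in the diagram:

```shell
# Assumed file names -- one module per class in the class diagram
mkdir -p src
touch src/model.py src/data_processor.py src/cross_validator.py \
      src/training_pipeline.py src/data_connector.py src/object_connector.py
```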

For now, let’s have empty classes.

- The `Model` class: this will be responsible for training the model on new data and predicting on unseen data:
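A possible skeleton (the method signatures are assumptions based on the class diagram):

```python
class Model:
    """Learns the parameters of the model."""

    def fit(self, X, y):
        # To be implemented: train the model on new data.
        ...

    def predict(self, X):
        # To be implemented: predict on unseen data.
        ...
```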

- The `DataProcessor` class: this handles the processing needed before the data is fed to the ML model, such as normalization, transformation, etc.
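A matching skeleton (again, the method signatures are assumptions from the class diagram):

```python
class DataProcessor:
    """Learns and applies the feature transformations."""

    def fit(self, X):
        # To be implemented: learn the transformations from the data.
        ...

    def transform(self, X):
        # To be implemented: apply the learned transformations.
        ...
```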

In a recent article, we devised the entire principal component analysis (PCA) algorithm from scratch.

We saw how projecting the data using the eigenvectors of the covariance matrix naturally emerged from the PCA optimization step.

Moving on, we discussed some of its significant limitations.

**Let’s look at them again!**

Many use PCA as a data visualization technique. This is done by projecting the given data into two dimensions and visualizing it.

While this may appear like a fair thing to do, there’s a big problem here that often gets overlooked.

**As we discussed in the PCA article**, after applying PCA, each new feature captures a fraction of the original variance.

**This means that two-dimensional visualization will only be helpful if the first two principal components collectively capture most of the original data variance, as shown below:**

If not, the two-dimensional visualization will be highly misleading and incorrect. This is because the first two components don’t capture most of the original variance well. This is depicted below:

Thus, using PCA for 2D visualizations is only recommended if the cumulative explained variance plot suggests so. If not, one should refrain from using PCA for 2D visualization.

PCA has two main steps:

- Find the eigenvectors and eigenvalues of the covariance matrix.
- Use the eigenvectors to project the data to another space.

👉

Why eigenvectors, you might be wondering? We discussed the origin of eigenvectors in detail in the PCA article. It is recommended to read that article before reading this article.

Projecting the data using eigenvectors creates uncorrelated features.

Nonetheless, the new features created by PCA $(x_0^{'}, x_1^{'}, x_2^{'})$ are always a linear combination of the original features $(x_0, x_1, x_2)$.

This is depicted below:

As shown above, every new feature in $X_{projected}$ is a linear combination of the features in $X$.

We can also prove this experimentally.

As depicted above, we have a linearly inseparable dataset. Next, we apply PCA and reduce the dimensionality to $1$. The dataset remains linearly inseparable.

On the flip side, if we consider a linearly separable dataset, apply PCA, and reduce the dimensions, we notice that the dataset remains linearly separable:

This proves that PCA is a linear dimensionality reduction technique.

However, not all real-world datasets are linear. In such cases, PCA will underperform.

As discussed above, PCA’s primary objective is to capture the overall (or global) data variance.

In other words, PCA aims to find the orthogonal axes along which the **entire dataset** exhibits the most variability.

Thus, during this process, it does not pay much attention to the local relationships between data points.

**It inherently assumes that the global patterns are sufficient to represent the overall data variance.**

This is demonstrated below:

Because it primarily emphasizes the global structure, PCA is not ideal for visualizing complex datasets where the underlying structure might rely on local relationships or pairwise similarities.

In cases where the data is nonlinear and contains intricate clusters or groups, PCA can fall short of preserving these finer details, as shown in another illustration below:

As depicted above, data points from different clusters may overlap when projected onto the principal components.

This leads to a loss of information about intricate relationships within and between clusters.

So far, we understand what’s specifically lacking in PCA.

While the overall approach is indeed promising, its limitations make it impractical for many real-world datasets.

**t-distributed stochastic neighbor embedding (t-SNE)** is a powerful dimensionality reduction technique **mainly used to visualize high-dimensional datasets** by projecting them into a lower-dimensional space (typically 2-D).

As we will see shortly, t-SNE addresses each of the above-mentioned limitations of PCA:

- It is well-suited for visualization.
- It works well for linearly inseparable datasets.
- It focuses beyond just capturing the global data relationships.

t-SNE is an improvement to the **Stochastic Neighbor Embedding** (SNE) algorithm. It is observed that in comparison to SNE, t-SNE is much easier to optimize.

💡

t-SNE and SNE are two different techniques, both proposed by one common author — Geoffrey Hinton.

So, before getting into the technical details of t-SNE, let’s spend some time understanding the SNE algorithm instead.

To begin, we are given a high-dimensional dataset, which is difficult to visualize.

That is, its dimensionality is higher than $3$ (typically, much higher than $3$).

The objective is to project it to a lower dimension (say, $2$ or $3$), such that the lower-dimensional representation preserves as much of the local and global structure in the original dataset as possible.

Let’s understand!

Imagine this is our high-dimensional dataset:

💡

Of course, the above dataset is not a high-dimensional dataset. But for the sake of simplicity, let’s assume that it is.

Local structure, as the name suggests, refers to the arrangement of data points that are close to each other in the high-dimensional space.

Thus, preserving the local structure would mean that:

- Red points should stay closer to other red points.
- Blue points should stay closer to other blue points.
- Green points should stay closer to other green points.

So, is preserving the local structure sufficient?

**Absolutely not!**

If we were to focus solely on preserving the local structure, it may lead to a situation where blue points indeed stay closer to each other, but they overlap with the red points, as shown below:

This is not desirable.

Instead, we also want the low-dimensional projections to capture the global structure.

Thus, preserving the global structure would mean that:

- The red cluster is well separated from the other cluster.
- The blue cluster is well separated from the other cluster.
- The green cluster is well separated from the other cluster.

To summarize:

- Preserving the local structure means maintaining the relationships among nearby data points within each cluster.
- Preserving the global structure involves maintaining the broader trends and relationships that apply across all clusters.

Let’s stick to understanding this as visually as possible.

Consider the above high-dimensional dataset again.

Euclidean distance is a good measure to know if two points are close to each other or not.

For instance, in the figure above, it is easy to see that `Point A` and `Point B` are close to each other, but `Point C` is much farther from `Point A`.

Thus, the first step of the **SNE algorithm** is to convert these high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

For better understanding, consider a specific data point in the above dataset:

Here, points that are closer to the marked point will have smaller Euclidean distances, while other points that are far away will have larger Euclidean distances.

Thus, as mentioned above, for every data point $(i)$, the **SNE algorithm** first converts these high-dimensional Euclidean distances into conditional probabilities $p_{j|i}$.

Here, $p_{j|i}$ represents the conditional probability that a point $x_i$ will pick another point $x_j$ as its neighbor.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

This makes intuitive sense as well. To elaborate further, consider a Gaussian centered at $x_i$.

It is evident from the above Gaussian distribution centered at $x_i$ that:

- For points near $x_i$, $p_{j|i}$ will be relatively high.
- For points far from $x_i$, $p_{j|i}$ will be small.

So, to summarize, for a data point $x_i$, we convert its Euclidean distances to all other points $x_j$ into conditional probabilities $p_{j|i}$.

This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at $x_i$.

Also, as we are **only interested in modeling pairwise similarities**, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.

💡

A Gaussian for a point $x_i$ will be parameterized by two parameters: $(\mu_i, \sigma_i^2)$. As the x-axis of the Gaussian measures Euclidean distance, the mean $(\mu_i)$ is zero. What about $\sigma_i^2$? We’ll get back to it shortly. Until then, just assume that we have somehow figured out the ideal value $\sigma_i^2$.

Based on what we have discussed so far, our conditional probabilities $p_{j|i}$ may be calculated using a Gaussian probability density function as follows:
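Concretely, the (unnormalized) similarity is:

$$p_{j|i} \propto \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)$$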

**However, there’s a problem here.**

Yet again, let’s understand this visually.

Earlier, we considered the data points in the same cluster to be closely packed.

Thus, the resultant conditional probabilities were also high:

However, the data points of a cluster might be far from other clusters. Yet, they themselves can be a bit more scattered, as depicted below:

If we were to determine the resultant conditional probabilities $p_{j|i}$ for the marked data point $x_i$, we will get:

In this case, even though the data points belong to the same cluster, the conditional probabilities are much smaller than what we had earlier.

We need to fix this, and a common way to do this is by normalizing the individual conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$ by the sum of all conditional probabilities $p_{k|i}$.

Thus, we can now estimate the final conditional probability $p_{j|i}$ as follows:
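That is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$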

- The numerator is the conditional probability between $(x_i, x_j) \rightarrow p_{j|i}$.
- Each of the terms inside the summation in the denominator is the conditional probability between $(x_i, x_k) \rightarrow p_{k|i}$.

To reiterate, we are **only interested in modeling pairwise similarities**. Thus, we set the value of $p_{i|i}=0$. In other words, a point cannot be its own neighbor.
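To make this concrete, here is a minimal NumPy sketch of the normalized conditional probabilities described above (the function name and vectorization choices are my own; the per-point bandwidths $\sigma_i$ are assumed to be known for now):

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Compute p_{j|i} for every pair of points in X.

    X:     (n, d) array of high-dimensional points.
    sigma: (n,) array, one bandwidth per point (assumed known here).
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian densities centered at each x_i
    P = np.exp(-sq_dists / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor: p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)  # normalize each row so that sum_j p_{j|i} = 1
    return P
```

Each row of the returned matrix is one conditional distribution $p_{\cdot|i}$, so each row sums to one and nearby points receive higher probability than distant ones.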

Recall the objective again.

We intend to project the given high-dimensional dataset to lower dimensions (say, $2$ or $3$).

Thus, for every data point $x_i \in \mathbb{R}^n$, we define its counterpart $y_i \in \mathbb{R}^2$:

💡

$y_i$ does not necessarily have to be in two dimensions. Here, we have defined $y_i \in \mathbb{R}^2$ just to emphasize that $n$ is larger than $2$.

Next, we use a similar notion to compute the pairwise conditional probability using a Gaussian, which we denote as $q_{j|i}$ in the low-dimensional space.

Furthermore, to simplify calculations, we fix the standard deviation of the Gaussians used for $q_{j|i}$ to $\frac{1}{\sqrt{2}}$ (i.e., variance $\frac{1}{2}$), so that $2\sigma^2 = 1$ and the exponent reduces to $-\lVert y_i - y_j \rVert^2$.

As a result, we can denote $q_{j|i}$ as follows:

Yet again, as we are **only interested in modeling pairwise similarities**, we set the value of $q_{i|i}=0$.
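A matching NumPy sketch for the low-dimensional conditionals (names are mine; the variance is fixed here so that $2\sigma^2 = 1$, which is my reading of the simplification above):

```python
import numpy as np

def low_dim_conditionals(Y):
    """Compute q_{j|i} for the low-dimensional map points Y of shape (n, 2).

    With the variance fixed (sigma^2 = 1/2 here, so 2*sigma^2 = 1),
    the exponent reduces to -||y_i - y_j||^2.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-sq_dists)
    np.fill_diagonal(Q, 0.0)           # q_{i|i} = 0
    Q /= Q.sum(axis=1, keepdims=True)  # each row sums to 1
    return Q
```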

The objective is to have conditional probabilities in the low-dimensional space $(q_{j|i})$ identical to the conditional probabilities in the high-dimensional space $(p_{j|i})$.

Thus, if the projected data points $y_i$ and $y_j$ are to correctly model the similarity between the high-dimensional data points $x_i$ and $x_j$, the conditional probabilities $q_{j|i}$ and $p_{j|i}$ must be nearly equal.

**This hints that we must minimize the difference between the two conditional probabilities — $q_{j|i}$ and $p_{j|i}$.**

One of the most common and popular ways to quantify the difference between two probability distributions is KL Divergence.

As we have covered this previously, we won’t get into much detail:

...but here’s a quick overview of what it does.

The core idea behind KL divergence is to assess how much information is lost when one distribution is used to approximate another.

Thus, the more information is lost, the higher the KL divergence, and the greater the dissimilarity between the two distributions.

KL divergence between two probability distributions $P(x)$ and $Q(x)$ is calculated as follows:

The formula for KL divergence can be read as follows:

The KL divergence $D_{KL}(P || Q)$ between two probability distributions $P$ and $Q$ is calculated by summing the above quantity over all possible outcomes $x$. Here:

- $P(x)$ represents the probability of outcome $x$ occurring according to distribution $P$.
- $Q(x)$ represents the probability of the same outcome occurring according to distribution $Q$.

It measures how much information is lost when using distribution $Q$ to approximate distribution $P$.

Imagine this. Say $P$ and $Q$ were identical. This should result in zero loss of information. Let’s verify this from the formula above.

If the probability distributions $P$ and $Q$ are identical, it means that for every $x$, $P(x) = Q(x)$. Thus,

Simplifying, we get:

Hence proved.
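The discrete form of this formula translates directly into code. A small sketch (the helper name and the $\epsilon$ guard against $\log 0$ are my own additions):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # eps guards against taking log of zero probabilities
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

P = np.array([0.2, 0.5, 0.3])
# Identical distributions lose no information -> divergence is zero
assert abs(kl_divergence(P, P)) < 1e-9
```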

Let’s look at another illustration below:

Here, we have an observed distribution (the histogram in blue). The objective is to quantify whether it is more like a Gaussian distribution (left) or a Gamma distribution (right).

By measuring the KL divergence, we notice that the observed distribution resembles a Gamma distribution (right).

This is because the KL divergence between:

- The observed distribution and Gamma distribution is low.
- The observed distribution and Gaussian distribution is high.

Based on this notion, KL divergence between two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ can potentially be a loss function in our problem.

The goal will be to find the optimal projected points $y_i$ (these are like model parameters), such that the KL divergence between the two conditional probability distributions – $q_{j|i}$ and $p_{j|i}$ – is minimized.

Thus, once we have formulated a KL divergence-based loss function, we can minimize it with respect to the points $y_i$. Further, we can use gradient descent to update the points $y_i$.

Let’s do it.

The cost function $C$ of the **SNE algorithm** is given by:

Here:

- $P_i$ denotes the conditional probability distribution over all other data points given datapoint $x_i$.
- $Q_i$ denotes the conditional probability distribution over all other map points given map point $y_i$.

For every pair of points $(i, j)$, we have the following notations:

- $(x_i, x_j) \rightarrow$ the high-dimensional data points $(x_i, x_j)$.
- $(y_i, y_j) \rightarrow$ the low-dimensional mapping of $(x_i, x_j)$.
- $p_{j|i} \rightarrow$ the probability density of the Euclidean distance to $x_j$ when $x_i$ is the center point.
- $q_{j|i} \rightarrow$ the probability density of the Euclidean distance to $y_j$ when $y_i$ is the center point.

In a gist, the rightmost formulation denotes the pairwise similarities we intend to capture. That is why we get two summations.

We can further simplify the above cost function $C$ as follows:

The first term will always be a constant. For a given dataset $X$, the pairwise probability densities can never change.

Also, this term is independent of $q_{j|i}$ (or $y_i \rightarrow$ the model parameters). Thus, we can safely ignore the first term.

Now, the loss function looks more like a cross-entropy loss function with an additive constant.

Finding the gradient with respect to $y_i$, we get the following:

👉

Although the above gradient expression is quite simple, it is a bit complicated to derive. For those who are interested, I have provided a derivation at the end of the article.

Assuming $\gamma$ to be the set of all mapped points in the low-dimensional space:

We can update the parameters as follows:
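As a sketch of this update, using the gradient expression shown above (plain gradient descent; the original SNE formulation also adds a momentum term, which I leave out here, and the function names are my own):

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """Gradient of the SNE cost w.r.t. each map point y_i:
    dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j).
    P and Q are (n, n) matrices with P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}."""
    coeff = P - Q + P.T - Q.T                 # (n, n) mismatch terms
    diffs = Y[:, None, :] - Y[None, :, :]     # (n, n, 2): y_i - y_j
    return 2.0 * np.sum(coeff[:, :, None] * diffs, axis=1)

def gradient_step(P, Q, Y, lr=0.1):
    """One plain gradient-descent update of the map points."""
    return Y - lr * sne_gradient(P, Q, Y)
```

Note that when $Q$ already matches $P$ exactly, the gradient vanishes and the map points stop moving, exactly as we would want.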

We are yet to discuss the variance computation of the probability density function $p_{j|i}$ we defined for the high-dimensional data points:

$\sigma_i^2$ denotes the variance of the Gaussian centered over each high-dimensional datapoint $x_i$.

It is unlikely that we can specify a single value of $\sigma_i^2$ for all data points. This is evident from the figure below:

The density of the data will vary across all data points.

Thus, in dense regions, we need a small value of $\sigma_i^2$, but in sparse regions, we need a larger value of $\sigma_i^2$.

While finding the absolute best value of $\sigma_i^2$ for every data point $x_i$ appears to be a bit difficult, we can still try.

This is where we introduce a **user-specified hyperparameter** called `perplexity`.
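To sketch how perplexity pins down $\sigma_i$: perplexity is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of the conditional distribution $p_{\cdot|i}$, and since it increases monotonically with $\sigma_i$, a binary search recovers the matching bandwidth. A hedged sketch (function name, search bounds, and tolerances are my own):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the bandwidth sigma_i so that the perplexity of the
    conditional distribution p_{.|i}, defined as 2**H(P_i), matches the target.
    sq_dists_i: squared distances from x_i to every other point (self excluded)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = np.exp(-sq_dists_i / (2 * sigma**2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perp = 2.0 ** entropy
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat -> shrink sigma
        else:
            lo = sigma  # distribution too peaked -> grow sigma
    return sigma
```

Intuitively, a larger target perplexity asks every point to treat more of its neighbors as "close," which forces a larger $\sigma_i$ in sparse regions and a smaller one in dense regions.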

**Dimensionality reduction** is crucial to gain insight into the underlying structure of high-dimensional datasets.

One technique that typically stands out in this respect is the **Principal Component Analysis (PCA)**.

In a gist, the core objective of PCA is to transform the data $X$ into another space such that new features are uncorrelated and preserve the variance of the original data.

While PCA is widely popular, I have noticed that most folks struggle to design its true end-to-end formulation from scratch.

Many try to explain PCA by relating it to the idea of eigenvectors and eigenvalues, which is true — PCA does perform transformations using eigenvectors and eigenvalues.

But why?

**There are two questions to ask here:**

- **How can we be sure that the data projection that involves eigenvectors and eigenvalues in PCA is the most obvious solution to proceed with?**
- **How can we be sure that the above transformation does preserve the data variance?**

In other words, where did this whole notion of eigenvectors and eigenvalues originate from in PCA?

I have always believed that it is not only important to be aware of techniques, but it is also crucial to know how these techniques are formulated end-to-end.

And the best way to truly understand any algorithm or methodology is by manually building it from scratch.

It is like replicating the exact thought process when someone first designed the algorithm by approaching the problem logically, with only mathematical tools.

Thus, in this dimensionality-reduction article series, we’ll dive deep into one of the most common techniques: **Principal Component Analysis (PCA)**.

We’ll look at:

- The intuition and the motivation behind dimensionality reduction.
- What are vector projections and how do they alter the mean and variance of the data?
- What is the optimization step of PCA?
- What are Lagrange Multipliers and how are they used in PCA optimization?
- What is the final solution obtained by PCA?
- How to determine the number of components in PCA?
- Why should we be careful when using PCA for visualization?
- What are the advantages and disadvantages of PCA?
- Takeaways.

Let’s begin!

Imagine someone gave you the following weight and height information about a few individuals:

It is easy to guess that the height column has more variation than weight.

Thus, even if you were to discard the **weight** column, chances are you could still identify these individuals solely based on their heights.

However, if you were to discard the **height** column, you would likely struggle to identify them.

Why?

Because their heights have more variations than their weights, and it’s clear from this example that, **typically**, if the data has more variation, it holds more information (this might not always be true, though, and we’ll learn more about this later).

**That’s the essence of many dimensionality reduction techniques.**

Simply put, we reduce dimensions by eliminating useless (or least useful) features.

Based on this explanation, one plausible and intuitive way might be to measure column-wise variance and eliminate the $k$ features with the least variance.

This is depicted below:
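This naive variance-based elimination can be sketched in a few lines of NumPy (the function name is mine; as the article discusses next, this is only a reasonable baseline when the features are uncorrelated):

```python
import numpy as np

def drop_low_variance(X, k):
    """Keep all but the k columns of X with the smallest variance."""
    variances = X.var(axis=0)
    # Indices of the (d - k) highest-variance columns, in original order
    keep = np.sort(np.argsort(variances)[k:])
    return X[:, keep]
```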

However, there’s a problem with this approach.

Almost all datasets have correlated features. Thus, in the case of high correlation, this puts a challenge on which feature we should eliminate or retain in the final dataset.

Removing one feature that is highly correlated with another may lead to an incoherent dataset, as both features can be equally important. As a result, it may lead to misleading conclusions.

**Thus, the above approach of eliminating features based on feature variance is only ideal if the features we begin with are entirely uncorrelated.**

So, can we do something to make the features in our original dataset uncorrelated?

Can we apply any transformation that does this?

Of course we can!

For instance, consider the above dummy dataset again where $x_1$ and $x_2$ were correlated:

As shown above, if we can represent the data in a new coordinate system $(x_1^{'}, x_2^{'})$, it is easy to see that there is almost no correlation between the new features.

What's more, the data varies mostly along the dimension $(x_1^{'})$.

As a result, in this new coordinate system, we can safely discard the dimension along which the original data has the least variance, i.e., $(x_2^{'})$.

Think of it this way. If we were to discard the dimension $(x_1^{'})$, we would get the following low-dimensional representation of the data:

It makes much more sense to discard $(x_2^{'})$ because the data varies less along that dimension.

So overall, these are the steps we need to follow:

**Step #1)** Develop a new coordinate system for our data such that the new features are uncorrelated.

**Step #2)** Find the variance along each of the new uncorrelated features.

**Step #3)** Discard the features with the least variance to get the final “dimensionally-reduced dataset.”

Here, it is easy to guess that steps $2$ and $3$ aren’t that big of a challenge. Once we have obtained the new uncorrelated features, the only objective is to find their variance and discard the features with the least variances.

Thus, the primary objective is to develop the new coordinate system, $(x_1^{'}, x_2^{'})$ in the above illustration.

In other words, we want to project the data to a low-dimensional space to preserve as much of the original variance as possible while effectively reducing the dimensions.

**As we’ll see shortly, this becomes more like an optimization problem.**

So, how do we figure out the best projection for our data?

As the name suggests, vector projection is about finding the component of one vector that lies in the direction of another vector.

For instance, consider a toy example:

Given a vector $X = (4, 5)$, its projections along the coordinate axes $\hat x_1$ and $\hat x_2$ are:

What we did was we projected the original vector along the direction of some other vector.

A little complicated version of this could be to project it along a vector that is different from the coordinate vectors, $u$ in the following figure:

For simplicity, let’s consider the vector $u$ to be a unit vector — one which has a unit length.

By the notion of cosine similarity, we know that the cosine of the angle between two vectors is given by:

Thus, we can write the length of the projection of $\vec{a}$ on $\vec{b}$ as:

Now, we know the magnitude of the projection. Next, we can do a scalar-vector multiplication of this magnitude with the unit vector in the desired direction ($\hat b$) to get the projection vector:

Substituting the value of $\cos(\theta)$ and simplifying, we get the final projection as:
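The whole derivation collapses into two lines of NumPy (the `project` helper is my own naming):

```python
import numpy as np

def project(a, b):
    """Project vector a onto the direction of vector b.
    proj_b(a) = (a . b_hat) * b_hat, where b_hat = b / ||b||."""
    b_hat = b / np.linalg.norm(b)
    return (a @ b_hat) * b_hat

a = np.array([4.0, 5.0])
print(project(a, np.array([1.0, 0.0])))  # component along the x1-axis -> [4. 0.]
```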

Until now, we considered just one data point for projection. But our dataset will indeed have many more data points.

The above projection formula will still hold.

In other words, we can individually project each of the vectors as depicted below:

However, the projection will alter the mean and variance of the individual features.

But it’s simple to figure out the new mean and variance after the projection.

Essentially, all data points are vectors in some high-dimensional space. Thus, the mean of the individual features will also be a vector in that same space $(\hat \mu)$.

As a result, it makes intuitive sense that the transformation to the individual data points will equally affect the mean vector as well.

With this in mind, we can get the projected mean vector as follows:

Similarly, we can also determine the variance of the individual features after projection as follows:

So, to summarize, if we are given some data points X as follows:

💡

Each data point can be thought of as a vector $x_i$.

If we want to project these vectors along some unit vector $\hat b$:

...then the projection for a specific data point $x_i$ is given by:

If the individual features had a mean vector $\vec{\mu}$, then the mean of the projections is given by:

What’s more, if the individual features had a variance vector $\vec{\sigma^2}$, then the variance of the projections is given by:

Finally, the idea of finding the variance of the projection can be extended to the covariance matrix of the projection as well.

In simple terms, while variance $\vec{\sigma^2}$ above denoted the individual feature variances, the covariance matrix holds the information about how two features vary together.

Say $\Sigma$ denotes the covariance matrix for the original features:

Then the covariance matrix of the projections $\Sigma_{proj}$ can be written as:
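We can verify this numerically. For a single unit direction $\hat b$, $\hat b^T \Sigma \hat b$ is a scalar equal to the variance of the scalar projections (the dummy data below is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data via a linear mix of independent normals
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.7], [0.7, 1.0]])
b_hat = np.array([1.0, 1.0]) / np.sqrt(2)  # unit projection direction

proj = X @ b_hat                 # scalar projection of every point onto b_hat
Sigma = np.cov(X, rowvar=False)  # sample covariance matrix of the original features

# Variance of the projections equals b_hat^T Sigma b_hat
assert np.isclose(proj.var(ddof=1), b_hat @ Sigma @ b_hat)
```

The same identity applied to the mean holds exactly too: the mean of the projections equals the projected mean vector, since projection is a linear operation.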

With this, we are set to proceed with formulating PCA as an optimization problem.

In one of the earlier articles, we formulated the entire linear regression algorithm from scratch.

In that process, we saw the following:

- How the assumptions originate from the algorithmic formulation of linear regression.
- Why they should not be violated.
- What to do if they get violated.
- and much more.

Overall, the article highlighted the statistical essence of linear regression and under what assumptions it is theoretically formulated.

Linear regression models a linear relationship between the features $(X)$ and the true/observed output variable $(y)$.

The estimate $\hat y$ is written as:

Where:

- $y$ is the observed/true dependent variable.
- $\hat y$ is the modeled output.
- $X$ represents the independent variables (or features).
- $θ$ is the estimated coefficient of the model.
- $\epsilon$ is the random noise in the output variable. This accounts for the variability in the data that is not explained by the linear relationship.

The primary objective of linear regression is to estimate the values of $\theta=(\theta_1,\theta_2,⋯,\theta_n)$ that most closely estimate the observed dependent variable $y$.

Talking specifically about the $\epsilon$ – the random noise, we assume it to be drawn from a Gaussian:

Another way of formulating this is as follows:

The above says that the conditional distribution of $Y$ given $X$ is a Gaussian with:

- Mean = $θ^TX$ (dependent on X).
- Variance = $\sigma^2$ (independent of $X$, i.e., constant).

A graphical way of illustrating this is as follows:

The regression line models the mean of the Gaussian distributions, and the mean varies with $X$. Also, all Gaussians have an equal variance.

So, in a gist, with linear regression, we are trying to explain the dependent variable $Y$ as a function of $X$.

We are given $X$, and we have assumed a distribution for $Y$. $X$ will help us model the mean of the Gaussian.

**But it’s obvious to guess that the above formulation raises a limitation on the kind of data we can model with linear regression.**

More specifically, problems will arise when, just like the mean, the variance is also a function of $X$.

However, during the linear regression formulation, we explicitly assumed that the variance is constant and never depends on $X$.

We aren’t done yet.

There’s another limitation.

Linear regression assumes a very specific form for the mean. In other words, the mean should be the linear combination of features $X$, as depicted below:

Yet, there’s every chance that the above may not hold true, and instead, the mean is represented as follows:

So, to summarize, we have made three core assumptions here:

**#1)** If we consider the conditional distribution $P(Y|X)$, we assume it to be a Gaussian.

**#2)** $X$ affects only the mean, and that too in a very specific way, which is linear in the individual features $x_j$.

**#3)** Lastly, the variance is constant for the conditional distribution $P(Y|X)$ across all levels of $X$.

But nothing stops real-world datasets from violating these assumptions.

In many scenarios, the data might exhibit complex and nonlinear relationships, heteroscedasticity (varying variance), or even follow entirely different distributions altogether.

Thus, we need an approach that allows us to adapt our modeling techniques to accommodate these real-world complexities.

**Generalized linear models attempt to relax these things.**

More specifically, they consider the following:

- What if the distribution isn’t normal **but some other distribution from the exponential family?**
- What if $X$ has a more sophisticated relationship with the mean?
- What if the variance varies with $X$?

Let’s dive in!

As the name suggests, generalized linear models (GLMs) are a generalization of linear regression models.

Thus, the linear regression model is a specific case of GLMs.

They expand upon the idea of linear regression by accommodating a broader range of data distributions.

Unlike traditional linear regression, which assumes that the response variable follows a normal distribution and has a linear relationship with the predictors (as discussed above), GLMs can handle various response distributions – Binomial, Poisson, Gamma distributions, and more.

However, before proceeding ahead, it is essential to note that with GLMs, the core idea still revolves around building “**linear models**.” Thus, in GLMs, we never relax the linear formulation:

We’ll understand why we do this shortly.

In GLMs, the first thing we relax is the conditional distribution $P(Y|X)$, which we assumed to be a Gaussian earlier.

We change the normal distribution to some other distribution from the exponential family of distributions, such as:

**Why specifically the exponential family?**

Because, as we will see ahead, everything we typically do with the linear model with a Gaussian extends naturally to this family. This includes:

- The way we formulate the likelihood function
- The way we derive the maximum likelihood estimators, etc.

But the above formulation gets extremely convoluted when we go beyond these distributions.

This is because, as the name suggests, the probability density functions (PDFs) of distributions in the exponential family can be manipulated into an exponential representation.

This inherent structure simplifies the mathematical manipulations we typically do in maximum likelihood estimation (MLE):

**#1) Define the likelihood function for the entire dataset:** Here, we typically assume that the observations are independent. Thus, the likelihood function for the entire dataset is the product of the individual likelihoods.

**#2) Take the logarithm (the obtained function is called the log-likelihood):** To simplify calculations and avoid numerical issues, it is common to take the logarithm of the likelihood function. *This step gets simplified if the likelihood values have an exponential term, which the exponential family of distributions can be manipulated to possess.*

**#3) Maximize the log-likelihood:** Finally, the goal is to find the set of parameters $\theta$ that maximize the log-likelihood function.

More specifically, when we take the logarithm in the MLE process, it becomes easier to transform the product in the likelihood function into a summation. This can be further simplified with the likelihood involving exponential terms.

We also saw this in one of the earlier articles:

Lastly, the exponential family of distributions can explain many natural processes.

This family encapsulates many common data distributions encountered in real-world scenarios.

- Count data $\rightarrow$ Poisson distribution may help.
- Binary Outcomes $\rightarrow$ Bernoulli distribution.
- Continuous Positive-Valued Data $\rightarrow$ Gamma distribution.
- The time between events $\rightarrow$ try Exponential Distribution.

So overall, we alter the specification of the normal distribution to some other distribution from the exponential family.

There's another thing we change.

Earlier, we had the formulation that the mean is directly the linear combination of the features.

In GLMs, we extend it to the idea that some function of the mean is the linear combination of the features.

Here, $F$ is called the link function.

While the above notation of putting the link function onto the mean may not appear intuitive, it’s important to recognize that this approach serves a larger purpose within GLMs.

By putting the transformation on the mean instead, we keep the essence of a linear model intact while also accommodating diverse response distributions.

Understanding the purpose of the link function is also simple.

Imagine you had a Bernoulli random variable for the conditional distribution of $Y$ given $X$.

We know that its mean ($p$ or $\mu(x)$) will always lie between [0,1].

However, if we did not add a link function and instead preferred the following representation, the mean could take any real value, as $\theta^T \cdot X$ can span all real values.

What the link function does is it maps the mean ($\mu(x)$), which is constrained to some specific range, to all possible real values so that the compatibility between the linear model and the distribution’s mean is never broken.

That is why they are also called “**link**” functions. It is because they link the linear predictor $\theta^T \cdot X$ and the parameter for the probability distribution $\mu(x)$.
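As a concrete instance, the canonical link for a Bernoulli response is the logit, which maps $\mu \in (0, 1)$ to the whole real line; its inverse (the sigmoid) maps any linear predictor back into $(0, 1)$:

```python
import numpy as np

# For a Bernoulli response, the mean mu lies in (0, 1). The logit link maps it
# onto the whole real line, so it can be equated with the linear predictor theta^T x.
def logit(mu):
    return np.log(mu / (1 - mu))

def inverse_logit(eta):
    """Map any real linear predictor eta back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

print(logit(0.5))  # 0.0 -> a 50% mean corresponds to a zero linear predictor
```

However large or small the linear predictor gets, the inverse link keeps the implied mean inside the valid range of the Bernoulli parameter.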

Thus, to summarize, there are three components in GLMs.

One of the major aspects of training any reliable ML model is avoiding **overfitting**.

In a gist, overfitting occurs when a model learns to perform exceptionally well on the training data.

This may happen because the model is trying too hard to capture all **unrelated and random noise** in our training dataset, as shown below:

In this context, noise refers to any random fluctuations or inconsistencies that might be present in the data.

While learning this noise leads to a lower train set error and lets your model capture more intricate patterns in the training set, it comes at the tremendous cost of poor generalization on unseen data.

One of the most common techniques used to avoid overfitting is **regularization**.

Simply put, the core objective of regularization is to penalize the model for its complexity.

And if you have taken any ML course or read any tutorials about this, the most common approach they teach is to add a penalty (or regularization) term to the cost function, as shown below:

In the above expressions:

- $y^{(i)}$ is the true value corresponding to sample $i$.
- $\hat y^{(i)}$ is the model’s prediction corresponding to sample $i$, which is dependent on the parameters $(\theta_1, \theta_2, \cdots, \theta_K)$.
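The regularized cost above (squared error plus an L2 penalty) can be sketched as follows (the function name and the `lam` argument for the regularization strength are my own):

```python
import numpy as np

def l2_regularized_loss(y, y_hat, theta, lam):
    """Squared-error loss plus an L2 penalty on the parameters.
    lam controls how strongly model complexity is penalized."""
    squared_error = np.sum((y - y_hat) ** 2)
    penalty = lam * np.sum(theta ** 2)  # sum of squared parameters
    return squared_error + penalty
```

Setting `lam = 0` recovers the plain unregularized loss; increasing it penalizes large parameter values more heavily.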

While we can indeed validate the effectiveness of regularization experimentally (as shown below), it’s worth questioning the origin of the regularization term **from a probabilistic perspective**.

More specifically:

- Where did this regularization term originate from?
- What does the regularization term precisely measure?
- What real-life analogy can we connect it to?
- Why do we add this regularization term to the loss?
- Why do we square the parameters (specific to L2 regularization)? Why not any other power?
- Is there any probabilistic evidence that justifies the effectiveness of regularization?

Turns out, there is a concrete probabilistic justification for regularization.

Yet, in my experience, most tutorials never bother to cover it, and readers are always expected to embrace these notions as a given.

Thus, this article intends to highlight the origin of regularization purely from a probabilistic perspective and provide a derivation that makes total logical and intuitive sense.

**Let’s begin!**

Before understanding the origin of the regularization term, it is immensely crucial to learn a common technique to model labeled data in machine learning.

It’s called the maximum likelihood estimation (MLE).

We covered this in the following article, but let’s do it again in more detail.

Essentially, whenever we model data, the model is instructed to maximize the likelihood of observing the given data $(X,y)$.

More formally, a model attempts to find a specific set of parameters $\theta$ (also called model weights), which maximizes the following function:

The above function $L$ is called the likelihood function, and in simple words, the above expression says that:

- maximize the likelihood of observing $y$
- given $X$
- when the prediction is parameterized by some parameters $\theta$ (also called weights)

When we begin modeling:

- We know $X$.
- We also know $y$.
- The only unknown is $\theta$, which we are trying to figure out.

Thus, the objective is to find the specific set of parameters $\theta$ that maximizes the likelihood of observing the data $(X,y)$.

This is commonly referred to as **maximum likelihood estimation (MLE)** in machine learning.

MLE is a method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.

It is a common approach for parameter estimation in various regression models.

Let’s understand it with the help of an example.

Imagine you walk into a kitchen and notice several broken eggshells on the floor.

Here’s a question: **“Which of these three events is more likely to have caused plenty of eggshells on the floor?”**

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was creating an art piece with them.

Which one do you think was more likely to have happened?

Let’s look at what could have led to eggshells on the floor with the highest likelihood.

The likelihood of:

- Conducting a science experiment that led to eggshells on the floor is MEDIUM.
- Baking a cake that led to eggshells on the floor is HIGH.
- Creating an art piece that led to eggshells on the floor is LOW.

Here, we’re going to go with the event with the highest likelihood, and that’s **“baking a cake.”**

Thus, we infer that the most likely thing that happened was that someone was baking a cake.

**What we did was that we maximized the conditional probability of the event, given an explanation.**

- The event is what we observed, i.e., eggshells on the floor.
- The “explanation,” as the name suggests, is the possible cause we came up with.

We did this because there’s a probability of:

- “Eggshells” given “Experiment”
- “Eggshells” given “Cake”
- “Eggshells” given “Art piece”

We picked the one with the highest conditional probability $P(event|explanation)$.

Simply put, we tried to find the scenario that most likely led to the eggshells on the floor. **This is called maximum likelihood.**

This is precisely what we do in machine learning at times. We have a bunch of data and several models that could have generated that data.

First, we estimate the probability of seeing the data given “Model 1”, “Model 2”, and so on. Next, we pick the model that most likely produced the data.

In other words, we’re maximizing the probability of the data given the model.

Next, let’s consider an example of using MLE for simple linear regression.

Consider we have the following data:

The candidate models that could have generated the above data are:

It’s easy to figure out that **Model 2** would have most likely generated the above data.

**But how does a line generate points?**

Let’s understand this specific to linear regression.

We also discussed this more from a technical perspective in the linear regression assumptions article, but let’s understand visually this time.

Consider the diagram below.

Say you have the line shown in Step 1, and you select data points on this line.

Recalling the data generation process from the above article, we discussed that linear regression generates data from a Gaussian distribution with equal variance.

Thus, in step 2, we create Gaussians with equal variance centered at those points.

Lastly, in step 3, we sample points from these Gaussians.

And that is how a line generates data points.
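The three steps above can be simulated in a few lines (the line's coefficients, noise level, and seed below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: pick points on a line y = theta0 + theta1 * x
theta0, theta1 = 1.0, 2.0
x = np.linspace(0, 10, 50)
mean_y = theta0 + theta1 * x

# Steps 2 & 3: center a Gaussian of equal variance at each point
# on the line and sample one observation from each
sigma = 1.5
y = rng.normal(loc=mean_y, scale=sigma)  # the observed data points
```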

Now, in reality, we never get to see the line that produced the data. Instead, it’s the data points that we observe.

**Thus, the objective boils down to finding the line that most likely produced these points.**

We can formulate this as a maximum likelihood estimation problem by maximizing the likelihood of generating the data points given a model (or line).

Let’s look at the likelihood of generating a specific point from a Gaussian:

Similarly, we can write the likelihood of generating the entire data as the product of individual likelihoods:

Substituting the likelihood values from the Gaussian distribution, we get:

Maximizing the above expression is the same as minimizing the squared sum term in the exponent.

And this is precisely the least squares error.

Linear regression finds the line that minimizes the sum of squared distances.

This proves that finding the line that most likely produces a point using maximum likelihood is exactly the same as minimizing the least square error using linear regression.
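A small sketch makes the equivalence tangible: the Gaussian log-likelihood is a constant minus the sum of squared errors scaled by $\frac{1}{2\sigma^2}$, so ranking candidate predictions by likelihood is the same as ranking them by squared error (the helper name is mine):

```python
import numpy as np

def gaussian_log_likelihood(y, y_hat, sigma=1.0):
    """Log-likelihood of the observations y under Gaussians of equal
    variance centered on the line's predictions y_hat."""
    n = len(y)
    # Constant term (independent of y_hat) minus the scaled squared error
    return (-n / 2 * np.log(2 * np.pi * sigma**2)
            - np.sum((y - y_hat) ** 2) / (2 * sigma**2))
```

Since the first term does not depend on the predictions, maximizing this quantity over candidate lines is exactly minimizing the sum of squared errors.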

So far, we have learned how we estimate the parameters using maximum likelihood estimation.

We also looked at the example of eggshells, and it helped us intuitively understand what MLE is all about.

Let’s recall it.

This time, let’s replace “Art piece” with an “Eggshell throwing contest.”

So yet again, we have three possibilities that may have led to the observed event.

- Someone was experimenting with them.
- Someone was baking a cake.
- Someone was playing an Eggshell throwing contest.

Which one do you think was more likely to have happened?

It’s obvious that baking a cake will still lead to eggshells on the floor with a high probability.

But an eggshell-throwing contest is practically certain to create eggshells on the floor.

**However, something tells us that baking a cake is still more likely to have happened and not the contest, isn’t it?**

One reason is that an eggshell-throwing contest is not very likely – it’s an improbable event.

Thus, even though it’s more likely to have generated the evidence, it’s less likely to have happened in the first place. This means we should still declare “baking a cake” as the more likely event.

But what lets us formally conclude that baking a cake is still the more likely explanation?

**It’s regularization.**

In this context, regularization lets us add an extra penalty, which quantifies the probability of the event itself.

Even though our contest will undoubtedly produce the evidence, it is not very likely to have happened in the first place.

This lets us still overweigh “baking a cake” and select it as the most likely event.

This is the intuition behind **regularization**.
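In probabilistic terms, this intuition is just Bayes’ rule: the most likely event given the evidence weighs how well the event explains the evidence against how probable the event was to begin with:

```latex
\underbrace{P(\text{event} \mid \text{evidence})}_{\text{what we want}}
\;\propto\;
\underbrace{P(\text{evidence} \mid \text{event})}_{\text{likelihood}}
\;\cdot\;
\underbrace{P(\text{event})}_{\text{prior (the extra penalty)}}
```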

Now, at this point, we haven’t yet understood the origin of the regularization term, nor are we close to understanding its mathematical formulation.

Specifically talking about L2 regularization, there are a few questions that we need to answer:

- Where did the squared term come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

The same questions apply to L1 regularization as well.

- Where did the absolute sum of parameters come from?
- What does this term measure probabilistically?
- How can we be sure that the term we typically add is indeed the penalty term that makes the most probabilistic sense?

Let’s delve into the mathematics of regularization now.

KMeans is an unsupervised clustering algorithm that groups data based on distances. It is widely recognized for its simplicity and effectiveness as a clustering algorithm.

Essentially, the core idea is to partition a dataset into distinct clusters, with each point belonging to the cluster whose centroid is closest to it.

While its simplicity often makes it the most preferred clustering algorithm, KMeans comes with its own set of limitations that can restrict its performance in certain situations.

One of the primary limitations is its assumption of spherical clusters.

One intuitive and graphical way to understand the KMeans algorithm is to place a **circle** at the center of each cluster, which encloses the points.

💡

In 3 dimensions, circles can be replaced with spheres. In higher dimensions, they can be thought of as hyperspheres.

As KMeans is all about placing circles, its results aren’t ideal when the dataset has irregular shapes or varying sizes, as shown below:

Instead, an ideal clustering should cluster the data as follows:

This rigidity of KMeans, which can only model globular clusters, often leads to misclassification and suboptimal cluster assignments.
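As a quick illustration (the dataset and parameters are chosen for demonstration), the classic two-moons dataset has two obvious clusters that are anything but globular. KMeans, whose decision boundary is the perpendicular bisector of the two centroids, cuts straight through both moons:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two clusters, but not spherical ones.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moon labels (accounting for label swap).
# KMeans' straight boundary misassigns the tips of each moon.
acc = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement with true moon labels: {acc:.2f}")
```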

This limitation is somewhat connected to the one we discussed above.

Imagine we have the following dataset:

Clearly, the blue cluster has a larger spread.

Therefore, ideally, its influence should be larger as well.

However, when assigning a new data point to a cluster, KMeans only considers the distance to the centroid.

This means that the margin it creates for assigning new data points to either cluster lies equidistant from both centroids.

But considering the area of influence of the right cluster, having this margin more to the left makes more sense.

KMeans clustering performs hard assignments.

In simple words, this means that **a specific data point can belong to only one cluster**.

Thus, it does not provide probabilistic estimates of a given data point belonging to each possible cluster.

Although this might not be a problem per se, it limits the usefulness of KMeans in uncertainty estimation and in downstream applications that require a probabilistic interpretation.
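A small sketch makes the contrast concrete (toy blobs with assumed parameters): for a point midway between two overlapping clusters, KMeans returns a single all-or-nothing label, while scikit-learn's GaussianMixture reports a probability for each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping toy blobs; a point midway between them is genuinely ambiguous.
X, _ = make_blobs(n_samples=300, centers=[[-1.5, 0], [1.5, 0]],
                  cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

point = np.array([[0.0, 0.0]])     # roughly equidistant from both centers

hard = kmeans.predict(point)       # hard assignment: one label, no nuance
soft = gmm.predict_proba(point)    # soft assignment: a probability per cluster

print("hard assignment:", hard)
print("soft assignment:", soft.round(2))
```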

These limitations often make KMeans a non-ideal choice for clustering.

Therefore, learning about other better algorithms that we can use to address these limitations is extremely important.

**Therefore, in this article, we will learn about Gaussian mixture models.**

More specifically, we shall cover:

- Shortcomings of KMeans (already covered above)
- What is the motivation behind GMMs?
- How do GMMs work?
- The intuition behind GMMs.
- Plotting some dummy multivariate Gaussian distributions to better understand GMMs.
- The entire mathematical formulation of GMMs.
- How to use Expectation-Maximization to model data using GMMs?
- **Coding a GMM from scratch (without sklearn).**
- Comparing results of GMMs with KMeans.
- How to determine the optimal number of clusters for GMMs?
- Some practical use cases of GMMs.
- Takeaways.

Let's begin!

As the name suggests, a **Gaussian mixture model** clusters a dataset that has a mixture of multiple Gaussian distributions.

They can be thought of as a generalized twin of KMeans.

**Simply put, in 2 dimensions, while KMeans can only create circular clusters, Gaussian mixture models can create oval-shaped clusters.**
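A quick sketch of what "oval" means in practice (single-cluster toy data with an assumed stretch): with `covariance_type="full"`, scikit-learn's GaussianMixture learns a full covariance matrix, so the fitted cluster can be elongated along any direction rather than forced into a circle:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# One spherical toy blob, stretched 3x along x and squeezed to 0.5x along y.
X, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0, random_state=0)
X = X @ np.array([[3.0, 0.0], [0.0, 0.5]])

gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X)

cov = gmm.covariances_[0]
# The learned covariance is far from a scaled identity: the fitted cluster
# is an oval, with much more variance along x than along y.
print(cov.round(2))
```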

But how do they do that?

Let's understand in more detail.

]]>