Evaluation: Multi-turn Conversations, Tool Use, Tracing, and Red Teaming
LLMOps Part 11: Understanding evaluation of conversational LLM systems, tool evaluations, tracing with Langfuse, and automated red teaming.
LLMOps Part 10: Understanding model benchmarks, LLM application evaluation, and tooling.
LLMOps Part 9: A foundational guide to the evaluation of LLM applications, covering challenges and a practical taxonomy of evaluation methods.
LLMOps Part 8: A concise overview of memory, dynamic and temporal context in LLM systems, covering short and long-term memory, dynamic context injection, and some of the common context failure modes in agentic applications.
LLMOps Part 7: A conceptual overview of context engineering, covering context types, context construction principles, and retrieval-centric techniques for building high-signal inputs.
LLMOps Part 6: Exploring prompt versioning, defensive prompting, and techniques such as verbalized sampling, role prompting and more.
LLMOps Part 5: An introduction to prompt engineering (a subset of context engineering), covering prompt types, the prompt development workflow, and key techniques in the field.
LLMOps Part 4: An exploration of key decoding strategies, sampling parameters, and the general lifecycle of LLM-based applications.
LLMOps Part 3: A focused look at the core ideas behind the attention mechanism, transformer and mixture-of-experts architectures, and model pretraining and fine-tuning.
Tools, prompts and resources form the three core capabilities of the MCP framework. Capabilities are essentially the features or functions that the server makes available.
* Tools: Executable actions or functions that the AI (host/client) can invoke (often with side effects or external API calls).
* Resources: Read-only data sources
At its heart, MCP follows a client-server architecture (much like the web or other network protocols). However, the terminology is tailored to the AI context. There are three main roles to understand: the Host, the Client, and the Server.
Host
The Host is the user-facing AI application, the
Without MCP, adding a new tool or integrating a new model was a headache. If you had three AI applications and three external tools, you might end up writing nine different integration modules (each AI x each tool) because there was no common standard. This doesn’t scale. Developers of
Imagine you only know English. To get info from a person who only knows:
* French, you must learn French.
* German, you must learn German.
* And so on.
In this setup, learning even 5 languages will be a nightmare for you. But what if you add a translator that understands all
LLMOps Part 2: A detailed walkthrough of tokenization, embeddings, and positional representations, building the foundational translation layer that enables LLMs to process and reason over text.
AI Agents Crash Course—Part 17 (with implementation).
AI Agents Crash Course—Part 16 (with implementation).
LLMOps Part 1: An overview of AI engineering and LLMOps, and the core dimensions that define modern AI systems.
A comprehensive guide to Opik, an open-source LLM evaluation and observability framework.
MLOps Part 18: A hands-on guide to CI/CD in MLOps with DVC, Docker, GitHub Actions, and GitOps-based Kubernetes delivery on Amazon EKS.
MLOps Part 17: ML monitoring in practice with Evidently, Prometheus and Grafana, stitched into a FastAPI inference service with drift reports, metrics scraping, and dashboards.
AI Agents Crash Course—Part 15 (with implementation).
...explained with usage.
MLOps Part 16: A comprehensive overview of drift detection using statistical techniques, and how logging and observability keep ML systems healthy.
MLOps Part 15: Understanding the EKS lifecycle, getting hands-on with AWS setup, and deploying a simple ML inference service on Amazon EKS.
MLOps Part 14: Understanding the AWS cloud platform, and zooming into EKS.