Evaluation: Model Benchmarks and LLM Application Assessment
LLMOps Part 10: Understanding model benchmarks, LLM application evaluation, and tooling.
69 posts published
LLMOps Part 9: A foundational guide to the evaluation of LLM applications, covering challenges and a practical taxonomy of evaluation methods.
LLMOps Part 8: A concise overview of memory, dynamic and temporal context in LLM systems, covering short- and long-term memory, dynamic context injection, and some of the common context failure modes in agentic applications.
LLMOps Part 7: A conceptual overview of context engineering, covering context types, context construction principles, and retrieval-centric techniques for building high-signal inputs.
LLMOps Part 6: Exploring prompt versioning, defensive prompting, and techniques such as verbalized sampling, role prompting, and more.
LLMOps Part 5: An introduction to prompt engineering (a subset of context engineering), covering prompt types, the prompt development workflow, and key techniques in the field.
LLMOps Part 4: An exploration of key decoding strategies, sampling parameters, and the general lifecycle of LLM-based applications.
LLMOps Part 3: A focused look at the core ideas behind the attention mechanism, transformer and mixture-of-experts architectures, and model pretraining and fine-tuning.
LLMOps Part 2: A detailed walkthrough of tokenization, embeddings, and positional representations, building the foundational translation layer that enables LLMs to process and reason over text.
AI Agents Crash Course—Part 17 (with implementation).
AI Agents Crash Course—Part 16 (with implementation).
LLMOps Part 1: An overview of AI engineering and LLMOps, and the core dimensions that define modern AI systems.
A comprehensive guide to Opik, an open-source LLM evaluation and observability framework.
MLOps Part 18: A hands-on guide to CI/CD in MLOps with DVC, Docker, GitHub Actions, and GitOps-based Kubernetes delivery on Amazon EKS.
MLOps Part 17: ML monitoring in practice with Evidently, Prometheus and Grafana, stitched into a FastAPI inference service with drift reports, metrics scraping, and dashboards.
AI Agents Crash Course—Part 15 (with implementation).
MLOps Part 16: A comprehensive overview of drift detection using statistical techniques, and how logging and observability keep ML systems healthy.
MLOps Part 15: Understanding the EKS lifecycle, getting hands-on with AWS setup, and deploying a simple ML inference service on Amazon EKS.
MLOps Part 14: Understanding the AWS cloud platform, and zooming in on Amazon EKS.
MLOps Part 13: An overview of cloud concepts that matter, from virtualization and storage choices to VPC, load balancing, identity, and observability.
MLOps Part 12: An introduction to Kubernetes, plus a practical walkthrough of deploying a simple FastAPI inference service using Kubernetes.
MLOps Part 11: A practical guide to taking models beyond notebooks, exploring serialization formats, containerization, and serving predictions using REST and gRPC.
MLOps Part 10: A comprehensive guide to model compression covering knowledge distillation, low-rank factorization, and quantization, followed by ONNX and ONNX Runtime as the bridge from training frameworks to fast, portable production inference.
MLOps Part 9: A deep dive into model fine-tuning and compression, specifically pruning and related improvements.
MLOps Part 8: A systems-first guide to model development and performance optimization through disciplined hyperparameter tuning.
MLOps Part 7: An applied look at distributed data processing with Spark, and workflow orchestration and scheduling with Prefect.
MLOps Part 6: A deep dive into sampling, class imbalance, and data leakage; plus a hands-on Feast feature store demo.
MLOps Part 5: A detailed walkthrough of data engineering for MLOps, covering data sources, format performance trade-offs, and ETL/ELT pipelines.
MLOps Part 4: A practical walkthrough of W&B-powered reproducibility.
MLOps Part 3: A practical exploration of reproducibility and versioning, covering deterministic training, data and model versioning, and experiment tracking.