A Free Mini Crash Course on AI Agents!

...powered by MCP + Tools + Memory + Observability.


We just released a free mini crash course on building AI Agents, which is a good starting point for anyone to learn about Agents and use them in real-world projects.

(Video walkthrough: 33:08)

Here's what it covers:

  • What is an AI Agent
  • Connecting Agents to tools
  • Overview of MCP
  • Replacing tools with MCP servers
  • Setting up observability and tracing

Everything is done with a 100% open-source tool stack.

It builds Agents based on the following definition:

An AI agent uses an LLM as its brain, has memory to retain context, and can take real-world actions through tools, like browsing the web, running code, etc.

In short, it thinks, remembers, and acts.

Here's an overview of the system we're building (see the sketch after this list):

  • User sends a query
  • Assistant runs a web search via MCP
  • Query + results go to Memory Manager
  • Memory Manager stores context in Graphiti
  • Response agent crafts the final answer
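To make the flow concrete, here's a rough, framework-agnostic sketch of that loop in Python. Every helper in it is a hypothetical placeholder rather than code from the repo; it only mirrors the five steps above:

```python
# Hypothetical sketch of the flow above. All helpers are placeholder stubs,
# mirroring: query -> MCP web search -> memory manager (Graphiti in the
# course) -> response agent.

def web_search_via_mcp(query: str) -> list[str]:
    """Stand-in for a web-search tool exposed by an MCP server."""
    return [f"(search result for: {query})"]


class MemoryManager:
    """Stand-in for the Graphiti-backed memory layer."""

    def __init__(self) -> None:
        self._facts: list[str] = []

    def store(self, query: str, results: list[str]) -> None:
        self._facts.append(query)
        self._facts.extend(results)

    def retrieve(self, query: str) -> str:
        return "\n".join(self._facts)


def respond(query: str, context: str) -> str:
    """Stand-in for the response agent's LLM call."""
    return f"Answer to '{query}' grounded in:\n{context}"


def handle_query(query: str, memory: MemoryManager) -> str:
    results = web_search_via_mcp(query)   # assistant calls the MCP search tool
    memory.store(query, results)          # memory manager persists the context
    context = memory.retrieve(query)      # ...and retrieves it for answering
    return respond(query, context)        # response agent crafts the final reply


print(handle_query("What's new in MCP?", MemoryManager()))
```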

The video above explains everything in detail.

You can find the entire code in this GitHub repo →

Let's learn more about LLM app evaluation below.

LLM evaluation

Create eval metrics for LLM apps in plain English

Standard metrics are usually not that helpful since LLMs can produce varying outputs while conveying the same message.

In fact, in many cases, it is also difficult to formalize an evaluation metric as deterministic code.

G-Eval is a task-agnostic LLM-as-a-judge metric in Opik that solves this.

The concept of LLM-as-a-judge involves using LLMs to evaluate and score various tasks and applications.

It lets you specify the criteria for your metric in plain English, after which it uses chain-of-thought prompting to create evaluation steps and return a score.

Let’s look at a demo below.

First, import the GEval class and define a metric in natural language:
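For illustration, here's a minimal sketch of that step. It assumes Opik's GEval accepts task_introduction and evaluation_criteria arguments (as shown in its docs), so verify the exact signature against the version you have installed:

```python
from opik.evaluation.metrics import GEval

# The metric is defined entirely in natural language.
faithfulness = GEval(
    task_introduction=(
        "You are an expert judge evaluating whether an AI-generated answer "
        "is faithful to the provided context."
    ),
    evaluation_criteria=(
        "The OUTPUT must not introduce any information that is not "
        "supported by the CONTEXT."
    ),
)
```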

Done!

Next, invoke the score method to generate a score and a reason for that score. Below, we have a related context and output, which leads to a high score:
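A sketch of what that call might look like, continuing the metric above (the context and output strings are made up for illustration, and the .value/.reason fields follow Opik's documented score-result object):

```python
# Related context and output -> we expect a high score.
result = faithfulness.score(
    output=(
        "CONTEXT: France is a country in Western Europe. Its capital is Paris, "
        "known for landmarks like the Eiffel Tower.\n"
        "OUTPUT: Paris is the capital of France."
    )
)

print(result.value)   # numeric score
print(result.reason)  # the judge's explanation for that score
```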

However, with unrelated context and output, we get a low score as expected:
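The mirror-image case, again with made-up strings:

```python
# Unrelated context and output -> we expect a low score.
result = faithfulness.score(
    output=(
        "CONTEXT: France is a country in Western Europe. Its capital is Paris.\n"
        "OUTPUT: The mitochondria is the powerhouse of the cell."
    )
)

print(result.value, result.reason)
```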

Under the hood, G-Eval first uses the task introduction and evaluation criteria to outline a set of evaluation steps.

Next, these evaluation steps are combined with the task to return a single score.

What's more, you can easily self-host Opik, so your data stays where you want it.

It integrates with nearly all popular frameworks, including CrewAI, LlamaIndex, LangChain, and Haystack.

If you want to dive deeper, we also published a practical guide on Opik to help you integrate evaluation and observability into your LLM apps (with implementation).

It has open access to all readers.

Start here: A Practical Guide to Integrate Evaluation and Observability into LLM Apps.

MCP

An MCP server to control Jupyter Notebooks in real time

(Video demo: 0:23)

Jupyter MCP Server lets you interact with Jupyter notebooks.

As shown in the video above, it lets you:

  • Create code cells
  • Execute code cells
  • Create markdown cells

The server currently offers 2 tools:

  • add_execute_code_cell: Add and execute a code cell in a Jupyter notebook.
  • add_markdown_cell: Add a markdown cell in a Jupyter notebook.

You can use this for complex use cases, like analyzing entire datasets just by giving Claude Desktop the file path.

It will use the MCP server to control the Jupyter Notebook and analyze the dataset.

A big advantage is that the cells are executed immediately during this interaction.

Thus, Claude knows if a cell ran successfully. If not, it can self-adjust.
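If you'd rather drive these tools programmatically than through Claude Desktop, a minimal sketch with the official MCP Python SDK could look like this; the launch command and the tool's argument name below are assumptions, so check the repo's README for the exact values:

```python
# Sketch: calling the Jupyter MCP Server's tools with the MCP Python SDK.
# The server command and the "cell_content" argument name are assumptions;
# see the repo's README for the real launch instructions.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(
        command="jupyter-mcp-server",  # hypothetical launch command
        args=[],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Add and immediately execute a code cell in the connected notebook
            result = await session.call_tool(
                "add_execute_code_cell",
                arguments={"cell_content": "import pandas as pd; print(pd.__version__)"},
            )
            print(result)


asyncio.run(main())
```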

GitHub repo →

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have covered several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.

Get in touch today →


Join the Daily Dose of Data Science Today!

A daily column with insights, observations, tutorials, and best practices on data science.

Get Started!