Agentic AI 102: Guardrails and Agent Evaluation

An introduction to tools that make your model safer, more predictable, and more performant.

Introduction

In the first post of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.

Of course, that first post touched only the surface of this new area of the data industry. There is so much more that can be done, and we are going to learn more along the way in this series.

So, it is time to take one step further.

In this post, we will cover three topics:

  1. Guardrails: safety blocks that prevent a Large Language Model (LLM) from responding to certain topics.
  2. Agent Evaluation: Have you ever wondered how accurate an LLM’s responses really are? I bet you have. We will look at the main ways to measure that.
  3. Monitoring: We will also learn about the built-in monitoring app in Agno’s framework.

We shall begin now.

Guardrails

Our first topic is the simplest, in my opinion. Guardrails are rules that will keep an AI agent from responding to a given topic or list of topics.

There is a good chance you have asked ChatGPT or Gemini something and received a response like “I can’t talk about this topic” or “Please consult a professional specialist”. Usually, that happens with sensitive subjects like health advice, psychological conditions, or financial advice.

Those blocks are safeguards to prevent people from hurting themselves, their health, or their wallets. As we know, LLMs are trained on massive amounts of text and inherit plenty of bad content along with it, which could easily translate into bad advice in those areas. And I didn’t even mention hallucinations!

Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because they read about it on the internet.

Well, I guess you got the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.

The best framework I found to impose those blocks is Guardrails AI [1]. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.

To get started quickly, first go to this link [2] and get an API key. Then, install the package and run the guardrails configure command shown below. The setup will ask a couple of questions that you can answer with n (for No), and it will then prompt you to enter the API key you generated.

pip install guardrails-ai
guardrails configure

Once that is completed, go to the Guardrails AI Hub [3] and choose one that you need. Every guardrail has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.

For this example, we’re choosing one called Restrict to Topic [4], which, as its name says, restricts the conversation to the topics on a list. So, go back to the terminal and install it using the command below.

guardrails hub install hub://tryolabs/restricttotopic

Next, let’s open our Python script and import some modules.

# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os

# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic

Next, we create the guard. We will restrict our agent to talking only about sports or the weather, and explicitly block it from talking about stocks.

# Setup Guard
guard = Guard().use(
    RestrictToTopic(
        valid_topics=["sports", "weather"],
        invalid_topics=["stocks"],
        disable_classifier=True,
        disable_llm=False,
        on_fail="filter"
    )
)

Now we can run the agent and the guard.

# Create agent
agent = Agent(
    model= Gemini(id="gemini-1.5-flash",
                  api_key = os.environ.get("GEMINI_API_KEY")),
    description= "An assistant agent",
    instructions= ["Be sucint. Reply in maximum two sentences"],
    markdown= True
    )

# Run the agent
response = agent.run("What's the ticker symbol for Apple?").content

# Validate the response with the guard
validation_step = guard.validate(response)

# Print validated response
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.validation_summaries[0].failure_reason)

This is the response when we ask about a stock symbol.

Validation Failed Invalid topics found: ['stocks']

If I ask about a topic that is not on the valid_topics list, I will also see a block.

"What's the number one soda drink?"
Validation Failed No valid topic was found.

Finally, let’s ask about sports.

"Who is Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of 
the greatest of all time.  He won six NBA championships with the Chicago Bulls.

And we saw a response this time, as it is a valid topic.
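
In practice, you will probably want to wrap the agent call and the validation into a single helper, so every response passes through the guard before reaching the user. Below is a minimal sketch that reuses the agent and guard we created above; the guarded_run function name and the fallback message are my own choices, not part of either library’s API.

# Helper that runs the agent and only returns responses that pass the guard.
# `guarded_run` and the fallback message are illustrative, not library API.
def guarded_run(agent, guard, prompt: str) -> str:
    response = agent.run(prompt).content       # raw answer from the LLM
    validation = guard.validate(response)      # apply the topic guardrail
    if validation.validation_passed:
        return response
    # Surface the failure reason instead of the blocked content
    return f"Blocked: {validation.validation_summaries[0].failure_reason}"

print(guarded_run(agent, guard, "Who is Michael Jordan?"))
print(guarded_run(agent, guard, "What's the ticker symbol for Apple?"))

This way, the guardrail check is not something you can forget to call: the agent’s output never reaches the user without being validated first.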

Let’s move on to the evaluation of agents now.

Agent Evaluation

Since I started studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional Data Science modeling, where you have well-defined metrics suited to each case, evaluating AI Agents is much blurrier.

Fortunately, the developer community is quick to find solutions for almost everything, and it has created a nice package for LLM evaluation: DeepEval.

DeepEval [5] is a library created by Confident AI that gathers many methods to evaluate LLMs and AI Agents. In this section, let’s learn a couple of the main methods, just so we can build some intuition on the subject, and also because the library is quite extensive.

The first evaluation is the most basic we can use, and it is called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they’re giving helpful and accurate responses. That’s where G-Eval from the DeepEval Python package comes in.

G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs on Gemini, and I am using an OpenAI model to assess it. Rather than relying on a human reviewer, this method asks an AI to “grade” another AI’s answers based on things like relevance, correctness, and clarity.

It’s a scalable way to test and improve generative AI systems. Let’s quickly code an example. We will import the modules, create a prompt and a simple chat agent, and ask it to describe the weather in NYC for the month of May.

# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Prompt
prompt = "Describe the weather in NYC for May"

# Create agent
agent = Agent(
    model= Gemini(id="gemini-1.5-flash",
                  api_key = os.environ.get("GEMINI_API_KEY")),
    description= "An assistant agent",
    instructions= ["Be sucint"],
    markdown= True,
    monitoring= True
    )

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)

It responds: “Mild, with average highs in the 60s°F and lows in the 50s°F. Expect some rain“.

Nice. Seems pretty good to me.

But how can we put a number on it and show a potential manager or client how our agent is doing?

Here is how:

  1. Create a test case passing the prompt and the response to the LLMTestCase class.
  2. Create a metric with the GEval class, giving it a criteria prompt that asks the judge model to check for coherence and explains what coherence means to me.
  3. Pass the actual output as evaluation_params.
  4. Run the measure method and get the score and reason from it.

# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)

# Setup the Metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)

The output looks like this.

0.9
The response directly addresses the prompt about NYC weather in May, 
maintains logical consistency, flows naturally, and uses clear language. 
However, it could be slightly more detailed.

0.9 seems pretty good, given that the default threshold is 0.5.
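
If you need an explicit pass/fail flag rather than just the raw score, you can set the threshold when you build the metric and check it afterwards. A minimal sketch, reusing the test_case from above (the 0.8 threshold is an arbitrary choice for illustration):

# Same metric, but with an explicit pass/fail threshold
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8  # require at least 0.8 instead of the default 0.5
)

coherence_metric.measure(test_case)
print(coherence_metric.score)            # e.g. 0.9
print(coherence_metric.is_successful())  # True if the score meets the threshold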

If you want to check the logs, use this next snippet.

# Check the logs
print(coherence_metric.verbose_logs)

Here’s the response.

Criteria:
Coherence. The agent can answer the prompt and the response makes sense.

Evaluation Steps:
[
    "Assess whether the response directly addresses the prompt; if it aligns,
 it scores higher on coherence.",
    "Evaluate the logical flow of the response; responses that present ideas
 in a clear, organized manner rank better in coherence.",
    "Consider the relevance of examples or evidence provided; responses that 
include pertinent information enhance their coherence.",
    "Check for clarity and consistency in terminology; responses that maintain
 clear language without contradictions achieve a higher coherence rating."
]
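
Those evaluation steps were generated by G-Eval from the criteria string I gave it. If you want more control over how the judge grades the answers, DeepEval also lets you pass your own evaluation_steps instead of a free-form criteria. A short sketch, where the step wording is just an example of what you might write:

# GEval with explicit evaluation steps instead of a single criteria string
coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=[
        "Check that the response directly addresses the prompt.",
        "Check that the ideas are presented in a clear, logical order.",
        "Penalize contradictions or unclear terminology."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)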

Very nice. Now let’s look at another interesting use case: evaluating task completion for AI Agents. In other words, how well does our agent do when it is asked to perform a task, and how much of that task can it actually deliver?

First, we are creating a simple agent that can access Wikipedia and summarize the topic of the query.

# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate

# Prompt
prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"

# Create agent
agent = Agent(
    model= Gemini(id="gemini-2.0-flash",
                  api_key = os.environ.get("GEMINI_API_KEY")),
    description= "You are a researcher specialized in searching the wikipedia.",
    tools= [WikipediaTools()],
    show_tool_calls= True,
    markdown= True,
    read_tool_call_history= True
    )

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)

The result looks very good. Let’s evaluate it using the TaskCompletionMetric class.

# Create a Metric
metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Test Case
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
    )

# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])

Output, including the agent’s response.

======================================================================

Metrics Summary

  - ✅ Task Completion (score: 1.0, threshold: 0.7, strict: False, 
evaluation model: gpt-4o-mini, 
reason: The system successfully searched for 'Time series analysis' 
on Wikipedia and provided a clear summary of the 3 main points, 
fully aligning with the user's goal., error: None)

For test case:

  - input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
  - actual output: Here are the 3 main points about Time series analysis based on the
 Wikipedia search:

1.  **Definition:** A time series is a sequence of data points indexed in time order,
 often taken at successive, equally spaced points in time.
2.  **Applications:** Time series analysis is used in various fields like statistics,
 signal processing, econometrics, weather forecasting, and more, wherever temporal 
measurements are involved.
3.  **Purpose:** Time series analysis involves methods for extracting meaningful 
statistics and characteristics from time series data, and time series forecasting 
uses models to predict future values based on past observations.

  - expected output: None
  - context: None
  - retrieval context: None

======================================================================

Overall Metric Pass Rates

Task Completion: 100.00% pass rate

======================================================================

✓ Tests finished
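
One last note on this example: since evaluate() takes lists of test cases and metrics, nothing stops you from scoring several metrics in a single run. A minimal sketch that combines the coherence metric from the previous section with the task-completion metric, reusing the test_case and metric objects defined above:

# Score the same test case with more than one metric in a single run
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(test_cases=[test_case], metrics=[metric, coherence_metric])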