Escaping POC Purgatory: Evaluation-Driven Development for AI Systems

Let’s be real: building LLM applications today feels like purgatory. Someone hacks together a quick demo with ChatGPT and LlamaIndex. Leadership gets excited. “We can answer any question about our docs!” But then… reality hits. The system is inconsistent, slow, hallucinating—and that amazing demo starts collecting digital dust. We call this “POC Purgatory”—that frustrating limbo where you’ve built something cool but can’t quite turn it into something real.

We’ve seen this across dozens of companies, and the teams that break out of this trap all adopt some version of Evaluation-Driven Development (EDD), where testing, monitoring, and evaluation drive every decision from the start.

The truth is, we’re in the earliest days of understanding how to build robust LLM applications. Most teams approach this like traditional software development but quickly discover it’s a fundamentally different beast. Check out the graph below—see how excitement for traditional software builds steadily while GenAI starts with a flashy demo and then hits a wall of challenges?


Traditional versus GenAI software: Excitement builds steadily—or crashes after the demo.

What makes LLM applications so different? Two big things:

  1. They bring the messiness of the real world into your system through unstructured data.
  2. They’re fundamentally nondeterministic—we call it the “flip-floppy” nature of LLMs: same input, different outputs (see the short sketch after this list). What’s worse: Inputs are rarely exactly the same. Tiny changes in user queries, phrasing, or surrounding context can lead to wildly different results.
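To see what this looks like in practice, here is a minimal sketch, assuming the OpenAI Python SDK and an illustrative model name, that sends the same prompt five times and counts how many distinct answers come back:

```python
# Minimal sketch: same prompt, several calls, potentially different outputs.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()
prompt = "In one sentence, what is our refund policy for annual plans?"

answers = set()
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # default-style sampling; even temperature=0 is not a hard guarantee
    )
    answers.add(response.choices[0].message.content.strip())

# With sampling enabled, this will often print more than one distinct answer.
print(f"{len(answers)} distinct answers out of 5 identical calls")
```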

This creates a whole new set of challenges that traditional software development approaches simply weren’t designed to handle. When your system is both ingesting messy real-world data AND producing nondeterministic outputs, you need a different approach.

The way out? Evaluation-driven development: A systematic approach where continuous testing and assessment guide every stage of your LLM application’s lifecycle. This isn’t anything new. People have been building data products and machine learning products for the past couple of decades. The best practices in those fields have always centered around rigorous evaluation cycles. We’re simply adapting and extending these proven approaches to address the unique challenges of LLMs.

We’ve been working with dozens of companies building LLM applications, and we’ve noticed patterns in what works and what doesn’t. In this article, we’re going to share an emerging SDLC for LLM applications that can help you escape POC Purgatory. We won’t be prescribing specific tools or frameworks (those will change every few months anyway) but rather the enduring principles that can guide effective development regardless of which tech stack you choose.

Throughout this article, we’ll explore real-world examples of LLM application development and then consolidate what we’ve learned into a set of first principles—covering areas like nondeterminism, evaluation approaches, and iteration cycles—that can guide your work regardless of which models or frameworks you choose.

FOCUS ON PRINCIPLES, NOT FRAMEWORKS (OR AGENTS)

A lot of people ask us: What tools should I use? Which multiagent frameworks? Should I be using multiturn conversations or LLM-as-judge?

Of course, we have opinions on all of these, but we think those aren’t the most useful questions to ask right now. We’re betting that lots of tools, frameworks, and techniques will disappear or change, but there are certain principles in building LLM-powered applications that will remain.

We’re also betting that this will be a time of software development flourishing. With the advent of generative AI, there’ll be significant opportunities for product managers, designers, executives, and more traditional software engineers to contribute to and build AI-powered software. One of the great aspects of the AI Age is that more people will be able to build software.

We’ve been working with dozens of companies building LLM-powered applications and have started to see clear patterns in what works. We’ve taught this SDLC in a live course with engineers from companies like Netflix, Meta, and the US Air Force—and recently distilled it into a free 10-email course to help teams apply it in practice.

IS AI-POWERED SOFTWARE ACTUALLY THAT DIFFERENT FROM TRADITIONAL SOFTWARE?

When building AI-powered software, the first question is: Should my software development lifecycle be any different from a more traditional SDLC, where we build, test, and then deploy?


Traditional software development: Linear, testable, predictable

AI-powered applications introduce more complexity than traditional software in several ways:

  1. The introduction of the entropy of the real world into the system through data.
  2. The introduction of nondeterminism or stochasticity into the system: The most obvious symptom here is what we call the flip-floppy nature of LLMs—that is, you can give an LLM the same input and get two different results.
  3. The cost of iteration—in compute, staff time, and ambiguity around product readiness.
  4. The coordination tax: LLM outputs are often evaluated by nontechnical stakeholders (legal, brand, support) not just for functionality, but for tone, appropriateness, and risk. This makes review cycles messier and more subjective than in traditional software or ML.

What breaks your app in production isn’t always what you tested for in dev!

This inherent unpredictability is precisely why evaluation-driven development becomes essential: Rather than an afterthought, evaluation becomes the driving force behind every iteration.

Evaluation is the engine, not the afterthought.

The first property is something we saw with data and ML-powered software. What this meant was the emergence of a new stack for ML-powered app development, often referred to as MLOps. It also meant three things:

  • Software was now exposed to a potentially large amount of messy real-world data.
  • ML apps needed to be developed through cycles of experimentation (as we’re no longer able to reason about how they’ll behave based on software specs).
  • The skillset and the background of people building the applications were realigned: People who were at home with data and experimentation got involved!

Now with LLMs, AI, and their inherent flip-floppiness, an array of new issues arises:

  • Nondeterminism: How can we build reliable and consistent software using models that are nondeterministic and unpredictable?
  • Hallucinations and forgetting: How can we build reliable and consistent software using models that both forget and hallucinate?
  • Evaluation: How do we evaluate such systems, especially when outputs are qualitative, subjective, or hard to benchmark?
  • Iteration: We know we need to experiment with and iterate on these systems. How do we do so?
  • Business value: Once we have a rubric for evaluating our systems, how do we tie our macro-level business value metrics to our micro-level LLM evaluations? This becomes especially difficult when outputs are qualitative, subjective, or context-sensitive—a challenge we saw in MLOps, but one that’s even more pronounced in GenAI systems.

Beyond the technical challenges, these complexities also have real business implications. Hallucinations and inconsistent outputs aren’t just engineering problems—they can erode customer trust, increase support costs, and lead to compliance risks in regulated industries. That’s why integrating evaluation and iteration into the SDLC isn’t just good practice, it’s essential for delivering reliable, high-value AI products.

A TYPICAL JOURNEY IN BUILDING AI-POWERED SOFTWARE

In this section, we’ll walk through a real-world example of an LLM-powered application struggling to move beyond the proof-of-concept stage. Along the way, we’ll explore:

  • Why defining clear user scenarios and understanding how LLM outputs will be used in the product prevents wasted effort and misalignment.
  • How synthetic data can accelerate iteration before real users interact with the system.
  • Why early observability (logging and monitoring) is crucial for diagnosing issues.
  • How structured evaluation methods move teams beyond intuition-driven improvements.
  • How error analysis and iteration refine both LLM performance and system design.

By the end, you’ll see how this team escaped POC purgatory—not by chasing the perfect model, but by adopting a structured development cycle that turned a promising demo into a real product.

You’re not launching a product: You’re launching a hypothesis.

At its core, this case study demonstrates evaluation-driven development in action. Instead of treating evaluation as a final step, we use it to guide every decision from the start—whether choosing tools, iterating on prompts, or refining system behavior. This mindset shift is critical to escaping POC purgatory and building reliable LLM applications.

POC PURGATORY

Every LLM project starts with excitement. The real challenge is making it useful at scale.

The story doesn’t always start with a business goal. Recently, we helped an EdTech startup build an information-retrieval app.1 Someone realized they had tons of content a student could query. They hacked together a prototype in ~100 lines of Python using OpenAI and LlamaIndex. Then they slapped on a tool to search the web, saw low retrieval scores, called it an “agent,” and called it a day. Just like that, they landed in POC purgatory—stuck between a flashy demo and working software.

They tried various prompts and models and, based on vibes, decided some were better than others. They also realized that, although LlamaIndex was cool to get this POC out the door, they couldn’t easily figure out what prompt it was throwing to the LLM, what embedding model was being used, the chunking strategy, and so on. So they let go of LlamaIndex for the time being and started using vanilla Python and basic LLM calls. They used some local embeddings and played around with different chunking strategies. Some seemed better than others.
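For context, here is a minimal sketch of what that vanilla-Python setup might look like, assuming a local sentence-transformers embedding model and the OpenAI SDK; the chunk size, model names, and helper functions are illustrative, not the startup’s actual code:

```python
# Minimal RAG sketch in plain Python: explicit chunking, local embeddings,
# cosine-similarity retrieval, and a fully visible prompt.
# All names and parameters here are illustrative assumptions.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model
client = OpenAI()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking; one of several strategies to compare."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(docs: list[str]):
    chunks = [c for d in docs for c in chunk(d)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks, vectors, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, chunks, vectors) -> str:
    context = "\n\n".join(retrieve(query, chunks, vectors))
    # The prompt is now visible and easy to iterate on.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The point of tearing out the framework isn’t that frameworks are bad; it’s that every moving part—prompt, embedding model, chunking strategy—is now something you can see, log, and evaluate.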

EVALUATING YOUR MODEL WITH VIBES, SCENARIOS, AND PERSONAS

Before you can evaluate an LLM system, you need to define who it’s for and what success looks like.

They then decided to try to formalize some of these “vibe checks” into an evaluation framework (commonly called a “harness”), which they could use to test different versions of the system. But wait: What do they even want the system to do? Who do they want to use it? Eventually, they want to roll it out to students, but perhaps a first goal would be to roll it out internally.

Vibes are a fine starting point—just don’t stop there.

We asked them:

  1. Who are you building it for?
  2. In what scenarios do you see them using the application?
  3. How will you measure success?

The answers were:

  1. Our students.
  2. Any scenario in which a student is looking for information that the corpus of documents can answer.
  3. If the student finds the interaction helpful.

The first answer came easily, the second was a bit more challenging, and the team didn’t even seem confident with their third answer. What counts as success depends on who you ask.

We suggested:

  1. Keeping the goal of building it for students but orienting first around whether internal staff find it useful before rolling it out to students.
  2. Restricting the first goals of the product to something actually testable, such as giving helpful answers to FAQs about course content, course timelines, and instructors.
  3. Keeping the goal of finding the interaction helpful but recognizing that this contains a lot of other concerns, such as clarity, concision, tone, and correctness.

So now we have a user persona, several scenarios, and a way to measure success.
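One lightweight way to pin this down is to write the persona, scenarios, and success criteria into a small, reviewable spec that engineers and nontechnical stakeholders can both edit; the structure below is an illustrative sketch, not a required format:

```python
# Illustrative sketch: capture persona, scenarios, and success criteria as
# reviewable data before building an eval harness around them.
EVAL_SPEC = {
    "persona": "Internal staff first, then students",
    "scenarios": [
        "FAQs about course content",
        "FAQs about course timelines",
        "FAQs about instructors",
    ],
    "success_criteria": {
        "helpful": "SME judges the answer helpful for the scenario",
        "sub_criteria": ["clarity", "concision", "tone", "correctness"],
    },
}
```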

SYNTHETIC DATA FOR YOUR LLM FLYWHEEL

Why wait for real users to generate data when you can bootstrap testing with synthetic queries?

With traditional, or even ML, software, you’d then usually try to get some people to use your product. But we can also use synthetic data—starting with a few manually written queries, then using LLMs to generate more based on user personas—to simulate early usage and bootstrap evaluation.

So we did that. We had them generate ~50 queries. To do this, we needed logging, which they already had, and we needed visibility into the traces (prompt + response). We also wanted nontechnical SMEs in the loop.
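As a sketch of that bootstrapping step (again assuming the OpenAI SDK; the persona wording, model name, and JSONL trace format are illustrative), synthetic query generation and trace logging can be as simple as:

```python
# Sketch: generate synthetic queries from a persona, then log traces
# (prompt + response) as JSONL for SME review. The persona text, model name,
# and file path are illustrative assumptions.
import json
import time
from openai import OpenAI

client = OpenAI()

PERSONA = "A student looking for answers about course content, timelines, and instructors."

def generate_queries(n: int = 50) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write {n} realistic, varied questions this user might ask: {PERSONA}. "
                       "Return one question per line.",
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def log_trace(query: str, prompt: str, response: str, path: str = "traces.jsonl") -> None:
    """Append one interaction so SMEs (and the eval harness) can inspect it later."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(), "query": query, "prompt": prompt, "response": response,
        }) + "\n")
```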

Also, we’re now trying to develop our eval harness, so we need “some form of ground truth,” that is, examples of user queries + helpful responses.

This systematic generation of test cases is a hallmark of evaluation-driven development: Creating the feedback mechanisms that drive improvement before real users encounter your system.

Evaluation isn’t a stage, it’s the steering wheel.

LOOKING AT YOUR DATA, ERROR ANALYSIS, AND RAPID ITERATION

Logging and iteration aren’t just debugging tools, they’re the heart of building reliable LLM apps. You can’t fix what you can’t see.

To build trust with our system, we needed to confirm at least some of the responses with our own eyes. So we pulled them up in a spreadsheet and got our SMEs to label responses as “helpful or not” and to also give reasons.

Then we iterated on the prompt and noticed that it did well with course content but not as well with course timelines. Even this basic error analysis allowed us to decide what to prioritize next.
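Even a few lines of pandas over the exported labels can surface which scenarios are failing; here is a sketch, assuming a CSV export with hypothetical scenario, helpful, and reason columns:

```python
# Sketch: basic error analysis over SME labels exported from the spreadsheet.
# Column names ("scenario", "helpful", "reason") are illustrative assumptions;
# "helpful" is assumed to be a boolean column.
import pandas as pd

labels = pd.read_csv("sme_labels.csv")

# Helpful rate per scenario tells you where to focus next
# (e.g., strong on course content, weak on course timelines).
print(labels.groupby("scenario")["helpful"].mean().sort_values())

# The free-text reasons for unhelpful answers are where the real insight is.
for reason in labels.loc[~labels["helpful"], "reason"].head(20):
    print("-", reason)
```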

When playing around with the system, I tried a query that many people ask of LLM-powered IR systems but that few engineers think to handle: “What docs do you have access to?” RAG performs horribly with this most of the time. An easy fix for this involved engineering the system prompt.
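The fix can be as simple as telling the model what it can see. Here is a sketch of one way to engineer that into the system prompt; the corpus description below is a hypothetical example:

```python
# Sketch: handle "What docs do you have access to?" by describing the corpus
# in the system prompt instead of hoping retrieval saves you.
# The corpus summary is a hypothetical example.
CORPUS_SUMMARY = "course syllabi, lesson content, course timelines, and instructor bios"

SYSTEM_PROMPT = f"""You answer student questions using a document corpus containing {CORPUS_SUMMARY}.
If asked what documents or information you have access to, describe the corpus above directly
instead of running a search. If a question cannot be answered from the corpus, say so."""
```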

Essentially, what we did here was:

  • Build
  • Deploy (to only a handful of internal stakeholders)
  • Log, monitor, and observe
  • Evaluate and analyze errors
  • Iterate

Now it didn’t involve rolling out to external users; it didn’t involve frameworks; it didn’t even involve a robust eval harness yet, and the system changes involved only prompt engineering. It involved a lot of looking at your data!2 We only knew how to change the prompts for the biggest effects by performing our error analysis.

What we see here, though, is the emergence of the first iterations of the LLM SDLC: We’re not yet changing our embeddings, fine-tuning, or business logic; we’re not using unit tests, CI/CD, or even a serious evaluation framework, but we’re building, deploying, monitoring, evaluating, and iterating!


In AI systems, evaluation and monitoring don’t come last—they drive the build process from day one

FIRST EVAL HARNESS

Evaluation must move beyond ‘vibes’: A structured, reproducible harness lets you compare changes reliably.

In order to build our first eval harness, we needed some ground truth: user queries paired with acceptable responses and sources.

To do this, we either needed SMEs to write acceptable responses + sources for user queries or needed our AI system to generate them and an SME to accept or reject each one. We chose the latter.

So we generated 100 user interactions and used the accepted ones as the test set for our evaluation harness. We tested retrieval quality (e.g., how well the system fetched relevant documents, measured with metrics like precision and recall), semantic similarity of responses, cost, and latency, and we also performed heuristic checks, such as length constraints, hedging versus overconfidence, and hallucination detection.
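Here is a sketch of what such a harness might look like once the accepted interactions are in place; the thresholds, field names, and helper callables (run_system, similarity, heuristics) are illustrative assumptions rather than any specific library’s API:

```python
# Sketch of a first eval harness: run each ground-truth example through the
# system, score it on several axes, and accept/reject against thresholds.
# Thresholds, field names, and the helper callables are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Example:
    query: str
    reference_answer: str
    reference_sources: list[str]

THRESHOLDS = {
    "retrieval_recall": 0.8,
    "semantic_similarity": 0.75,
    "max_latency_s": 5.0,
    "max_cost_usd": 0.01,
}

def evaluate(example: Example, run_system, similarity, heuristics) -> dict:
    result = run_system(example.query)  # assumed to return answer, sources, latency, cost
    retrieved = set(result["sources"])
    relevant = set(example.reference_sources)
    scores = {
        "retrieval_recall": len(retrieved & relevant) / max(len(relevant), 1),
        "retrieval_precision": len(retrieved & relevant) / max(len(retrieved), 1),
        "semantic_similarity": similarity(result["answer"], example.reference_answer),
        "latency_s": result["latency_s"],
        "cost_usd": result["cost_usd"],
        "heuristics_ok": heuristics(result["answer"]),  # length, hedging, hallucination checks
    }
    scores["accepted"] = (
        scores["retrieval_recall"] >= THRESHOLDS["retrieval_recall"]
        and scores["semantic_similarity"] >= THRESHOLDS["semantic_similarity"]
        and scores["latency_s"] <= THRESHOLDS["max_latency_s"]
        and scores["cost_usd"] <= THRESHOLDS["max_cost_usd"]
        and scores["heuristics_ok"]
    )
    return scores
```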

We then used thresholding of the above to either accept or reject a response. However, looking at why a response was rejected helped us iterate quickly.
