
AI-II: A Five-Step Program to Get Your Prompts into Shape

Insight 04: A structured approach to enterprise-grade prompting

May 13, 2026
5 MIN READ

Foreword

To further AI understanding and adoption in the insurance industry, Lazarus is producing a series of articles titled Artificial Intelligence – Insights for Insurance, or “AI-II.” In an earlier article (“The Importance and Evolving Role of Prompt Engineers”), we outlined how leaders should think about prompting as a discipline. This article builds on that foundation, focusing on enterprise-level prompting and translating experience into a structured, repeatable approach. The guidance here draws directly from the day-to-day work of Lazarus prompting expert Kelly Daniel.

Introduction

This article distills expert findings from practical prompting experience into a five-step program. Some elements will feel familiar to those experienced in technology implementation (e.g. testing, iteration, and controlled scaling). Others are specific to prompting, where small changes in structure or language can materially impact outcomes.


Foundational Principles

Let’s start with three principles that should guide all enterprise prompting efforts.

First Principle: Don’t get blinded by the light. 

Like all technology implementations, prompts must be tested and proven. The speed of generic language models can be misleading, resulting in the false assumption that enterprise-scale prompting requires little effort. In reality, generating responses quickly is trivial. Generating accurate, reliable responses at scale is not. Enterprise-grade prompting requires validation, testing, and a clear understanding of limitations. It’s crucial to make that effort.

Second Principle: Don’t jump into the deep end.

Start slowly and build incrementally. If working with a large document corpus, begin with a small, well-understood subset and select use cases with clear answers. Likewise, avoid edge cases early. Attempting to solve for complexity too soon often leads to wasted effort and time.

A key corollary of this principle: not every problem should be prompted. If an edge case can be solved cost-effectively and more reliably with existing approaches, introducing AI may add unnecessary complexity rather than value. Prompting should be applied intentionally, not indiscriminately.

Third Principle: Approach prompting as structured exploration.

Prompting is both an art and a science. Incorrect outputs are not failures but signals. In fact, understanding why a prompt fails often provides more insight than a success does. This requires a mindset of exploration, in which prompts are iteratively refined based on captured data.

A useful framing is to treat the model like a highly capable but literal assistant: clear, explicit direction is required for it to explore the problem and produce consistent results.


Five-Step Process to Get Your Prompts into Shape

Step 1: Test on One Document

Lazarus AI’s best practice is to begin with a single, representative document. 

The objective is to establish baseline performance, understand limitations, and identify areas for refinement before introducing variability. Going slower and focusing effort up front will save time and money in the long run. Rest assured, selecting the right starting point and judging the difficulty of use cases will get easier with experience.

Once there is success with the first document and a clear understanding of challenges, the scope can be expanded. Success at this stage should be defined pragmatically. If human interpretation of a document is inconsistent, expecting perfect model performance is unrealistic and will keep organizations stuck at Step 1. 

At the same time, this step requires deliberate effort. Prompting is not instantaneous. If meaningful progress is not made within a reasonable timeframe (e.g. over a full day), it may be necessary to reassess the initial use case.
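To make this concrete, the sketch below shows what a minimal single-document baseline test might look like in Python. The run_prompt() function, the sample prompt, and the sample document are illustrative placeholders, not part of any particular product or API; the point is to check consistency across repeated runs, not just a single lucky result.

```python
# Minimal single-document baseline test.
# run_prompt() is a hypothetical stand-in for your model interface;
# swap in your actual API call.

def run_prompt(prompt: str, document: str) -> str:
    # Placeholder so the sketch runs end to end; replace with a real call.
    return "POL-123456"

PROMPT = "Extract the policy number from the document below."
DOCUMENT = "Policy POL-123456, effective 2026-01-01, insured: Jane Doe."

# Run the identical prompt several times to gauge consistency.
outputs = [run_prompt(PROMPT, DOCUMENT) for _ in range(5)]

for i, out in enumerate(outputs, start=1):
    print(f"Run {i}: {out}")

# A stable baseline means identical (or near-identical) outputs across
# runs; variation here is itself a signal worth investigating.
print("Consistent across runs:", len(set(outputs)) == 1)
```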

Step 2: Expand Testing Slightly (3 to 5 Documents)

Once initial success is achieved, expand testing to a small set of documents. Lazarus' experience shows that this step can take much longer and often reveals overfitting, where prompts perform well on a single document but fail to generalize. Common causes include inadvertently starting with an edge case or structuring prompts too narrowly.

Running your prompt on a handful of different documents will help you make adjustments early before scaling further.
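A minimal sketch of this expanded test, again assuming a placeholder run_prompt() interface, might look like the following. The stub deliberately mimics a prompt that only handles one document layout, so viewing outputs side by side makes the generalization failure obvious.

```python
# Run one prompt across a small document set to surface overfitting.
# run_prompt() is a hypothetical stand-in for your model interface.

def run_prompt(prompt: str, document: str) -> str:
    # Placeholder so the sketch runs end to end; it mimics a brittle
    # prompt tuned to one layout. Replace with a real call.
    return document.split("effective ")[-1].split(",")[0]

PROMPT = "Extract the policy effective date from the document below."

# Three to five representative documents, not edge cases.
DOCUMENTS = {
    "doc_a": "Policy POL-1, effective 2026-01-01, insured: A. Smith.",
    "doc_b": "Policy POL-2, effective 2026-02-15, insured: B. Jones.",
    "doc_c": "Coverage begins 2026-03-01 under policy POL-3 for C. Lee.",
}

# Reviewing outputs side by side makes it obvious when a prompt works
# on one document but fails to generalize to the others.
for name, document in DOCUMENTS.items():
    print(f"{name}: {run_prompt(PROMPT, document)}")
```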

Step 3: Move to Small Batches

After successfully completing Step 2, begin scaling with small batches.

At this stage, focus shifts from individual outcomes to aggregate performance. Improvements should be evaluated across the dataset, not based on isolated success.

Prompt refinement becomes more sensitive. Minor changes in wording, structure, or sequencing can materially impact results, and a single change may improve some outputs while degrading others. For instance, a prompt that fixes the data extraction for one document may return inaccurate results for three others. While never welcome, this trade-off is expected.

Tracking aggregate prompt performance becomes critical at this step. Here, worry less about any one document unless it represents a central, uncompromisable use case. A structured approach, such as maintaining a dataset of expected outputs to automate scoring of results, enables more efficient iteration as scale increases.
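As one illustration of such a structured approach, the sketch below scores a batch of outputs against known correct answers. The inline dataset and the run_prompt() stub are assumptions standing in for a real answer file and model call; the key idea is reporting aggregate accuracy alongside the individual misses.

```python
# Score a batch of outputs against a dataset of expected values.
# run_prompt() and the inline dataset are illustrative assumptions;
# in practice the expected outputs would live in a file or table.

def run_prompt(prompt: str, document: str) -> str:
    # Placeholder so the sketch runs end to end; replace with a real call.
    return document.split("insured: ")[-1].rstrip(".")

PROMPT = "Extract the insured's name from the document below."

# Each entry pairs a document with its known correct answer.
DATASET = [
    ("Policy POL-1, effective 2026-01-01, insured: A. Smith.", "A. Smith"),
    ("Policy POL-2, effective 2026-02-15, insured: B. Jones.", "B. Jones"),
    ("Coverage for C. Lee begins 2026-03-01 under policy POL-3.", "C. Lee"),
]

correct, failures = 0, []
for document, expected in DATASET:
    output = run_prompt(PROMPT, document).strip()
    if output == expected:
        correct += 1
    else:
        failures.append((expected, output))

# Judge the prompt on aggregate accuracy, not on any one document.
print(f"Accuracy: {correct}/{len(DATASET)} ({correct / len(DATASET):.0%})")
for expected, got in failures:
    print(f"  MISS: expected {expected!r}, got {got!r}")
```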

Step 4: Interrogate Inaccurate Results

When trying to improve a prompt, focus on incorrect outputs.

Rather than immediately rewriting prompts, investigate how the model is interpreting the input. Asking direct questions (such as whether a specific phrase is present in the document) can reveal gaps in the model’s understanding or context.

Further, allowing the model to provide more verbose explanations can surface useful signals. These insights can then inform prompt restructuring, including rewording, reordering, or clarifying instructions.
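These diagnostic questions can be scripted so that every miss is interrogated the same way. The sketch below assumes a placeholder run_prompt() interface, and the question wording is illustrative rather than prescribed.

```python
# Interrogate a failing document instead of immediately rewriting
# the prompt. run_prompt() is a hypothetical stand-in for your
# model interface.

def run_prompt(prompt: str, document: str) -> str:
    # Placeholder so the sketch runs end to end; replace with a real call.
    return "(model response)"

FAILING_DOCUMENT = "Coverage begins 2026-03-01 under policy POL-3 for C. Lee."

# Direct checks reveal whether the model can even find the relevant
# text, or whether the failure lies in interpreting it.
DIAGNOSTICS = [
    "Does the phrase 'effective date' appear anywhere in this document? Answer yes or no.",
    "Quote, verbatim, the sentence that states when coverage begins.",
    "Explain step by step how you located the date coverage begins.",
]

for question in DIAGNOSTICS:
    print(f"Q: {question}")
    print(f"A: {run_prompt(question, FAILING_DOCUMENT)}\n")
```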

Step 5: Go Big or Go Home

At this point, prompts are ready for full-scale validation or for reassessment. If outputs are consistent, limitations are understood, and performance meets expectations, expand to full deployment and validation across the dataset.

The time and effort in this stage will vary greatly. Factors include the number of prompts and fields, the availability of known correct answers, the risk associated with the supported business processes, and the level of variability in document structure and data quality.

More structured datasets will validate more easily. Heterogeneous or unstructured data will require broader testing on larger document sets.

If results are not meeting expectations, a restart may be required. In most cases, this indicates that one or more foundational principles were not fully applied. However, restarts are not wasted effort. Each iteration improves prompt quality and accelerates subsequent cycles.

If a restart is needed, progress should be evaluated directionally. If accuracy improves meaningfully across iterations, continued refinement is justified. If results are inconsistent or deteriorating, external expertise may be required.
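Directional evaluation can be as simple as logging aggregate accuracy for each prompt iteration and checking the trend, as in this minimal sketch (the accuracy figures are placeholders, not real results):

```python
# Judge restart progress directionally: log aggregate accuracy per
# prompt iteration and check the trend.

iteration_accuracy = {"v1": 0.62, "v2": 0.71, "v3": 0.78}

scores = list(iteration_accuracy.values())
improving = all(b > a for a, b in zip(scores, scores[1:]))

for version, accuracy in iteration_accuracy.items():
    print(f"{version}: {accuracy:.0%}")

# Meaningful improvement across iterations justifies continued
# refinement; flat or deteriorating accuracy suggests it is time
# to seek outside expertise.
print("Trend:", "improving" if improving else "inconsistent or deteriorating")
```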


Summary

This article presents a five-step approach to successful enterprise prompting. As prompting continues to evolve, organizations that approach it systematically will reduce risk, cost, and time to value.


About Lazarus AI

Lazarus AI develops enterprise-grade AI systems for the insurance industry, public sector, and beyond. Our Applied Intelligence Engine (AIE) enables organizations to eliminate their processing bottlenecks and provides rapid time to value, allowing our customers to compete more effectively with reduced cost, lower risk, and greater speed.