At True Sparrow, we enjoy delving into the user persona, understanding their challenges, and brainstorming solutions to address them. Once a consensus is reached, we often provide low-fidelity demos to clients to gather feedback and assess whether we are on the right track. We continue this process until we maximize the value delivered to the target persona without introducing unnecessary complications to the experience.
In a project we did last year for Airstack, we implemented an AI assistant, enabling users to interact with GraphQL APIs using natural language. To evaluate the performance of a candidate prompt, we created over a hundred test cases, consisting of pairs of test inputs and expected outputs. As developers, we utilized the OpenAI Evals framework, which we extended for GraphQL support, to execute the test cases and assess the impact of prompt modifications.
We wondered, though: why should evaluating prompts require technical expertise? Non-technical people, particularly product managers, often bring domain knowledge that lets them contribute more to prompt modifications than technically skilled people who lack it. However, product managers are not always comfortable executing commands from the OpenAI Evals framework, and a more visual approach might suit them better.
In the following sections, I will walk you through our journey of solving the prompt evaluation problem for a product manager persona (or, more generally, a non-technical persona). We will also discuss the design patterns we used in the solution.
Evolving to the Solution
One thing was clear: we needed almost everything to be visual. Typing commands was not the way forward. To visualize the testing process, we had to break it into logical steps.
We needed to experiment with different prompt versions and evaluate their performance. We called the whole process of trying out a prompt on different test cases an experiment.
The prompt should have variables that take different values in different test cases. Because the prompt contains variables, we called it the prompt template.
A test case is a set of variable definitions (substituted into the prompt template) together with the expected output.
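To make the relationship between a prompt template and a test case concrete, here is a minimal sketch in Python. It is an illustration only, not the Prompt Evaluator's actual code: the `PromptTestCase` class, the `render_prompt` helper, the `$variable` placeholder syntax, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class PromptTestCase:
    """Hypothetical shape of a test case: variable values plus expected output."""
    variables: dict[str, str]  # values substituted into the prompt template
    expected_output: str       # what we expect the model to return


def render_prompt(template: str, test_case: PromptTestCase) -> str:
    """Fill the template's $placeholders with the test case's variables."""
    # substitute() raises KeyError if the template uses a variable
    # that the test case does not define.
    return Template(template).substitute(test_case.variables)


# Illustrative template and test case (the placeholder syntax is an assumption).
prompt_template = "Convert this request into a GraphQL query: $user_request"
case = PromptTestCase(
    variables={"user_request": "List the five most recent token transfers"},
    expected_output="<expected GraphQL query goes here>",
)

prompt = render_prompt(prompt_template, case)
print(prompt)
```

The same template can then be rendered once per test case, which is what lets a single experiment cover many inputs.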
To run an experiment, we ask the user which OpenAI model and which evaluation strategy to use; the chosen strategy determines which OpenAI eval is run. Here's a screenshot of the dialog box that opens when running an experiment for a prompt template.
The following flowchart explains the steps involved in an experiment run.
Applied Design Patterns
Design patterns facilitate the reuse of experience in software development, enhancing efficiency and promoting best practices for creating scalable and maintainable solutions in various contexts. We applied two design patterns in developing the Prompt Evaluator.
1. Strategy
The Strategy design pattern enables runtime selection of algorithms. In our case, we allowed users to choose evaluation strategies and OpenAI models in an experiment run.
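As a rough illustration of the pattern (not the project's real implementation), the sketch below keeps interchangeable evaluation functions in a registry and picks one at runtime by name. The strategy names `exact_match` and `includes`, and the `evaluate` helper, are hypothetical and chosen only for this example.

```python
from typing import Callable

# An evaluation strategy compares the model's output with the expected output.
EvalStrategy = Callable[[str, str], bool]


def exact_match(actual: str, expected: str) -> bool:
    """Pass only if the output equals the expected text (ignoring outer whitespace)."""
    return actual.strip() == expected.strip()


def includes(actual: str, expected: str) -> bool:
    """Pass if the expected text appears anywhere in the output."""
    return expected.strip() in actual


# Registry keyed by the name a user might pick in the run dialog
# (the names are assumptions for this sketch).
STRATEGIES: dict[str, EvalStrategy] = {
    "exact_match": exact_match,
    "includes": includes,
}


def evaluate(strategy_name: str, actual: str, expected: str) -> bool:
    """Look up the chosen strategy at runtime and apply it."""
    strategy = STRATEGIES[strategy_name]
    return strategy(actual, expected)


print(evaluate("exact_match", "query { tokens }", "query { tokens }"))
print(evaluate("includes", "Here you go: query { tokens }", "query { tokens }"))
```

Because the caller depends only on the `EvalStrategy` signature, adding a new strategy means adding one function and one registry entry, without touching the code that runs the experiment.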
2. Chain of Responsibility
The Chain of Responsibility design pattern was applied by assigning specific tasks to each component of the process: generating prompts, inferring from the model, evaluating responses, and logging results. This ensured a clear division of responsibilities throughout the experiment run.
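The four steps named above can be sketched as a chain of handlers, each doing its own task and then passing a shared context to the next link. This is a simplified sketch under stated assumptions: the class names, the context dictionary keys, the `$variable` template syntax, and the stand-in model function are all hypothetical, and no real OpenAI API call is made.

```python
from string import Template
from typing import Callable, Optional


class Handler:
    """One link in the chain: does its own step, then forwards the context."""

    def __init__(self) -> None:
        self._next: Optional["Handler"] = None

    def set_next(self, handler: "Handler") -> "Handler":
        self._next = handler
        return handler  # returning the next link allows chained set_next calls

    def handle(self, ctx: dict) -> dict:
        self.process(ctx)
        return self._next.handle(ctx) if self._next else ctx

    def process(self, ctx: dict) -> None:
        raise NotImplementedError


class GeneratePrompt(Handler):
    def process(self, ctx: dict) -> None:
        ctx["prompt"] = Template(ctx["template"]).substitute(ctx["variables"])


class InferFromModel(Handler):
    def __init__(self, model: Callable[[str], str]) -> None:
        super().__init__()
        self.model = model  # stand-in for a real model API call

    def process(self, ctx: dict) -> None:
        ctx["actual"] = self.model(ctx["prompt"])


class EvaluateResponse(Handler):
    def process(self, ctx: dict) -> None:
        ctx["passed"] = ctx["actual"].strip() == ctx["expected"].strip()


class LogResult(Handler):
    def __init__(self, log: list) -> None:
        super().__init__()
        self.log = log

    def process(self, ctx: dict) -> None:
        self.log.append({"prompt": ctx["prompt"], "passed": ctx["passed"]})


def fake_model(prompt: str) -> str:
    """Toy model used only for this sketch: upper-cases the prompt."""
    return prompt.upper()


log: list = []
chain = GeneratePrompt()
chain.set_next(InferFromModel(fake_model)).set_next(EvaluateResponse()).set_next(LogResult(log))

result = chain.handle({
    "template": "say $word",
    "variables": {"word": "hello"},
    "expected": "SAY HELLO",
})
print(result["passed"], log)
```

Each handler knows only about its own step and its successor, so a step can be swapped or a new one inserted without changing the others.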
Conclusion
We recognize and value the diverse strengths each team member brings to the table. We utilized the product manager's domain knowledge and compensated for their lack of technical expertise by offering a visual tool. This enabled them to actively contribute to the prompt design phase by experimenting with different variations effectively.
Our code for Prompt Evaluator is open source! Visit our GitHub repositories for the frontend and backend. Test it out and report any issues. Developers, feel free to contribute by submitting pull requests (PRs).