Enhancing Performance of an AI Assistant: A Journey Towards Efficiency and Accuracy

Artificial Intelligence (AI) has witnessed a surge in popularity across websites as businesses and individuals recognize its immense potential. One of the prominent applications of AI in websites is semantic search, which revolutionizes the way users interact with online platforms.

At True Sparrow, a product boutique studio, we implemented an AI Assistant for one of our clients using the GPT-4 model. Although we initially achieved good accuracy, we ran into high costs and latency. In this blog post, we describe how we optimized our approach for cost, speed, and accuracy.

Proof of Concept

To find the best solution, we conducted a small proof of concept (POC) on the Etherscan platform. Our objective was to direct the user to the correct information based on their natural language search queries. We provided the GPT-4 model with System and User prompts, including descriptions of various Etherscan explorer URIs.

We included descriptions of the contents of URIs in the User Prompt, as demonstrated in the example below.

System Prompt

You are an AI assistant of Etherscan blockchain explorer and an expert at selecting the best-suited Etherscan URI that can answer the user's question related to the Ethereum network.

User Prompt

Here are the various Etherscan explorer URIs that you can use to answer my question related to the Etherscan Ethereum blockchain network. Each URI has a description. Sometimes there are one or more sample input questions for which the given URI is the best-suited one. 

Please pay attention to these details.

URI: https://etherscan.io/tokentxns
Description: The page displays a list of token transfers, token mints, or token burns on Ethereum chains. The list provides information about tokens, transaction hash, from & to addresses, method, age, and value. This list is arranged in ascending order of age.

Here's my question: {{user_question}}

Considering my question, choose the best-suited URI from the ones listed above. Ensure that the URI is built using appropriate URI parameters from the user question and you provide the updated URI with updated variables.

Only make use of the information about the URIs listed above, their descriptions, and sample input questions. Do not use any other information. Do not create your own URIs.

Your response should start with "https://" string and contain only a URI string. If you don't know the answer, please respond "I don't know".

Note that the user prompt above shows the description for a single URI (https://etherscan.io/tokentxns). In our proof of concept (POC), we prepared similar descriptions for 42 URIs listed on the Etherscan platform and included all of them in the prompt.
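As a rough illustration of how a prompt like this can be assembled from a list of URI descriptions, here is a minimal Python sketch. The UriEntry class, the function names, and the "Sample question:" label are our own assumptions for illustration; the prompt wording is taken from the example above.

```python
from dataclasses import dataclass
from typing import Sequence

# Opening and closing text of the user prompt, copied from the example above.
HEADER = (
    "Here are the various Etherscan explorer URIs that you can use to answer "
    "my question related to the Etherscan Ethereum blockchain network. "
    "Each URI has a description. Sometimes there are one or more sample input "
    "questions for which the given URI is the best-suited one.\n\n"
    "Please pay attention to these details."
)

FOOTER = (
    "Considering my question, choose the best-suited URI from the ones listed "
    "above. Ensure that the URI is built using appropriate URI parameters from "
    "the user question and you provide the updated URI with updated variables."
    "\n\n"
    "Only make use of the information about the URIs listed above, their "
    "descriptions, and sample input questions. Do not use any other "
    "information. Do not create your own URIs.\n\n"
    'Your response should start with "https://" string and contain only a URI '
    "string. If you don't know the answer, please respond \"I don't know\"."
)


@dataclass(frozen=True)
class UriEntry:
    """One candidate page (hypothetical structure, not from the original POC)."""
    uri: str
    description: str
    sample_questions: Sequence[str] = ()


def render_entry(entry: UriEntry) -> str:
    """Render one URI block in the 'URI: / Description:' layout shown above."""
    lines = [f"URI: {entry.uri}", f"Description: {entry.description}"]
    # The label used for sample questions is an assumption.
    for question in entry.sample_questions:
        lines.append(f"Sample question: {question}")
    return "\n".join(lines)


def build_user_prompt(entries: Sequence[UriEntry], user_question: str) -> str:
    """Join header, all URI blocks, the question, and the closing instructions."""
    parts = [HEADER]
    parts.extend(render_entry(entry) for entry in entries)
    parts.append(f"Here's my question: {user_question}")
    parts.append(FOOTER)
    return "\n\n".join(parts)


entries = [
    UriEntry(
        uri="https://etherscan.io/tokentxns",
        description="The page displays a list of token transfers, token mints, "
                    "or token burns on Ethereum chains.",
    )
]
print(build_user_prompt(entries, "Show me the latest token transfers"))
```

With 42 entries in the list, every request carries all 42 descriptions, which is where the prompt size comes from.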

However, as our user prompts grew larger, we faced challenges such as increased costs and higher latency.

Exploring Alternative Approaches

In our pursuit of optimization, we tested two alternatives: the GPT-3.5-turbo-16k model and Vector Embeddings.

GPT-3.5-turbo-16k Model: Cost-effective and Efficient

The GPT-3.5-turbo-16k model cost nearly one-tenth as much as GPT-4 while offering similar performance and lower latency. It can handle a maximum of 16,384 tokens, which made it a cost-effective and efficient choice for our long prompts. We used System and User prompts similar to the ones we used with GPT-4 for our experiments.

Vector Embeddings: Speed and Scalability

Vector Embedding is a technique that represents words or phrases as numerical vectors in a high-dimensional space. It allows for efficient computation and comparison of similarities between different words or phrases. To create vector embeddings for each URI description, we utilized OpenAI's text-embedding-ada-002 model. According to OpenAI's documentation, the text-embedding-ada-002 model offers improved performance, cost-effectiveness, and simplicity in generating embeddings.

To store these embeddings, with their corresponding URIs in the metadata, we opted for the Pinecone vector database. We had also worked with the Weaviate vector database in a previous client project (Jam). Both databases use high-dimensional vectors to map and retrieve relevant information efficiently, and both provide fast, scalable search, which made them a good fit for our needs.
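To make the similarity lookup concrete, here is a minimal in-memory sketch of what a vector database query does conceptually. It is not Pinecone's API: the toy_index dictionary, the 3-dimensional vectors, and the extra URIs are illustrative assumptions (real text-embedding-ada-002 vectors have 1,536 dimensions), and we use cosine similarity as the metric here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k (uri, score) pairs most similar to the query, best first."""
    scored = [(uri, cosine_similarity(query_vec, vec)) for uri, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for the stored URI description embeddings (made-up values).
toy_index = {
    "https://etherscan.io/tokentxns": [0.9, 0.1, 0.0],
    "https://etherscan.io/blocks": [0.1, 0.9, 0.1],
    "https://etherscan.io/txs": [0.6, 0.3, 0.2],
}

# Toy stand-in for the embedding of a user's search query.
query = [0.8, 0.2, 0.0]

for uri, score in top_k(query, toy_index, k=2):
    print(f"{score:.3f}  {uri}")
```

A hosted vector database performs the same kind of ranking, but over an index built for fast approximate search across many vectors.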

To compare the performance of GPT-4, GPT-3.5-turbo-16k, and Vector Embeddings for our POC, we ran the experiments described below.

First Experiment

To ensure a comprehensive and unbiased assessment of GPT-4, GPT-3.5-turbo-16k, and Vector Embeddings, we adopted a systematic approach.

Our evaluation process involved using identical prompts (as mentioned in the previous section) and a diverse set of search queries comprising real-world examples as well as grammatically correct and incorrect queries.

We employed our home-grown Prompt-Evaluation tool to test the performance of GPT-4 and GPT-3.5-turbo-16k models. For Vector Embeddings, we retrieved the best match (k=1) from Pinecone and compared it against the desired result using a script.

All models were subjected to the same set of test cases, and we obtained the following accuracy percentages:

GPT-4: 89%
GPT-3.5-Turbo-16k: 84%
Vector Embeddings: 59%

Vector Embeddings were markedly less accurate than the GPT models in this evaluation: 59%, compared with 84% for GPT-3.5-turbo-16k and 89% for GPT-4.
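For readers who want to reproduce this kind of measurement, the sketch below shows one way to score model responses against expected URIs. It is not our Prompt-Evaluation tool; the normalize and accuracy helpers and the sample data are hypothetical, and the normalization rule (trim whitespace and a trailing slash) is an assumption.

```python
from typing import Sequence


def normalize(uri: str) -> str:
    """Trim whitespace and a trailing slash so trivial differences don't count as misses."""
    return uri.strip().rstrip("/")


def accuracy(predictions: Sequence[str], expected: Sequence[str]) -> float:
    """Percentage of predictions that match the expected URI after normalization."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equal in length")
    correct = sum(
        normalize(pred) == normalize(exp)
        for pred, exp in zip(predictions, expected)
    )
    return 100.0 * correct / len(expected)


# Hypothetical responses for three test queries.
predictions = [
    "https://etherscan.io/tokentxns",
    "I don't know",                   # counted as a miss
    "https://etherscan.io/blocks/ ",  # matches after normalization
]
expected = [
    "https://etherscan.io/tokentxns",
    "https://etherscan.io/txs",
    "https://etherscan.io/blocks",
]

print(f"accuracy: {accuracy(predictions, expected):.1f}%")
```

Running every model against the same expected list keeps the percentages comparable.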

Second Experiment

To improve Vector Embeddings’ accuracy, we modified our approach in the second experiment. Instead of fetching a single match, we retrieved the top 5 best matches from the Pinecone database for each query and counted a query as correct when the desired URI appeared among them. Measured this way, accuracy rose to 96%: the correct result usually lay within the 5 nearest matches in the vector database, even when it was not the first match.
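The difference between the two experiments comes down to the value of k in a "hit rate at k" metric. The sketch below is a minimal illustration with made-up data; the function name and the placeholder labels (uri_a, uri_b, and so on) are assumptions, not our actual test set.

```python
from typing import Sequence


def hit_rate_at_k(
    ranked_results: Sequence[Sequence[str]],
    expected: Sequence[str],
    k: int,
) -> float:
    """Percentage of queries whose expected URI appears in the top k matches."""
    if not expected or len(ranked_results) != len(expected):
        raise ValueError("ranked_results and expected must be non-empty and equal in length")
    hits = sum(
        exp in ranked[:k]
        for ranked, exp in zip(ranked_results, expected)
    )
    return 100.0 * hits / len(expected)


# Made-up ranked matches for four queries, best match first.
ranked_results = [
    ["uri_a", "uri_b", "uri_c"],
    ["uri_b", "uri_a", "uri_d"],
    ["uri_c", "uri_d", "uri_a"],
    ["uri_d", "uri_c", "uri_b"],
]
# The desired URI for each of the four queries.
expected = ["uri_a", "uri_a", "uri_a", "uri_a"]

for k in (1, 3):
    print(f"hit rate at k={k}: {hit_rate_at_k(ranked_results, expected, k):.0f}%")
```

A larger k can only keep or raise the hit rate, but it leaves the final choice among the k candidates to a later step.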

Experiment Takeaway and our Hybrid Solution

Through these experiments, we identified the strengths and weaknesses of each approach. The GPT-3.5-turbo-16k model was accurate at a fraction of GPT-4's cost, while the vector database was fast, scalable, and inexpensive but less precise when limited to a single match. Consequently, we devised a hybrid solution that leveraged the strengths of both approaches.

Our hybrid approach involved retrieving the top 5 matches from the Pinecone database. These matches served as the basis for creating User and System Prompts, similar to the ones discussed in the previous sections. We then utilized these prompts with the GPT-3.5-turbo-16k model to generate the final result.

By reducing the number of tokens sent to the GPT model (using only 5 URI descriptions instead of 42), we achieved a significant cost reduction. The smaller prompt also improved the accuracy of the final results from the GPT-3.5-turbo-16k model: with only 5 URIs to choose from, the model could give more precise answers.
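The hybrid flow described above can be sketched as a two-step function. This is an illustration under stated assumptions, not our production code: retrieve stands in for the vector database query and complete stands in for the GPT-3.5-turbo-16k call, both passed in as plain callables; the stubs, the shortened closing instruction, and the placeholder descriptions are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Candidate = Tuple[str, str]  # (uri, description)

# System prompt copied from the example earlier in this post.
SYSTEM_PROMPT = (
    "You are an AI assistant of Etherscan blockchain explorer and an expert at "
    "selecting the best-suited Etherscan URI that can answer the user's "
    "question related to the Ethereum network."
)


def answer(
    question: str,
    retrieve: Callable[[str, int], Sequence[Candidate]],
    complete: Callable[[str, str], str],
    k: int = 5,
) -> Optional[str]:
    """Retrieve the k nearest URI descriptions, then let the model pick one."""
    # Step 1: narrow the URI descriptions down to the k nearest matches.
    candidates = retrieve(question, k)

    # Step 2: build a much smaller user prompt from those candidates only.
    listing = "\n\n".join(
        f"URI: {uri}\nDescription: {description}"
        for uri, description in candidates
    )
    user_prompt = (
        f"{listing}\n\n"
        f"Here's my question: {question}\n\n"
        # Shortened version of the closing instructions (assumption).
        "Choose the best-suited URI from the ones listed above. "
        'Your response should start with "https://" and contain only a URI '
        "string. If you don't know the answer, please respond \"I don't know\"."
    )

    # Step 3: ask the model and accept only replies that look like a URI.
    reply = complete(SYSTEM_PROMPT, user_prompt).strip()
    return reply if reply.startswith("https://") else None


def fake_retrieve(question: str, k: int) -> Sequence[Candidate]:
    """Stub for the vector search; returns fixed placeholder candidates."""
    catalog = [
        ("https://etherscan.io/tokentxns", "List of token transfers, mints, and burns."),
        ("https://etherscan.io/txs", "Placeholder description for a second page."),
    ]
    return catalog[:k]


def fake_complete(system_prompt: str, user_prompt: str) -> str:
    """Stub for the chat model; simply returns the first URI in the prompt."""
    for line in user_prompt.splitlines():
        if line.startswith("URI: "):
            return line[len("URI: "):]
    return "I don't know"


print(answer("Show recent token transfers", fake_retrieve, fake_complete))
```

Passing the two steps in as callables keeps the flow testable without network access; in a real setup they would wrap the vector database client and the chat model client.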

Observation: Comparative Performance Analysis

Conclusion

By combining the GPT-3.5-turbo-16k model with vector embeddings, we optimized our AI Assistant system. The hybrid approach reduced costs, kept response times fast, and reached an accuracy of 87.5%, compared with 84% for GPT-3.5-turbo-16k with the full 42-URI prompt in our first experiment.
