True Sparrow Blog
https://truesparrow.com/blog/

What’s going on with you, Sirji?
https://truesparrow.com/blog/whats-going-on-with-you-sirji/
Mon, 15 Apr 2024 14:48:37 GMT
If you are new here, Sirji is an open-source AI software development agent. Internally it works as a combination of planner, coder, researcher, and executor agents. We are building Sirji as a VS Code extension.

We launched the Sirji VS Code extension on both the Visual Studio marketplace and the Open VSX repository last week.

The positive feedback so far: The installation is a breeze (under 10 seconds, seriously!), and the UI is quite intuitive. The extension seamlessly integrates with VS Code, making it super easy to use in the local development environment.

Of course, along with the positive feedback, there have been hiccups, problems and frustrations that our early users reported. Keep them coming, please. These are super valuable as we continue building Sirji.

And since we are building Sirji in public, best to share all our challenges publicly as well.

Here’s a compilation of all of Sirji’s challenges.

Leaving things incomplete

We've noticed several recurring issues where Sirji says it is done solving the problem but doesn’t fully implement the expected functionalities. The key problems include:

  • Frontend and backend integration: Often, Sirji skips the necessary integration between the frontend and backend components of the projects. This leaves the features non-functional even though Sirji reports completion.
  • Frontend implementation issues: In some scenarios, the frontend components of a feature are not implemented at all.
  • Linking between files is missing: Sirji frequently neglects to link files correctly. For example, while developing a React App, it usually misses linking files from the main App.js file.
  • Assumed changes: There have been instances where Sirji hallucinates and acts as though certain changes or additions have been made to the code without actually implementing them.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/56

Unable to debug web errors

Currently, Sirji cannot detect when a user encounters errors with web pages. This limitation hinders Sirji's autonomous debugging abilities, making it reliant on users to report errors by copying and pasting them as manual feedback. Enhancing Sirji with error detection for web page interactions would greatly improve its self-diagnostic and troubleshooting capabilities.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/57

Inability to recognize its own capability

Sirji often displays the message "Outside of my capability" for tasks that are actually within its capabilities, leading to unnecessary interruptions in workflow.

This issue is particularly notable in these scenarios:

  • Installation of databases (like MongoDB)
  • Verification of an installation

GitHub Issue: https://github.com/sirji-ai/sirji/issues/58

Invalid message format

Sirji sometimes receives messages from the LLM in an invalid format. We have an error-catching mechanism in which we ask the LLM again to respond as per the defined message protocol. Even after 2 retries (calling the OpenAI completions API again with the feedback that the last message was invalid), the LLM sometimes still doesn’t respond in the proper format. This primarily happens when the LLM has to give the user some information to act on, but that action is missing from the protocol.
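For illustration, here is a minimal sketch of such a retry loop (not Sirji’s actual implementation); the protocol check is_valid and the feedback wording are assumptions.

from openai import OpenAI

client = OpenAI()
MAX_RETRIES = 2  # matches the two retries described above

def request_protocol_message(messages, is_valid):
    """Call the chat completions API, re-prompting with feedback until the
    reply follows the message protocol or the retries are exhausted."""
    for attempt in range(MAX_RETRIES + 1):
        reply = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
        ).choices[0].message.content
        if is_valid(reply):
            return reply
        # Feed the error back so the model can correct its format.
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "The last message did not follow the defined message protocol. Respond again using only the protocol."},
        ]
    raise ValueError("LLM did not return a protocol-compliant message after retries")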

GitHub Issue: https://github.com/sirji-ai/sirji/issues/59

Installing system-wide packages without confirming

At times, Sirji installs system-wide packages like Python without seeking user approval. A confirmation step would ensure users are aware of, and agree to, changes being made to their system.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/60

Server issues

We're encountering multiple issues with how Sirji handles server processes, particularly starting, restarting, and logging.

Below are the specific problems:

  • Server start-up: Occasionally, Sirji forgets to start the server as expected after the solution is complete.
  • Server restart issue: Sirji does not automatically restart the server after updating the code. Consequently, the modifications are not reflected.
  • Log file overwriting: The logs from running servers are not uniquely preserved. They risk being overwritten by subsequent processes.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/61

Files are created outside of workspace

There is an issue where Sirji occasionally attempts to create files and folders outside the designated workspace. This behavior is concerning because it can interfere with the user's other files. A specific instance of this problem was observed when Sirji navigated to the home directory (~) and started creating files there (oops!).

GitHub Issue: https://github.com/sirji-ai/sirji/issues/62

Can’t use third-party APIs that need keys

Sirji gives up or writes only stub code when third-party APIs need keys. For Sirji to effectively interact with third-party APIs, it needs to ask users to input the necessary API keys. Users should be prompted to enter these keys into an environment variable file, enabling Sirji to securely access them from the generated code.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/63

UI from the 90s

Unless specifically directed, Sirji creates webpages like they are from the 1990s.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/64

Progress steps are buggy

The steps are not always updated to their respective status from ‘not running’ to 'running' to 'completed'. This inconsistency affects the overall task tracking.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/65

Continuing before package installations complete

Sirji does not wait for long-running package installations to complete. This leads to subsequent steps erroneously assuming the necessary packages are installed, causing failures.

GitHub Issue: https://github.com/sirji-ai/sirji/issues/66

Any other problems that you encountered?

Our next milestone is going to be about making Sirji reliable. Stay tuned!

Sirji Updates - VS Code Extension (launch)
https://truesparrow.com/blog/sirji-updates-vs-code-extension-launch/
Fri, 12 Apr 2024 13:27:00 GMT

After releasing the dogfood version, we got busy reimagining Sirji as a VS Code extension. This extension brings an interactive chat interface right within your Visual Studio Code IDE.

Here’s what the architecture looks like for Sirji's VS Code extension.

[Diagram: Sirji VS Code extension architecture]

And today I present to you the first version of the extension. Install it in under 5 seconds from Visual Studio Marketplace or the Open VSX repository & check it out for yourself.

Quoting from Sirji’s README:

Sirji is a Visual Studio Code extension that works as an AI software development agent.

It is a virtual software developer that includes individual agents for planning, coding, researching, and executing projects.

Sirji solves users’ problem statements ranging from a new greenfield project to an existing brownfield project where it enhances existing code, fixes bugs, and writes test cases.

The extension leverages the capabilities of VS Code, including the Editor, Terminal, and Project Explorer.

It provides an interactive chat interface through which users submit their problem statements, answer questions, and provide feedback to Sirji.

Additionally and most importantly, Sirji sets up your local development environment by installing system-level packages as well as programming language-specific dependencies. It later executes the generated code in your local development environment.

Sirji Updates - Episode 3 (The first run)
https://truesparrow.com/blog/sirji-updates-episode-3-the-first-run/
Sat, 23 Mar 2024 07:01:33 GMT

So the end-to-end stitching for the dogfood release is done. And we took Sirji for a spin. The problem statement was:

Make a simple website with a server that uses Yahoo API to get the stock price for a user entered stock.

And… it threw an error! 😂

#sirji #phatgaya

Fixed the errors and THEN… Sirji had its FIRST successful run. Yay! 🚀

That’s HUGE even if it is just one run. We will be testing with a variety of different problem statements.

And I wouldn’t say we are done with the dogfood release yet. The main functionality remaining for this release is adding debugging capabilities (including user feedback & interaction).

More on it next week but for now, please check the latest code here: https://github.com/sirji-ai/sirji/

And don’t forget to ⭐ the repo!

Sirji Updates - Episode 2
https://truesparrow.com/blog/sirji-updates-episode-2/
Thu, 21 Mar 2024 15:05:34 GMT

Sirji’s dogfood release is just around the corner, and we have a whole bunch of updates.

Interface

Dogfood is going to be our very first release, and it won’t have a browser-based interface. Instead, it’s going to be a basic command line interface, with terminal windows showing logs from the various agents.

Kedar talks about what you can expect from the interface of the dogfood release.

Alongside, Rachin is already exploring Sirji’s eventual IDE and why he’s drawn towards solutions like codespace.

[Embedded video]

Prompts & Messages

We have standardized the system prompts for each agent, and created a simple protocol for inter-agent communication.  

Here’s Sunil’s update on the messages & prompts packages.

Stitching It Together

The good news is the Planner & Coder integration is done, and they are talking to each other. But the Executor is throwing errors. And that’s what we are debugging right now.

More tomorrow...

Sirji Updates - Episode 1
https://truesparrow.com/blog/sirji-updates-episode-1/
Tue, 19 Mar 2024 14:18:40 GMT

Yesterday, we told the world about Sirji, an open-source AI software development agent, inspired by Devin.

And with that, we got to work. We promised to build in public. And so, here's our first update on Sirji.

Btw, we are not going to follow a cadence of daily updates. Maybe we will have much more frequent updates for the next few days since there's a lot going on. We will then start dialing it down.

Here's the GitHub repo: https://github.com/sirji-ai/sirji

Today was about releasing the first version of the Researcher module which enables Sirji to gain new knowledge and infer from it.

Researcher

Whenever Sirji comes across requirements involving knowledge points outside of its existing knowledge, it invokes the Researcher module. Researcher is based on the RAG (Retrieval-Augmented Generation) framework.

The current implementation uses the OpenAI Assistants API. We have taken care of making the module composable, which will make strategy changes easier (described in detail below).

The Researcher module has 2 main parts: Embeddings Manager and Inferer.

Embeddings Manager

There can be different strategies for implementing the Embeddings Manager. The factory and strategy design patterns are used to improve composability and to make the addition of new strategies easy. Presently, the OpenAI Assistants API is used to upload new documents to the assistant. The Embeddings Manager has two major functions (a simplified sketch follows the list):

  1. Index: This is where the new knowledge points are indexed.
  2. Retrieve Context: This is where the matching context, based on the problem statement, is retrieved and passed to the Inferer as a part of the prompt. In the current OpenAI Assistants API implementation, this step is not needed, so it is implemented as an empty method. If we use a vector database for storing embeddings, we would implement this part to shortlist and populate the retrieved context using an embeddings match.
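Here is a simplified sketch of what that strategy/factory arrangement can look like; the class and method names are illustrative, not Sirji’s actual code.

from abc import ABC, abstractmethod

class EmbeddingsManager(ABC):
    """Strategy interface: index new knowledge and retrieve matching context."""

    @abstractmethod
    def index(self, document_path: str) -> None: ...

    @abstractmethod
    def retrieve_context(self, problem_statement: str) -> str: ...

class OpenAIAssistantEmbeddingsManager(EmbeddingsManager):
    def index(self, document_path: str) -> None:
        ...  # upload the document to the OpenAI assistant (API call omitted)

    def retrieve_context(self, problem_statement: str) -> str:
        # Not needed for the Assistants API strategy (retrieval happens inside
        # the assistant), so this is intentionally an empty method.
        return ""

class EmbeddingsManagerFactory:
    """Factory: strategies are looked up by name, so new ones slot in easily."""
    _strategies = {"openai_assistant": OpenAIAssistantEmbeddingsManager}

    @classmethod
    def create(cls, name: str) -> EmbeddingsManager:
        return cls._strategies[name]()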

Inferer

In this part, the LLM is called to infer using a prompt that has both the problem statement and the retrieved context from the previous part. In the present OpenAI Assistants API implementation, the inference is made on the same assistant (the assistant ID is preserved in the object). There can be different strategies for implementing this part too, and to keep it composable, we have again used the strategy and factory design patterns.

Fun Fact

When developing the Researcher module, we needed to go through the OpenAI Assistants API documentation. This documentation was outside the knowledge of our LLM (gpt-4-turbo-preview). So the model was not able to assist us in development. Rather than going through the documentation manually, we thought of using the Sirji approach to research. We manually indexed (manual, since the automated process is what we needed to develop) a new assistant with the PDF prints of the documentation. After this indexing, the assistant helped us to write the Researcher. This also proved to us that the Sirji way of research works!

Getting Hands Dirty

Let’s run the Researcher! Follow the steps here to run the Researcher module on your machine.

Crawler

The Crawler module is used by the Researcher module. Depending on the type of the URL (PDF file, GitHub repo, or just a webpage URL), the Crawler module implements different strategies (thus using the Strategy design pattern). Also, a Factory class provides encapsulation around the strategy selection.

Here is how the Crawler module handles different types of URLs (a simplified sketch follows the list):

  • When the crawler discovers a PDF URL, it uses PyPDF2 to extract the information and stores it as a markdown file.
  • If the URL points to a GitHub repository, the crawler clones it and stores it in the specified place.
  • For all other URLs, the crawler utilizes BeautifulSoup to explore the webpage and store the collected data as a markdown file.
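A simplified sketch of that selection logic (names are illustrative, not the module’s actual code):

import subprocess
from abc import ABC, abstractmethod
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader

class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, destination: str) -> None: ...

class PdfCrawlStrategy(CrawlStrategy):
    def crawl(self, url: str, destination: str) -> None:
        # Extract text from the PDF and store it as a markdown file.
        reader = PdfReader(BytesIO(requests.get(url).content))
        text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
        with open(destination, "w") as f:
            f.write(text)

class GithubCrawlStrategy(CrawlStrategy):
    def crawl(self, url: str, destination: str) -> None:
        # Clone the repository into the destination folder.
        subprocess.run(["git", "clone", url, destination], check=True)

class WebPageCrawlStrategy(CrawlStrategy):
    def crawl(self, url: str, destination: str) -> None:
        # Parse the page with BeautifulSoup and store the text as markdown.
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        with open(destination, "w") as f:
            f.write(soup.get_text(separator="\n"))

def crawler_factory(url: str) -> CrawlStrategy:
    """Encapsulates strategy selection based on the URL type."""
    if url.lower().endswith(".pdf"):
        return PdfCrawlStrategy()
    if "github.com" in url:
        return GithubCrawlStrategy()
    return WebPageCrawlStrategy()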

Logger

The Logger module accepts two parameters: the file name where the log file will be stored and the log level. We have utilized the singleton design pattern to ensure the use of a single instance for each file name. The instance is created only when it's first requested for a particular file name, and the same instance is returned when requested again for the same file. All the logs are stored under the workspace/logs folder. The workspace folder is added to .gitignore to avoid accidental commits.
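A minimal sketch of that singleton-per-file-name idea using Python's logging module (illustrative, not the actual Logger code):

import logging
import os

LOG_DIR = "workspace/logs"
_loggers = {}  # one logger instance per file name

def get_logger(file_name: str, level: int = logging.INFO) -> logging.Logger:
    """Return the same instance on repeated requests for the same file name."""
    if file_name not in _loggers:
        os.makedirs(LOG_DIR, exist_ok=True)
        logger = logging.getLogger(file_name)
        logger.setLevel(level)
        handler = logging.FileHandler(os.path.join(LOG_DIR, file_name))
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        _loggers[file_name] = logger
    return _loggers[file_name]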

Parallel Research

Along with starting concrete development of the various components of Sirji, our team is also researching several related topics.

We are looking into self-hosted cloud development environments like Visual Studio Code Server & Coder, which will allow us to securely connect to remote machines from anywhere through a local client, with or without SSH.

Call for Contributions

If you are interested in contributing to Sirji, here are a few enhancement requests which you can take up.

Cross-Pollination is all you need: Transfer learning in AI
https://truesparrow.com/blog/crosspollination-is-all-you-need-transfer-learning-in-ai/
Tue, 12 Mar 2024 10:44:39 GMT

In our previous blog, we talked about the theory related to transfer learning. We discussed the recommended method for transfer learning: Freeze the layers with weights from the pretrained model for the first few epochs and only train the new or the last few layers during this time. But is this really the best approach?

That’s what we will try to find out in this post. We will share our experiments, our methodology, and the results.

If you want to cut to the chase and find the best method for transfer learning, just scroll down to the last section of this post. But if you enjoy following through on our journey, read along!

The Dataset

For our experiments, we used the quickdraw doodle classification dataset; specifically, we used the 28x28 NumPy doodle images. These images were from 10 classification categories - tractor, toothpaste, toothbrush, skull, spider, toilet, mountain, sword, marker, and sheep. The training and the test dataset consisted of 10,000 images for each category. Following are some sample images.

[Image: sample doodle images]

Apart from just observing the model performances on the clean test dataset, we also added some synthetic noise to the data to see the out-of-the-box noise robustness of each training methodology. Following are some noisy images.

[Image: noisy doodle images]

We used EfficientNet b0, a model built to solve the ImageNet classification task. We devised various strategies to build models using transfer learning, as well as a basic CNN, to see how well each transfer learning method performed against the others and against the basic CNN model.

The Training Approaches

The training approaches we used were:

  • Approach 1: Basic CNN built from scratch.
  • Approach 2: EfficientNet b0 architecture used without porting weights at all. This will help us know exactly how powerful the architecture is, without the knowledge gained from solving the imagenet task.
  • Approach 3: Transfer learning in the recommended way - initialize pretrained weights of unmodified layers, freeze the pretrained weights for a few epochs, then unfreeze and train the whole network. Usually, it is suggested to freeze till model convergence. Since we have initialized only 2 layers, and our dataset size is 100,000, we froze it for 2 epochs.
  • Approach 4: Transfer learning using gradient clipping, where we clip to 1.0 the gradient that backpropagates through the entire network. This stops the NN from completely forgetting its ImageNet knowledge due to the large gradients generated by the randomly initialized layers in the first few steps. This is our alternative to the recommended way (see the sketch after this list).
  • Approach 5: Transfer learning in a very unrecommended way, i.e. initialize pretrained weights and modify them right from the beginning. No gradient normalization and no weight freezing. This approach is unrecommended, theoretically, because the huge gradients from randomly initialized layers cause the NN to forget its pretrained knowledge, by moving too far and too quickly from the optimal state it was in.
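To make approach 4 concrete, here is a minimal PyTorch sketch: every layer stays trainable from the first step, but the gradient norm is clipped to 1.0 before each update. The model, data loader, and hyperparameters are placeholders.

import torch
import torch.nn as nn

def train_with_grad_clipping(model, train_loader, epochs=5, lr=1e-3):
    """Approach 4: no freezing, but clip gradients so the randomly initialized
    layers cannot wipe out the pretrained ImageNet knowledge early in training."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            # Limit the total gradient norm to 1.0 before the weight update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
    return model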

Here is a validation accuracy summary for the clean and noisy validation images.

[Table: validation accuracy summary, clean vs noisy, approaches 1-5]

Inferences

Here are our inferences from the above results:

  • Basic CNN (approach 1) beats an EfficientNet if pretrained weights are not used (approach 2).
  • Once an EfficientNet has been initialized with pretrained weights (approaches 3, 4 & 5), we see that accuracy is much better as compared to basic CNN (approach 1) and EfficientNet without pretrained weights (approach 2). It shows the importance of knowledge, derived from solving the imagenet task, to utilize the full potential of EfficientNet architecture.
  • It also shows that AI models that solve a much harder task can solve a relatively simpler doodle classification task. Does it mean that EfficientNet is the best and most efficient way to solve the doodle classification task? No, seeing that EfficientNet has ~4.65x the parameters and still betters the basic CNN model by only 2%. A bigger basic CNN model could be a way forward if you want to build an NN that solves the doodle classification task most efficiently. And that’s what we had done for Thursday’s Doodle Race game.
  • The results also show that the various transfer learning ways (approaches 3, 4 & 5) are very close, with approach 4 narrowly winning. This contrasts with what the theory suggests: the recommended way should win quickly and convincingly.
  • A very curious thing unfolds when you look at the noisy accuracies of the models. The best out-of-the-box noise robustness comes out of the model built using our alternative to the theoretically recommended way (approach 4).
  • What is shocking here is the noise robustness of the model trained in the recommended way (approach 3). It is much worse than the unrecommended way (approach 5), our alternative (approach 4), and even the model built without the knowledge derived from solving the ImageNet task (approach 2). This shows how heavily the recommended way optimizes for the exact dataset, and how it comes crashing down when the dataset is slightly modified; at least, that's what the above results indicate.

So, can we conclusively state from the above results that the unrecommended way (approach 5) works just as well as the recommended way? No, because an argument can be made that for a simple 10-class doodle classification task, a 100,000-image fine-tuning dataset is very big, and hence the huge, diverse data itself contains enough information to negate the knowledge-forgetting that happens in the unrecommended (approach 5) fine-tuning strategy. To know this for sure, we repeated the same experiments after reducing the dataset from 10,000 to 100 images per category.

[Table: validation accuracy summary with 100 images per category]

In this run too, we see that our alternative to the recommended way (approach 4) beats the theoretically recommended way (approach 3) convincingly and comes very close to beating the unrecommended way (approach 5).

If we had seen the recommended way winning significantly in this low-dataset-size setting, we could have concluded that the unrecommended way forgot important previous knowledge and hence the recommended way won.

Now, this should not stand as a testament to always completely ignoring the theory and the recommended way. This is because, perhaps, in this dataset, the pretrained model’s weights could have fallen together in such a way that knowledge forgetting was not happening significantly even when layers were initialized randomly. Also, the number of layers/neurons randomly initialized was much less compared to the number of unmodified and pretrained weight layers.

So, what have all these experiments been about?

Once again, these experiments have been about verifying the theoretically recommended way because we saw in our previous blog that it hurts performance in a few cases. These experiments have also been about, perhaps, finding a better alternative to the theoretically recommended way.

So, which way is the best way for transfer learning?

AI is a sum of lots of variables and not something we fully understand. The best bet, according to us and these experiments, would be to clip the gradient and immediately start training all the layers, rather than freezing them for a few steps and training only the randomly initialized layers.

This conclusion is based purely on our own experiments and on a few other use cases (this and this), which say that freezing layers actually HURTS performance with a large enough dataset.

Of course, if you can afford it, it’s better to experiment with all the above methods and more, but we would still start the experiments from the gradient clipping method.

Co-Maintaining NIAH: A Journey in Open Source and Model Benchmarking
https://truesparrow.com/blog/niah-journey-in-open-source-and-model-benchmarking/
Sat, 09 Mar 2024 04:32:01 GMT

Recently, I took the opportunity to co-maintain an open-source repository called Needle in a Haystack (NIAH). NIAH is a benchmarking technique for Large Language Models (LLMs), where the model is tasked with identifying specific information (the "needle") within a larger context (the "haystack"). Originally authored by Greg Kamradt for GPT-4 and Claude 2.1, the test has gained popularity, with Google and Anthropic adopting it for Gemini 1.5 and Claude 3, respectively. Here’s a short video from Greg, on his thoughts on the future of NIAH.

Embracing the New Opportunity

Greg needed help managing an increase in activity on his project. He reached out on Twitter seeking co-maintainers. I saw this as a chance to learn and broaden my experience in the open-source world. Although I've contributed to many projects before, this was a new opportunity for me to take on a different role. I quickly sent Greg a summary of my experience, and after a brief call, we discovered a shared passion. We decided to collaborate on making the NIAH repository the go-to place for model benchmarks.

The First Ask

When I began my journey as a co-maintainer, the first task was deciding between two conflicting pull requests (PRs) aimed at solving the same problem. Let me walk you through the decision-making process.

The problem at hand was the need to separate the model interaction layer in NIAH to facilitate easy addition of new models for testing.

The first PR proposed an inheritance approach. It introduced a base class with an abstract model interaction layer, which was then implemented in child classes. However, this approach created an is-a relationship that felt unintuitive. Additionally, it posed limitations for future modifications, as a class cannot inherit from multiple parents.

On the other hand, the second PR suggested a composition approach. It employed the strategy pattern to allow for a more flexible and composable way of injecting the model interaction layer into the test. This approach, with a has-a relationship, was more intuitive and scalable. Moreover, it also addressed the need to separate the evaluator layer using the same approach, which I fully supported.
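To illustrate the difference, here is a rough sketch of the composition (has-a) approach; the class and method names are hypothetical and not NIAH's actual API.

class OpenAIModel:
    def generate(self, prompt: str) -> str:
        ...  # call the OpenAI API here

class AnthropicModel:
    def generate(self, prompt: str) -> str:
        ...  # call the Anthropic API here

class NeedleHaystackTester:
    """The tester *has a* model (and an evaluator) injected at construction,
    so adding a new model never requires touching the test logic itself."""

    def __init__(self, model, evaluator):
        self.model = model
        self.evaluator = evaluator

    def run(self, haystack: str, needle_question: str) -> float:
        answer = self.model.generate(haystack + "\n\n" + needle_question)
        return self.evaluator.score(answer)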

After discussing internally with Greg and Pavel (another co-maintainer), we agreed that the composition approach was the way to go. We thoroughly reviewed the second PR and provided feedback for necessary code changes. The developer was responsive and promptly implemented all the suggested modifications.

The Multi-Needle Enhancement

Both Gemini 1.5 and Claude 3 had used a multi-needle variant of the NIAH test to benchmark their models. Incidentally, Lance Martin (from Langchain) reached out to contribute in this direction. He implemented the multi-needle variant in NIAH and raised a PR, which I reviewed and suggested some changes to. I enjoyed the review task and also got to contribute to a core change: making the needle distribution uniform.

Synergy with True Sparrow

At True Sparrow, we highly prioritize and support open-source contributions. In addition to our client projects, we actively pursue meaningful contributions in the AI/ML domain. Over the past year, we've been involved in various fascinating projects, and we're now delving deeper into architectural concepts like the transformer model and tokenizer. Joining NIAH aligns well with this initiative.

An OpenSource Prompt Evaluator for Product Managers
https://truesparrow.com/blog/prompt-evaluator-for-product-managers/
Tue, 05 Mar 2024 11:27:42 GMT

At True Sparrow, we enjoy delving into the user persona, understanding their challenges, and brainstorming solutions to address them. Once a consensus is reached, we often provide low-fidelity demos to clients to gather feedback and assess whether we are on the right track. We continue this process until we maximize the value delivered to the target persona without introducing unnecessary complications to the experience.

In a project we did last year for Airstack, we implemented an AI assistant, enabling users to interact with GraphQL APIs using natural language. To evaluate the performance of a candidate prompt, we created over a hundred test cases, consisting of pairs of test inputs and expected outputs. As developers, we utilized the OpenAI Evals framework, which we extended for GraphQL support, to execute the test cases and assess the impact of prompt modifications.

We wondered though, why should evaluating prompts require technical expertise? Non-technical individuals, particularly product managers, can contribute significantly more to prompt modifications than technically skilled individuals who lack domain knowledge. However, product managers are not always comfortable with executing commands from the OpenAI Evals framework, and a more visual approach might be better suited.

In the following sections, I will walk you through our journey of solving the prompt evaluation problem for a product manager persona (or, more generally, a non-technical persona). Additionally, we will also discuss the various design patterns we used in the solution.

Evolving to the Solution

One thing was clear: We needed almost everything to be visual. Use of commands, etc was not the way forward. To provide a visualization of the testing process, we had to break it into logical steps.

We needed to experiment with different prompt versions and evaluate their performance. We called the whole process of trying out a prompt on different test cases an experiment.

The prompt should have variables which will take different values in different test cases. Since there are variables inside the prompt, we called it the prompt template.

A test case is a set of variable definitions (to be substituted into the prompt template) and an expected output.

To run an experiment, we ask the user which OpenAI model to use and which evaluation strategy to use; the evaluation strategy determines which OpenAI eval is run. Here's a screenshot of the dialog box that opens when running an experiment for a prompt template.

[Screenshot: experiment run dialog]

The following flowchart explains the various steps involved in an experiment run.

[Flowchart: steps in an experiment run]
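In code, one experiment run boils down to roughly the following loop (a sketch with hypothetical names, not the actual Prompt Evaluator implementation):

from openai import OpenAI

client = OpenAI()

def run_experiment(prompt_template, test_cases, model, evaluate):
    """Fill the template for each test case, infer, evaluate, and collect results."""
    results = []
    for case in test_cases:
        # 1. Replace the template variables with this test case's values.
        prompt = prompt_template.format(**case["variables"])
        # 2. Infer from the OpenAI model chosen for this experiment.
        output = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # 3. Evaluate against the expected output using the chosen strategy.
        results.append({
            "variables": case["variables"],
            "output": output,
            "passed": evaluate(output, case["expected_output"]),
        })
    return results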


Applied Design Patterns

Design patterns facilitate the reuse of experience in software development, enhancing efficiency and promoting best practices for creating scalable and maintainable solutions in various contexts. We applied two design patterns in developing the Prompt Evaluator.

1. Strategy

The Strategy design pattern enables runtime selection of algorithms. In our case, we allowed users to choose evaluation strategies and OpenAI models in an experiment run.
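As a rough sketch (names are illustrative, not the actual code), the strategies can live behind a common interface and be picked by name at runtime:

class ExactMatchEval:
    def evaluate(self, output: str, expected: str) -> bool:
        return output.strip() == expected.strip()

class ContainsEval:
    def evaluate(self, output: str, expected: str) -> bool:
        return expected.strip() in output

EVAL_STRATEGIES = {"exact_match": ExactMatchEval(), "contains": ContainsEval()}

def get_eval_strategy(name: str):
    """Return the evaluation strategy the user selected for this experiment run."""
    return EVAL_STRATEGIES[name]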

2. Chain of Responsibility

The Chain of Responsibility design pattern was applied by assigning specific tasks to each component of the process: generating prompts, inferring from the model, evaluating responses, and logging results. This ensured a clear division of responsibilities throughout the experiment run.

Conclusion

We recognize and value the diverse strengths each team member brings to the table. We utilized the product manager's domain knowledge and compensated for their lack of technical expertise by offering a visual tool. This enabled them to actively contribute to the prompt design phase by experimenting with different variations effectively.

Our code for Prompt Evaluator is open source! Visit our GitHub repositories for the frontend and backend. Test it out and report any issues. Developers, feel free to contribute by submitting pull requests (PRs).

How to Train Your D̶r̶a̶g̶o̶n̶ AI Model?
https://truesparrow.com/blog/how-to-train-your-ai-model/
Tue, 27 Feb 2024 09:17:13 GMT

You would think transfer learning is very straightforward and simple. And why not - you just initialize weights from a proven NN that solved a different, but similar enough, problem. Simple enough. Right?

  • Yes, if you just want to get something working.
  • No, if you want to make sense of it after delving deeper into it or you want to create the absolute best transfer-learned model there is.

Let me elaborate.

While working on Thursday, our team devised an amusing and interesting game idea called Doodle Race. What is the doodle race game? You draw and AI guesses! Whoever manages to make AI guess the fastest, wins!

We decided to use the quickdraw dataset, a collection of 50 million drawings across 345 doodle categories, to build a Convolutional Neural Network (CNN) architecture model. CNN architecture is used for understanding and interpreting images or visual data or even timeseries, speech, etc.

The smaller image sizes of the quickdraw dataset made it our best choice for experimenting while keeping computational costs lower. Our experiments began with a focus on finding the better of two solutions: creating a CNN from scratch or using the transfer learning technique on a pretrained model (like efficientnet-b0).

In the process, we also started experimenting to determine what is the best way of transfer learning. While an upcoming blog post will talk more about our experiments, this blog will be focused on the theory behind transfer learning, what theory says is the best approach to build any transfer learning model, and whether the theory has actually been proven right or not.

We started building a classification CNN model to distinguish 10 doodle categories using transfer learning on the efficientnet-b0 base model.

The efficientnet-b0 is trained to solve the much bigger ImageNet dataset, which consists of 1000 categories of real-world RGB images at 224x224 resolution, whereas our experimental dataset consists of grayscale, 28x28-resolution images from 10 categories.

It was clear enough that efficientnet-b0 was overkill in our scenario, and hence we had our doubts about whether it would hinder performance on this relatively simple task. Sit tight to know the answer in our upcoming blog post.

A sneak peek into our upcoming blog post: To solve the above-mentioned problem using transfer learning, we modify the input layer of the efficientnet-b0 model to accept grayscale images and the last layer to output only 10 categories instead of 1000. What this means is that only the modified layers were randomly initialized, while the rest of the neural network was initialized with weights from the pretrained model.
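For example, with torchvision's EfficientNet-B0 the two modifications look roughly like this; the exact layer indices assume torchvision's implementation, so treat them as illustrative.

import torch.nn as nn
from torchvision import models

# Start from the ImageNet-pretrained weights.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Input layer: swap the 3-channel stem conv for a 1-channel (grayscale) one.
stem = model.features[0][0]
model.features[0][0] = nn.Conv2d(
    in_channels=1,
    out_channels=stem.out_channels,
    kernel_size=stem.kernel_size,
    stride=stem.stride,
    padding=stem.padding,
    bias=False,
)

# Output layer: swap the 1000-class ImageNet head for a 10-class one.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 10)

# Only these two replaced layers start from random weights; everything else
# keeps its pretrained ImageNet weights.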

Fun fact:

Did you know that YOLO segmentation model transfer learning works even if you keep the output size the same and your data uses only a small percentage of the output categories? The YOLO makers claim that if you train their model with more output neurons than the distinct classes in your data, their NN should automatically learn to output zeros for the unused output neurons. You can read the comment here. Really wild how much flexibility NNs give us.

Without going too deep into the implementation details, let's jump right into the theory.

Transfer learning is a very useful tool in AI, where a pre-trained model, used to solve a different, but similar enough, task is used as a base to train a model to solve the current task. So, in the case of transfer learning, to solve a specific dataset, you bring the architecture and weights from a different NN model that has been proven to work well on a similar task. This way, you can use its learnings from solving a different, but similar enough, dataset to adapt to working well on your current task/dataset. This is especially useful in cases where your dataset size is small or your current task is too complex to build a new NN architecture for and train from scratch.

Widely, there are two ways to go about transfer learning:

  1. Freeze the layers with weights from the pretrained model for the first few epochs and only train the new or the last few layers during this time.
  2. Don’t freeze layers, train every single parameter of the entire model right from the start.

The first way is recommended, as it prevents the knowledge (i.e. weights and biases we ported) from the pretrained model from evaporating completely due to the very large gradient that comes from the last few or randomly initialized layers, especially at the very beginning of training. The first few steps of training see a very large gradient being backpropagated throughout the entire network, since the weights were initialized randomly. Hence, the loss differential for those layers is very high at the start of training and gradually reduces as the neural network starts to learn and optimize itself.
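In PyTorch, the freeze-then-unfreeze recipe is just a matter of toggling requires_grad; a minimal sketch (the list of new layers and the train() helper are placeholders):

def freeze_pretrained(model, new_layers):
    """Freeze every parameter except those in the newly initialized layers."""
    new_params = {id(p) for layer in new_layers for p in layer.parameters()}
    for p in model.parameters():
        p.requires_grad = id(p) in new_params

def unfreeze_all(model):
    for p in model.parameters():
        p.requires_grad = True

# Phase 1: train only the new layers for a couple of epochs.
# freeze_pretrained(model, [model.features[0][0], model.classifier[1]])
# train(model, epochs=2)

# Phase 2: unfreeze everything and fine-tune the whole network.
# Re-create the optimizer afterwards so it covers all parameters.
# unfreeze_all(model)
# train(model, epochs=remaining_epochs)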

Also, during transfer learning the last layers are mostly modified, as the last layers are responsible for the final outputs and learn more dataset distribution-specific knowledge as compared to the initial layers, which learn usecase-specific knowledge. For example: In a dogs vs cats CNN, the final layers are responsible for differentiating between cats and dogs. Whereas, the initial layers learn image-specific knowledge and feature extraction, i.e. reduce the dimensionality of an image from its original size into just a few numbers which allows the final layers to classify.

Neural networks are all about traveling in a very high dimensional space (dimensionality of the space = number of trainable parameters in a model) to arrive at an optimal position where NNs are in a state that is useful to humans. So, freezing the initial layers for a few steps or epochs and only updating the final/random weight layer's neurons during this time, would allow our NN to come closer to the area where the initial pretrained model resided. We know that this area is good because our current use case and the pretrained model's use case are similar.

Now, once the frozen finetuned model comes near the optimal area after the first few steps/epochs, we unfreeze all layers and start training all the neurons together, so that the final model moves to the absolute best state for the current use case, while starting from an optimal state.

All this makes sense theoretically and thus, is the recommended way of finetuning any neural network. I mean freezing should never hurt model performance. Right? After unfreezing all the model layers, weights cannot be in a worse position as compared to just randomly initializing newly built layer weights.

Let's say we freeze the unmodified layer weights for two epochs, then at the start of the 3rd epoch, we unfreeze all model parameters. It basically means “epoch-2-freezing-way” is equivalent to “epoch-0-no-freezing-way” with just one exception, “epoch-2-freezing-way” has trained weights in the modified layers and “epoch-0-no-freezing-way” has random weights for the modified layers. However, the unmodified layer weights are exactly the same for both “epoch-0-no-freezing-way” and “epoch-2-freezing-way”.

So, we can say that any training is ultimately better than random initializations, or at least as good as random initializations. Because the trained weights are ultimately one special case of random initializations with a very low probability. That means, if you randomly initialize the weights for a large number of times, at some point you will automatically get the set of trained parameter values.

But do real-world experiments support this conclusion, that freezing is at least as good as non-freezing though it might take more epochs? What do you think?

If you HAVE an answer, then you are wrong. Let me explain.

If AI were anything like physics, yes, you would HAVE one answer, at least in most cases, unless you decide to tread into the quantum world where all your answers are just probabilities. But AI is not like physics; we do not fully understand AI. A lot depends on the dataset type, its size, and numerous other factors, and in a few use cases (this and this) the theoretically recommended way of transfer learning has been shown to worsen performance rather than improve it.

So does that mean the recommended way of transfer learning and the theory we discussed, so far, are wrong?

No, we will soon come up with another blog post on our real experiments to show the merits and demerits of a large variety of transfer learning methods. Also, we will talk about the process you should adopt to solve any problem with transfer learning, keeping practical model performance needs and available computing resources in context.

OpenAPI meets OpenAI: AI-Driven Dynamic Mock APIs
https://truesparrow.com/blog/openapi-meets-openai-ai-driven-dynamic-mock-apis/
Mon, 19 Feb 2024 06:07:02 GMT

In a fast-paced development environment, we often encounter various types of blocking dependencies within or between teams. One such dependency is getting the backend APIs ready to be integrated into consumer interfaces and flows.

Making backend APIs available on time becomes challenging, especially when both the frontend and backend teams are working simultaneously. So it becomes crucial to resolve the API dependency issue quickly to avoid unnecessary delays.


Traditional Solution

Traditionally, teams create dummy APIs to serve placeholder responses once API specifications are finalized.

However, this approach has its drawbacks. Firstly, implementing these dummy APIs adds unnecessary development effort and time. Secondly, it's difficult to create multiple placeholder responses to cover all possible scenarios, so teams often implement a single basic response. Finally, any changes in API specifications during development require extra effort to update the placeholder responses accordingly.

AI-Driven Solution

To address these challenges, we developed a solution: an intelligent code that dynamically generates placeholder responses using AI. This code serves as a link between the frontend and the non-existing backend APIs 😀, allowing backend developers to focus on the actual implementation.

The solution uses OpenAPI specifications to identify the required APIs and the expected responses. It then marries them with the OpenAI chat API to generate real-time placeholder responses.

Unlike traditional approaches that provide static responses, this solution generates dynamic responses. Hence, making it easy to serve placeholder responses for paginated APIs as well.

Implementation details

The implementation involves creating a wildcard route to capture any unimplemented request, extracting relevant details from predefined OpenAPI specifications, and generating prompts for AI-driven responses. The process is managed by a service that interacts with an AI model, ensuring that the responses align with the specification.
For this blog, I will provide implementation details using express.js. But we can use the same concept to achieve it in any programming language or framework.

Step 1: OpenAPI specification

Prepare the OpenAPI specifications for the required APIs. I have used JSON format for OpenAPI specs, but the YAML format can also be used.

Step 2: WildCard route

Add a wildcard GET route to the routes file. This wildcard route will capture any unimplemented/undefined APIs in the codebase. For simplicity, I have created a wildcard route for GET requests, but in the real world, we have to add wildcard routes for other HTTP methods too.

The wildcard route invokes a placeholderResponse service (implementation details in Step 4) for the response.

app.get("/*", async function (req, res) {
  const resp = await placeholderResponse(req.originalUrl, "GET");
  res.status(200).send(resp);
});


Step 3: Prompt Engineering

Prepare the OpenAI prompts instructing the model to generate the required response. Here is a sample prompt.

const userPrompt = (fileContent, route) =>
`${fileContent} \n`+
`Generate a mock response for the "${route}" endpoint using the above provided OpenAPI spec.`;

const systemPrompt =
`You are an expert in reading OpenAPI specs and generating mock data and responses for a given API. Your job is not to validate the request route and its parameters. You are not expected to return an error. Do not include any informative text or suggestions, statements like "Understood, ", "Certainly,", "In this mock response" etc. Reply with only a JSON response. \n`+
`You will be given an OpenAPI specification to understand API response format. Try to generate human-readable and random values in a mock response. Do not repeat values in response. \n`+
`If an attribute is an array, include 5 to 10 elements for it. \n` +
`If you have difficulty generating a response, please do not make up a response yourself; instead return "Something went wrong" \n`;

module.exports = { userPrompt, systemPrompt };


Step 4: Service Implementation

Finally, implement the placeholder response generation service. The key responsibilities of the service are:

  1. Generate the system and user prompts using the OpenAPI specification, request method, and URL
  2. Invoke the OpenAI chat completion API to generate a dummy placeholder response
  3. Return the response

const fs = require("fs").promises;
const OpenAI = require("openai"); // official OpenAI Node SDK (one possible choice)
const prompt = require("./PROMPT");

const openAPISpecFilePath = "/PATH/WHERE/OPENAPI/SPEC/SAVED";
const localDataStoreObject = {}; // can back the caching optimization described below

const openai = new OpenAI();

async function placeholderResponse(originalUrl, method) {
  // Read the OpenAPI spec and build the prompts for this request.
  const fileContent = await fs.readFile(openAPISpecFilePath, "utf8");

  const userPrompt = prompt.userPrompt(fileContent, method + " " + originalUrl);
  const systemPrompt = prompt.systemPrompt;

  // Call the OpenAI chat completion API with the above prompts to generate
  // a dummy placeholder response (sketched here with the openai Node SDK).
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
  });
  const dummyResponse = completion.choices[0].message.content;

  return dummyResponse;
}

module.exports = { placeholderResponse };


Finally: Optimizations

We know OpenAI API calls are a bit heavy on the wallet and take a few seconds to respond. That’s why using optimizations is always good. We can implement the following optimizations to minimize OpenAI model interactions:

  1. Caching: Implement a caching strategy for placeholder responses to avoid calling OpenAI every single time a new request is received.
  2. Tokens: OpenAI charges are based on the number of tokens used in the prompt and response. To reduce the prompt token size, break the OpenAPI specifications into smaller files grouped by related routes, and send only the relevant OpenAPI spec to OpenAI instead of one big file.
  3. Model: We have used the OpenAI GPT-3.5-turbo model to keep the cost low. But other cheaper LLM models or services can also be used.

We have seen significant time savings by using this solution. It saves us anywhere from 1-3 hours per route. So, if the project has 10 routes, we are easily saving 10-30 hours. Additionally, there are noteworthy time savings for the frontend team, as the approach helps mitigate post-integration issues, allowing them to develop their components with more realistic data.

We have used this solution for multiple client projects. And we love it.


AWS S3 presigned POST URL in Golang
https://truesparrow.com/blog/aws-s3-presigned-post-url-in-golang/
Thu, 15 Feb 2024 10:50:16 GMT

Allowing users to upload files on consumer applications is a common requirement. Services allow images, videos, CSV, and other file formats to be uploaded for further processing and usage. While designing the file upload solution, we should take into consideration the file content type, its size, caching strategy, and other security-related considerations.

While working on a client project, we had to allow users to upload their profile pictures from web and mobile applications. As the application was hosted on the AWS cloud, Amazon S3 was our storage solution. While designing the upload service, we cautiously decided to upload the files directly from the user's device to S3. And that’s exactly what the presigned URL is designed for. It’s also worth mentioning that the application was built using Golang.

After looking at the AWS Go SDK documentation, we discovered it only supports presigned PUT URLs, which had a couple of restrictions for our usage. In this blog, we'll explore the limitations of presigned PUT requests, the need for a more flexible solution, and how we overcame these challenges by implementing a secure solution using presigned POST requests in Go.

What were we trying to achieve?

The goal was to enable seamless file uploads from the client device to Amazon S3, ensuring file uploads are secure and controlled.

Presigned POST requests provide a way to generate URLs and fields for file uploads, allowing us to set restrictions such as file size limits and content type, while still ensuring that only authorized users can perform the upload.
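For a feel of what such a policy looks like, here is a hedged sketch using the AWS Python SDK (boto3), which does expose presigned POST out of the box; the bucket and key names are placeholders.

import boto3

s3 = boto3.client("s3")

# Presigned POST with a policy restricting the content type and file size.
presigned = s3.generate_presigned_post(
    Bucket="profile-pictures-bucket",        # placeholder bucket name
    Key="uploads/profile-123.jpg",           # placeholder object key
    Fields={"Content-Type": "image/jpeg"},
    Conditions=[
        ["starts-with", "$Content-Type", "image/"],    # any image type
        ["content-length-range", 1, 5 * 1024 * 1024],  # 1 byte to 5 MB
    ],
    ExpiresIn=3600,  # the URL and policy expire after one hour
)
# presigned["url"] and presigned["fields"] go into the client's multipart POST.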

What roadblock did we hit?

Presigned PUT URLs, while useful, pose some limitations. They include essential information like the signature and expiration date but lack built-in mechanisms for setting limits on the upload size. This limitation becomes apparent when trying to impose restrictions on the file size or specify multiple content types for a single presigned URL.

In contrast, the presigned POST approach offers greater flexibility by leveraging a policy mechanism. This mechanism empowers the signer to define rules, such as specifying supported file sizes or multiple content types. Unfortunately, while supporting presigned GET, PUT, and DELETE URLs, the AWS Go SDK lacks support for presigned POST URLs.

We found presigned POST implementations in AWS Ruby and Javascript SDKs. But very limited help was available for Go.

How did we solve it?

To overcome the limitations and implement a secure solution, we developed and released an open-source go-presigned-post package. This package provides a convenient way to generate presigned POST URLs and fields for file uploads to Amazon S3 using HTTP POST requests. It includes features such as specifying expiration times, file content types, ACLs, cache control, and more.

Key Features:

  • Presigned POST Support: Generate presigned POST URLs and fields for secure file uploads.
  • Flexible Policy Mechanism: Leverage a policy mechanism for setting rules such as file size limits and content types.
  • AWS SDK Integration: Seamlessly integrate with the AWS Go SDK for a comprehensive solution.


Here's a simplified example of how to use the go-presigned-post package:

The go-presigned-post package offers a flexible and secure solution for presigned POST URLs in Go. With its support for policy mechanisms, developers can define and enforce rules for file uploads. Its seamless integration with AWS Go SDK makes it simple enough to be integrated with any Golang application.

Looking forward to your feedback.

Insights from Serving and Hosting LLMs on Production
https://truesparrow.com/blog/insights-from-serving-hosting-llms/
Thu, 01 Feb 2024 12:45:43 GMT

OpenAI and similar services have democratized access to AI, making it more accessible to everyone without worrying too much about the technical complexities of hosting and serving.

On the other hand, newer and more advanced models are developed and released on platforms like Hugging Face, allowing them to tackle specific problems more effectively and quickly. However, deploying these models requires considerable research and exploration into how to set them up on cloud servers because of the computational needs and large model size.

We dealt with this challenge firsthand in one of our recent client projects. The project required fine-tuning the Llama2 model, evaluating the fine-tuned model, and serving and hosting it.

Serving: Hugging Face Transformers vs vLLM

Hugging Face Transformers
Since we were using llama-recipes for fine-tuning experiments, the inference script of llama-recipes was our primary choice for evaluating fine-tuned models. The script uses the Transformers library by Hugging Face. We know that the inference time also depends on the token size; in our case, prompts consisted of an average of 660 tokens. The script took an average of ~10.5 seconds for inferencing on a 24 GB single-GPU machine. 😱

[Chart: Inference time for Hugging Face Transformers]

It was a no-brainer that we cannot use this code for inference on production.

We started looking for alternative solutions to serve the model and came across the open-source vLLM library, which claimed to achieve 24x the LLM inference throughput of HuggingFace Transformers.

vLLM
vLLM is a fast and easy-to-use library that provides state-of-the-art throughput for LLM inference and serving. We replaced the Hugging Face Transformers library with vLLM in our fine-tuned model evaluation script. vLLM reduces the inference time through efficient memory management, optimized CUDA kernels, and various other optimization techniques. The script took an average of ~200 milliseconds for inferencing on the same 24 GB single-GPU machine. Hence, we achieved speed gains of up to 50x. 🕺

[Chart: Inference time for vLLM]
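For reference, vLLM's offline inference API is only a few lines; a minimal sketch (the model id and sampling settings here are placeholders, not our production values):

from vllm import LLM, SamplingParams

# Load the (fine-tuned) model once; vLLM manages GPU memory for you.
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model id

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["<your ~660-token evaluation prompt here>"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)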


vLLM also provided us with ready-to-use chat and completion API endpoints for real-time inferencing. It also supports continuous batching of incoming requests. When we ran the evaluation with 13k records, vLLM responded with inference results in batches of 10 records.

Although it does not have an inbuilt caching layer, it is easy to implement caching on top of vLLM to improve performance.
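For example, a tiny in-process cache keyed by the prompt is enough to skip repeated generations (a sketch, not what we shipped):

import hashlib

_cache = {}

def cached_generate(llm, prompt, sampling_params):
    """Return a cached completion for a repeated prompt, otherwise call vLLM.
    A real cache should also key on the sampling parameters."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        output = llm.generate([prompt], sampling_params)[0]
        _cache[key] = output.outputs[0].text
    return _cache[key]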

It proved to be the best fit for the project satisfying most of our requirements.

Hosting: Modal vs AWS vs Runpod
The next step was to host it on a cloud server. Below is an insight into the different hosting platforms we explored.

Modal
Modal is a pay-as-you-use, serverless hosting service. Serverless architecture provides the benefits of scalability and high availability with minimal operational overhead and efficient costs. We referred to the web endpoints document to host our code on Modal. However, it was not suitable for our use case as its cold start time was ~30 seconds.

AWS
We tried hosting the vLLM APIs on an AWS EC2 machine using the Docker Image provided by vLLM. However, the cost of the AWS GPU machine was high. So, at least for the MVP, we decided to go with a more cost-effective solution.

Runpod
We were already using Runpod to fine-tune LLMs. It provides GPU instances at a cheaper rate. These instances run on Docker, and hence the vLLM Docker image could not be used directly on the machine. We instead hosted the vLLM API as a standalone server using the supervisor daemon, a process control system. It served our purpose well for the MVP release by keeping the cost low.

(Btw, we did an interesting hackathon project for the Runpod team. Fun stuff!)

For the MVP, the expected traffic on the model server was low, so a single Runpod instance was able to handle it. As traffic grows, we will need a setup with multiple instances and perhaps auto-scaling, which is when we might plan to move to rented GPU machines.

Our hands-on experience with the vLLM library and its extensive integration capabilities has been great. Your needs for hosting and serving may vary, but we highly recommend giving it a try.

]]>
<![CDATA[Building AI-Powered Semantic Search]]>In today’s world, vast amounts of data are being generated and consumed online & having smart search functionality is a must. Traditional search techniques are keyword-based solutions and have limitations. Specialized search databases like ElasticSearch provide Fuzzy search allowing a bit of flexibility for typing mistakes. But is

]]>
https://truesparrow.com/blog/building-ai-powered-semantic-search/64a531ebfb617ef936f48634Fri, 19 Jan 2024 14:20:01 GMT

In today's world, vast amounts of data are being generated and consumed online, and having smart search functionality is a must. Traditional search techniques are keyword-based and have limitations. Specialized search databases like Elasticsearch provide fuzzy search, allowing a bit of flexibility for typing mistakes. But is that good enough?

Not for Jam, a decentralized social media platform built on the Farcaster protocol. This was in December 2022. Jam was one of our clients and while working on it, we quickly realized smart search was going to be the key differentiator for this product.

Keyword Search and its Limitations

Keyword search or lexical search is where you look for exact matches of the search term (and maybe allow for typos or plural cases). What keyword search doesn’t do is understand the query’s overall meaning.

If you search for “famous landmarks of Paris”, a result about the Eiffel Tower won't come up unless it contains those exact keywords.

That’s where semantic search or vector search comes in. It understands that the Eiffel Tower is in fact Paris’ most famous landmark.

[Image credit: Inspiration from OpenAI]

Vector search is an approach in information retrieval that uses numeric representations of content for search scenarios. These numeric representations, also called vectors, are N-dimensional arrays created using AI models, and they capture the meaning and semantics of the whole content. The AI model, also referred to as an embedding model, can transform complex unstructured data, including text and images, into vector representations.

Vector search uses approximate nearest neighbour (ANN) algorithms to find similar data and yields more relevant and precise results than traditional search.

[Image credit: Inspiration from OpenAI]
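To make this concrete, here is a minimal sketch of embedding a query and a few documents and ranking them by cosine similarity. The embedding model is the one we eventually chose (discussed below); the documents, query, and scoring code are illustrative.

# Minimal illustration of vector search: embed content and a query,
# then rank by cosine similarity. Documents and query are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

documents = [
    "The Eiffel Tower attracts millions of visitors every year.",
    "A recipe for sourdough bread with a crispy crust.",
]
query = "famous landmarks of Paris"

def embed(texts):
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)
query_vector = embed([query])[0]

# Cosine similarity between the query and each document.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])  # the Eiffel Tower sentence ranks first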

Let’s understand in detail how a search and recommendation solution was implemented in Jam using vector search.

Selecting Database

While there are several more vector databases to choose from now (including Redis), when we started on Jam in December 2022, the choice was between Weaviate and Pinecone. We settled on Weaviate for our use case. It is an open-source platform with pre-built modules for popular AI models and frameworks, enabling faster development.

💡
Btw, Weaviate has an absolutely stellar support team. Sila Uygun, one of their account managers who was always very prompt in answering our queries, deserves a shout-out.

Choosing an AI Model

We conducted a POC to evaluate various large language models (LLMs) that generate embeddings. Factors such as speed, cost, and accuracy were taken into consideration. Ultimately, OpenAI's text-embedding-ada-002 model emerged as the most suitable choice for our use case.

Integrating into the application

We used Weaviate's nearText method for vector search, which internally uses OpenAI to create the vectors. You only need to provide the text, and Weaviate converts the input text into a vector. After analyzing the results for different use cases, we set a distance threshold between 0.25 and 0.3 to filter out less relevant content.
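For illustration, a nearText query with the Weaviate Python client (v3-style) looks roughly like this. The class name ("Cast"), the properties, and the endpoint are hypothetical, not Jam's actual schema.

# Rough sketch of a nearText query with the Weaviate Python client (v3-style).
# The class name ("Cast"), properties, and endpoint are hypothetical.
import weaviate

client = weaviate.Client(
    url="https://<your-weaviate-instance>",
    additional_headers={"X-OpenAI-Api-Key": "<openai-key>"},  # lets Weaviate vectorize the query
)

result = (
    client.query.get("Cast", ["text", "author", "publishedAt"])
    .with_near_text({"concepts": ["famous landmarks of Paris"], "distance": 0.25})
    .with_limit(10)
    .do()
)
print(result["data"]["Get"]["Cast"])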

Advanced Search Features

We gave users precise control over search results with advanced sorting and filtering options. Users could sort content by relevance and engagement score, which took factors such as replies and likes into account. Filtering by published time allowed users to search within a specific timeframe.
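As a hedged sketch of how such a time filter can be combined with the semantic query (extending the earlier example; property names like publishedAt and engagementScore are hypothetical), with engagement-based sorting applied in application code:

# Illustration only: combining nearText with a published-time filter.
# Class and property names are hypothetical, not Jam's actual schema.
import weaviate

client = weaviate.Client(
    url="https://<your-weaviate-instance>",
    additional_headers={"X-OpenAI-Api-Key": "<openai-key>"},
)

result = (
    client.query.get("Cast", ["text", "engagementScore", "publishedAt"])
    .with_near_text({"concepts": ["famous landmarks of Paris"], "distance": 0.3})
    .with_where({
        "path": ["publishedAt"],
        "operator": "GreaterThanEqual",
        "valueDate": "2023-01-01T00:00:00Z",
    })
    .do()
)
casts = result["data"]["Get"]["Cast"]
# Sort by the internal engagement score in application code.
casts.sort(key=lambda cast: cast["engagementScore"], reverse=True)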


Custom Relevance Filtering

We built custom logic on top of the vector search to further refine our results. Several factors, such as text length, internal engagement score, and the presence of the search term, were taken into consideration. This additional layer of refinement ensured that users received the most relevant and meaningful results.
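The exact weights were tuned for Jam, but the shape of the logic was a straightforward re-ranking pass. A simplified, illustrative sketch (the weights and field names are made up):

# Simplified illustration of re-ranking vector search results.
# The weights and field names are made up; Jam's actual scoring logic differed.
def rerank(results, search_term):
    scored = []
    for item in results:
        score = 1.0 - item["distance"]            # closer vectors score higher
        score += 0.3 * item["engagement_score"]   # internal engagement signal
        if search_term.lower() in item["text"].lower():
            score += 0.2                          # boost exact presence of the search term
        if len(item["text"]) < 20:
            score -= 0.2                          # penalize very short content
        scored.append((score, item))
    return [item for _, item in sorted(scored, key=lambda pair: pair[0], reverse=True)]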

Recommendations with Twitcaster

We extended the search functionality to another feature called Twitcaster. Twitcaster intelligently suggested related users and content within the Farcaster network based on a user's recent tweets. The recommendation system enhanced the user experience by facilitating discovery within the platform. For each tweet, we used the nearText method to find the top 3 pieces of content, sorted by internal engagement score and within a vector distance of 0.3. The fetched results were then re-ranked and filtered based on the internal engagement score and their vector distance from the tweet.


Account for Downtimes

We faced occasional downtimes with OpenAI services due to the increased load on their system, especially during the early days. To mitigate such disruptions, we employed a fallback strategy. In case of downtimes, we utilized our primary database with limited search capabilities. This approach ensured that users could still access content even during periods of reduced AI service availability.
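In code, the fallback was conceptually as simple as catching the failure and switching to the limited search path. An illustrative sketch; both function names are hypothetical stand-ins.

# Illustrative fallback: if the vector search path (Weaviate + OpenAI) is unavailable,
# fall back to the limited search on the primary database.
# Both functions named here are hypothetical stand-ins.
def search(query: str):
    try:
        return vector_search(query)              # semantic search path
    except Exception:
        return limited_primary_db_search(query)  # degraded but always available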

Game Changer

In conclusion, using vector search at Jam proved to be a game-changer and resulted in Jam becoming the most widely used third-party app on Farcaster. It gave Jam a significant edge over other Farcaster apps that used traditional search methods.

It is important to note that vector search is not limited to these use cases; it can also be used for classification, clustering, anomaly detection, and more. It holds interesting opportunities that can be explored in the future. With rapid advancements happening in the field of AI, these solutions are only going to get better. It's crucial to embrace such innovations and stay relevant in the changing tech landscape.

]]>
<![CDATA[Supercharging LLM Fine-Tuning with WandB]]>LLMs (Large Language Models) have revolutionized the technology world over the past few years. All these LLMs are available in different configurations, some need high-end GPU instances while others can be used on smartphones. To meet your specific needs, you need to choose the right model, and sometimes need to

]]>
https://truesparrow.com/blog/supercharging-llm-fine-tuning-with-wandb/65a62c4a21b1f95df9cda24aWed, 17 Jan 2024 15:30:03 GMT

LLMs (Large Language Models) have revolutionized the technology world over the past few years. These LLMs come in different configurations: some need high-end GPU instances, while others can run on smartphones. To meet your specific needs, you have to choose the right model and sometimes fine-tune it before it can be used for inference.

While working on a recent AI project for one of our clients, Blockscout, we needed to choose and fine-tune an LLM. In the early days of this project, we faced problems maintaining multiple fine-tuning configurations and evaluating the fine-tuning results, such as losses at different epochs, resource utilization, and other key metrics. The problem compounded when the evaluation had to be performed across multiple experiments on different LLMs.

Fine-tuning Challenges

  • With multiple runs, it was becoming difficult to keep track of the configurations and hyperparameters (such as the number of epochs, learning rate, etc.), so we started maintaining these along with other logs in a Google Sheet. It was a cumbersome process and led to numerous manual errors. We needed an automated way to collect all these details and store them in one place.
  • The training_loss for each epoch was printed in the logs, and analyzing these became difficult as the number of epochs increased to 30. The best-fit model is produced by merging the base model with the checkpoint from the epoch with the lowest training_loss; therefore, being able to see these trends in graphs was crucial.
  • Each fine-tuning experiment took around 16-17 hours to complete. During one experiment, the fine-tuning failed at an intermediate stage due to an 'Out Of Memory' issue, and the machine remained idle for a long time. A real-time notification for such failures would have helped us save both time and machine costs.

We discovered several tools that could help us navigate these challenges and decided to go with WandB after analyzing a few.

What is WandB?

In their words: “Weights & Biases (WandB) is a Platform that helps AI developers build better models faster. Quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, and manage your ML workflows end-to-end.”

How did we use WandB?

  • We integrated WandB into our workflow with a few simple steps (a consolidated sketch of the full flow follows this list). To log in to WandB, we called the wandb.login() method with our access token.
  • To initialize a project on WandB, we used wandb.init(), which takes project_name, run_name, and the configuration as parameters. Using this, we were able to maintain configurations across multiple runs. This also helped us reproduce a fine-tuned model with the same configurations at a future point in time.
wandb.init(project=project_name, name=run_name, config=training_params)
  • We passed the training results of each epoch to the wandb.log() method, which visualizes the model results by plotting graphs (shown in the image below). These graphs helped us assess the model fit during fine-tuning and adjust the hyperparameter values for upcoming training runs.

    If the training_loss consistently decreased, it indicated that the model was learning effectively and was likely a good fit. Conversely, if the training_loss failed to decrease, the model might be under-fit, suggesting it hadn't learned the patterns well. On the other hand, if the validation_loss started increasing after a certain epoch (forming a V-shape on the graph) while the training_loss kept decreasing, it suggested overfitting, i.e., the model had learned the training data too well but might struggle with new data.
wandb.log({"training_loss": training_loss, "training_perplexity": train_perplexity, "validation_loss": validation_loss, "validation_perplexity": validation_perplexity})
[Image: WandB graphs of training and validation metrics across epochs]
  • WandB allowed easy integration with Slack for notifications related to the completion of fine-tuning or termination due to any error. We also used wandb.alert() to send custom alerts after each epoch completion.
[Image: Slack notification sent via WandB]
  • Managing multiple runs within a single project helped us compare fine-tuning results and configurations across different runs and pick the best model based on the results.
[Image: comparing multiple runs within a WandB project]
  • WandB also provided charts for system metrics like GPU utilization, network traffic, memory usage, etc., which helped us plan machine configurations for upcoming fine-tuning runs.
[Image: WandB system metrics charts]
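Putting the pieces together, our integration boiled down to a handful of calls. Below is a consolidated sketch; the project and run names, hyperparameter values, and the train_one_epoch() helper are illustrative stand-ins, while the wandb calls mirror what we used.

# Consolidated sketch of our WandB integration.
# Project/run names, hyperparameters, and train_one_epoch() are illustrative stand-ins.
import wandb

wandb.login(key="<wandb-access-token>")  # or set the WANDB_API_KEY environment variable

training_params = {"epochs": 30, "learning_rate": 1e-4, "batch_size": 4}  # illustrative values
wandb.init(project="llm-finetuning", name="run-01", config=training_params)

for epoch in range(training_params["epochs"]):
    # train_one_epoch() is a hypothetical stand-in for the actual training step.
    training_loss, train_perplexity, validation_loss, validation_perplexity = train_one_epoch(epoch)
    wandb.log({
        "training_loss": training_loss,
        "training_perplexity": train_perplexity,
        "validation_loss": validation_loss,
        "validation_perplexity": validation_perplexity,
    })
    wandb.alert(title="Epoch completed", text=f"Epoch {epoch} finished")

wandb.finish()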


WandB addressed the specific challenges we faced while fine-tuning the model, with minimal integration effort. Btw, we used their free tier, and that was sufficient for our needs.

What can we say? We have become WandB fanboys! It is a very useful platform that has grown into an essential part of our fine-tuning projects.

]]>
<![CDATA[Short-lived Preview Environments]]>https://truesparrow.com/blog/short-lived-preview-environments/63601c8d3e00004f19423c48Thu, 10 Aug 2023 07:49:34 GMT

We worked on Thursday with a remote team of 11 developers. Throughout this time, we were building at least 4 to 5 features simultaneously. For UX and QA reviews, we had only one staging environment, plus a local development environment for each developer.

The Challenges

When working with an agile methodology, the conventional staging-plus-local-development approach presented us with the following challenges, which were slowing down the overall pace of the team.

Limitations of using local development environment:

  • Asynchronous UX Review: The product and design teams could not review the frontend work asynchronously. This often caused delays in the feedback loop.
  • Maintaining Production-Quality Local Database: It was challenging to replicate and maintain a production-quality database locally, potentially leading to discrepancies and issues during testing.
  • Testing with Multiple Pods/Servers: The ability to test changes across multiple pods or servers was limited.
  • Facilitating Feature Testing by Stakeholders: Providing access to other stakeholders for testing the feature was impossible.

Limitations of using a single staging environment:

  • Multiple Features Simultaneously: When multiple features are being tested simultaneously in a single staging environment, it can lead to conflicts and challenges in isolating issues specific to each feature.
  • Unstable Code and Database: If unstable code is introduced during testing, it can potentially corrupt the database, causing data inconsistencies and impacting the overall testing process.
  • Early UX Feedback on Partially Complete Code: Obtaining early UX feedback on partially complete code in the staging environment can negatively affect the stability of the environment, making it harder to ensure reliable and accurate testing outcomes.

To address these challenges, we started exploring ways to let developers deploy any individual Git branch to a completely new environment. We call these "Preview Environments".

This was a task we took on during an internal hackathon at one of our in-person meetups.

Infrastructure Overview

Thursday is currently operating on a standard setup that comprises the following components:

  • Kubernetes (EKS) is utilized to run all the services. Auto-scaling is efficiently managed through the Horizontal Pod Scaler, which automatically adjusts the number of running pods based on the workload.
  • AWS Aurora stores and manages the data for the application.
  • ElastiCache (Redis) is employed for caching purposes, enhancing the performance and responsiveness of the application by reducing the need to fetch data from the database repeatedly.
  • AWS MQ (RabbitMQ) is responsible for managing the publish/subscribe pattern, enabling efficient communication between various components of the application.
[Image: infrastructure overview diagram]


In addition to the mentioned components, we are utilizing GitHub Actions and Helm to facilitate the deployment of our services to the Kubernetes Cluster.

Our Design Strategy

To avoid high costs, we chose not to create multiple replicas of our staging environment. Instead, we started by identifying reusable components.

We decided to reuse the Kubernetes cluster and create all preview environments as pods within a separate namespace. Leveraging the existing Helm chart, we could easily generate a new release for each new branch.

We decided to reuse Redis and RabbitMQ as well. For Redis, we were already adding a unique environment key as a prefix to all saved keys, which helped us avoid any cross-environment conflicts. For RabbitMQ, we created a new virtual host per environment, giving each one a dedicated space.
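For illustration, the Redis side of this is just a key-namespacing convention. A minimal sketch; the host and prefix scheme shown here are assumptions, not the exact implementation.

# Illustration of environment-prefixed Redis keys that keep preview
# environments from colliding. Host and prefix scheme are assumptions.
import os

import redis

ENV_PREFIX = os.environ.get("ENVIRONMENT_KEY", "preview-feature-login")
cache = redis.Redis(host="shared-redis-host", port=6379)

def cache_set(key, value, ttl=300):
    cache.set(f"{ENV_PREFIX}:{key}", value, ex=ttl)

def cache_get(key):
    return cache.get(f"{ENV_PREFIX}:{key}")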

Finally, for the database, we decided to write an AWS CDK script to launch a new RDS instance for every environment from a recent backup.
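As a rough illustration of the idea (our actual implementation is an AWS CDK script, not the code below), restoring a preview database from a recent snapshot boils down to calls like these; all identifiers are made up.

# Illustration only: spinning up a preview database from a snapshot via boto3.
# Our actual implementation is an AWS CDK script; identifiers here are made up.
import boto3

rds = boto3.client("rds")

# Restore an Aurora cluster from a recent staging snapshot.
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="preview-feature-login",
    SnapshotIdentifier="staging-recent-snapshot",
    Engine="aurora-mysql",
)

# Aurora clusters also need at least one instance attached to serve queries.
rds.create_db_instance(
    DBInstanceIdentifier="preview-feature-login-instance-1",
    DBClusterIdentifier="preview-feature-login",
    DBInstanceClass="db.t3.medium",
    Engine="aurora-mysql",
)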

Implementation Specifications

Whenever code is pushed to a Git branch with a name matching the pattern "preview-*", a pipeline is automatically triggered. This pipeline carries out the following steps in sequence:

  • Log in to AWS and get ECR credentials
  • Build a Docker image of the current branch
  • Push the image to ECR
  • Create a new RDS using a snapshot
  • Update environment variables
  • Create a helm release with the latest image and environment variables

The Helm release contains the ingress, deployments, service, and pod autoscaler. We use the following services to help automate the end-to-end creation of these environments:

  • Karpenter: auto-scales nodes when we do not have adequate capacity
  • Cert-manager: generates SSL certificates for the domain name
  • External DNS: automatically maps DNS records

The Benefits of Preview Environments

After introducing the solution, we saw many benefits while keeping our infra costs to a minimum. Here are a few of them:

  1. Allowed early feedback from various stakeholders
  2. Allowed early testing in an environment that closely resembles the production environment
  3. Helped identify and resolve potential issues early in the development process
  4. The staging environment became more stable

Preview environments have been a game-changer for us. Please feel free to connect with us to orchestrate the same for your product.

]]>