Co-Maintaining NIAH: A Journey in Open Source and Model Benchmarking

Recently, I took the opportunity to co-maintain an open-source repository called Needle in a Haystack (NIAH). NIAH is a benchmarking technique for Large Language Models (LLMs), in which the model is tasked with retrieving a specific piece of information (the "needle") placed inside a much larger context (the "haystack"). Originally authored by Greg Kamradt to test GPT-4 and Claude 2.1, the test has gained popularity, with Google and Anthropic adopting it for Gemini 1.5 and Claude 3, respectively. Here's a short video from Greg sharing his thoughts on the future of NIAH.
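To make the mechanics concrete, here is a minimal sketch of the test in Python. The function name, filler text, and needle are my own illustration, not the repository's actual API:

```python
def build_haystack(filler: str, needle: str, depth_percent: float) -> str:
    """Insert the needle into the filler text at a relative depth (0 to 100)."""
    position = int(len(filler) * depth_percent / 100)
    return filler[:position] + needle + filler[position:]

filler = "The grass is green. The sky is blue. " * 500  # stand-in for long documents
needle = "The secret passphrase is 'blue-harvest'. "
haystack = build_haystack(filler, needle, depth_percent=50)

prompt = (
    haystack
    + "\n\nWhat is the secret passphrase? Answer only from the context above."
)
# The model's answer is then scored (often by a judge LLM) on whether it
# recovered the needle at this context length and insertion depth.
```

Varying the context length and the needle's depth produces the familiar NIAH heatmap of retrieval accuracy.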

Embracing the New Opportunity

Greg needed help managing an increase in activity on his project, so he reached out on Twitter seeking co-maintainers. I saw this as a chance to learn and broaden my experience in the open-source world. Although I had contributed to many projects before, taking on a maintainer role was new to me. I quickly sent Greg a summary of my experience, and after a brief call, we discovered a shared passion for the project. We decided to collaborate on making the NIAH repository the go-to place for model benchmarks.

The First Ask

When I began my journey as a co-maintainer, the first task was deciding between two conflicting pull requests (PRs) aimed at solving the same problem. Let me walk you through the decision-making process.

The problem at hand was separating out the model interaction layer in NIAH, so that new models could be added for testing with minimal effort.

The first PR proposed an inheritance approach. It introduced a base class with an abstract model interaction layer, which was then implemented in child classes. However, this approach created an is-a relationship that felt unintuitive: a model provider is not a kind of test. It also limited future modifications, since every concern that needs to vary (the model today, the evaluator tomorrow) would have to be folded into the same inheritance hierarchy.
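Roughly, the inheritance approach looked something like this (illustrative names and stubbed responses, not the PR's actual code):

```python
from abc import ABC, abstractmethod

class NeedleHaystackTester(ABC):
    """Test harness with the model call left abstract."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Child classes implement the provider-specific API call here."""

    def run(self, prompt: str) -> str:
        return self.generate(prompt)

class OpenAITester(NeedleHaystackTester):
    def generate(self, prompt: str) -> str:
        return "stubbed OpenAI response"  # real API call would go here

class AnthropicTester(NeedleHaystackTester):
    def generate(self, prompt: str) -> str:
        return "stubbed Anthropic response"
```

Every new provider means subclassing the entire test harness, even though only the model call differs.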

The second PR, on the other hand, suggested a composition approach. It employed the strategy pattern to inject the model interaction layer into the test, making it more flexible and composable. This has-a relationship was more intuitive and scalable. It also addressed the need to separate the evaluator layer using the same approach, which I fully supported.
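In the same illustrative terms (again my own names, not the PR's actual code), the composition approach looks like this:

```python
from typing import Protocol

class ModelProvider(Protocol):
    """Strategy interface for the model interaction layer."""
    def generate(self, prompt: str) -> str: ...

class Evaluator(Protocol):
    """Strategy interface for scoring the model's response."""
    def score(self, response: str, needle: str) -> float: ...

class NeedleHaystackTest:
    """The test has-a provider and has-a evaluator, injected at construction."""

    def __init__(self, provider: ModelProvider, evaluator: Evaluator):
        self.provider = provider
        self.evaluator = evaluator

    def run(self, prompt: str, needle: str) -> float:
        response = self.provider.generate(prompt)
        return self.evaluator.score(response, needle)
```

Supporting a new model now means writing one small provider class and passing it in; the harness itself never changes, and providers and evaluators can be mixed freely.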

After discussing internally with Greg and Pavel (another co-maintainer), we agreed that the composition approach was the way to go. We thoroughly reviewed the second PR and provided feedback for necessary code changes. The developer was responsive and promptly implemented all the suggested modifications.

The Multi-Needle Enhancement

Both Gemini 1.5 and Claude 3 had used a multi-needle variant of the NIAH test to benchmark their models. Incidentally, Lance Martin (from LangChain) reached out to contribute in this direction. He implemented the multi-needle variant in NIAH and raised a PR, which I reviewed and suggested changes on. I enjoyed the review, and I also got to contribute a core change: making the distribution of needles uniform across the context, as sketched below.
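The idea behind a uniform distribution is to divide the context into equal intervals and place one needle at the start of each. A small sketch under my own naming, not the merged code:

```python
def uniform_depths(num_needles: int, start_depth: float = 0.0) -> list[float]:
    """Evenly spaced insertion depths (as percentages of the context)."""
    interval = (100.0 - start_depth) / num_needles
    return [start_depth + i * interval for i in range(num_needles)]

print(uniform_depths(3))                   # [0.0, 33.33..., 66.66...]
print(uniform_depths(4, start_depth=20))   # [20.0, 40.0, 60.0, 80.0]
```

Spacing the needles evenly avoids clustering them in one region of the context, which would otherwise skew the retrieval scores.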

Synergy with True Sparrow

At True Sparrow, we highly prioritize and support open-source contributions. In addition to our client projects, we actively pursue meaningful contributions in the AI/ML domain. Over the past year, we've been involved in various fascinating projects, and we're now delving deeper into core concepts like the transformer architecture and tokenization. Joining NIAH aligns well with this initiative.

Kedar Chandrayan

I focus on understanding the WHY of each requirement. Once this is clear, then HOW becomes easy. In my blogs too, I try to take the same approach.