In day-to-day software development, we use so many open-source libraries. In fact, we cannot even imagine working without some of them. The developer in me is very much inclined to contribute to open-source software, so that the likes of me will use my contribution all over the world.
My desire to contribute to open-source libraries seemed to become reality when True Sparrow (the company I belong to, my second home) encouraged team members to make contributions to open-source libraries.
To introduce myself, I have been developing software products from scratch for over a decade. In this time span, I have developed both proprietary and open-source products. But to contribute to someone else’s repository, was a new ball game overall, leaving behind the familiar waters.
Swimming in Unknown Waters
Contributing to open-source libraries was like unknown waters to me. Learning to swim here was a challenge. To contribute to already existing open-source libraries meant that my contribution should gel with the existing code style and practices.
I wanted my first contribution to be special. Especially, I wanted to contribute to some library which is extensively used. The most well-known libraries are in existence for a long time and are quite stable. So finding trivial issues in them is almost out of the question. For solving complex issues, you need to have a good understanding of the existing code.
On the surface, the water seemed calm and there were very few issues in famous libraries. When going deeper, I saw the undercurrents, some issues were related to security vulnerabilities too. Now I added one new self-criterion for my first issue, i.e. it should improve the library from a security perspective.
The Joy of Jumping In
I took the following approach to pick the issue, in this order:
- Decide on a famous library having weekly downloads more than 1M with at least more than 10 issues. Also the GitHub repository should have more than 1k stars.
- Going through the code and understanding it.
- Going through the issue to find a perfect fit as per my personal criterias.
Finalizing the library
It took substantial time to finalize the library - sanitize-html. Sanitizing HTML tags from user-entered data is needed to remove XSS vulnerability. sanitize-html is a famous NPM package for carrying out HTML sanitization having weekly downloads of over 1.5M. The GitHub repository has over 3.1k stars. It fitted well with my criterias.
Deep diving the code
When going through the code of sanitize-html, I saw that the changelog was strictly maintained. Readme had extensive documentation from a usage standpoint. I put in various debug logs and ran test code to gain a deeper understanding. Test cases were having good coverage. After swimming in the code, I felt confident that I have gained a good understanding of it and that the water was no longer unknown.
Finalizing the issue
With this raised confidence level due to code familiarity, I went through the issue list and found one issue which could result in unexpected sanitization results and thus a security vulnerability. Here is the link to GitHub issue.
Due to the newly gained code familiarity, I was able to find out the root cause of the issue. I had put in debug logs at places of concern and then ran the sample code.
Journey to the solution
About the issue
Invalid HTML having closing tags without matching opening tags was not getting stripped out by sanitize-html. As the tags were not stripped, the attacker can go on injecting such tags and this opens up the possibility of XSS vulnerability and unexpected results due to improper/missing sanitization. As the HTML is not valid, expected behavior is that the invalid markup should be stripped.
sanitize-html internally uses HTML parser for parsing the inputs and gets callbacks for an opening tag, closing tag, and text obtained while parsing. A stack of tags is maintained by adding at the time of callback for the opening tag and popping at the time of callback for the closing tag.
I addressed 2 cases of invalid markup, which are explained below.
Case 1: Closing tags without matching opening tag
For the case where the opening tag is not present and the closing tag is present, the HTML parser only gives a callback for the closing tag. To remove the invalid markup, we need to avoid any operation when such a closing tag callback is obtained.
For identifying this scenario, I checked whether the popped tag from the stack is the same as the tag for which the closing tag callback is obtained. If yes, then proceed. If no, then ignore and add the just now popped element back to the stack, as it will be used in some other closing tag callback. Thus by ignoring such invalid closing tag callbacks, they get stripped from the result. Adding to this, I also added a test case to prove that stripping of such invalid tags is happening. Here is the link to the commit.
Case 2: Opening tags without matching closing tags
If only the opening tag is present and the closing tag is absent, the HTML parser corrects it by providing a closing tag callback just after the opening tag callback. Similar callbacks are given for self-closing tags like <br>. I added a test case for this behavior and here is the link to the commit for the same.
Feeling of Accomplishment
Hurray! I learned how to swim in the unknown waters of open-source. I raised my first Pull request. After some discussion with the maintainers, it got merged to the main branch. I was happy to see myself as one of the contributors in the repository and my contribution became part of the 2.7.2 release, with my name in the release notes acknowledged for the contribution.
I would like to thank @boutel for actively reviewing my PR and giving valuable feedback on the same. Also, I would like to thank True Sparrow for encouraging team members to support and make contributions to the open-source libraries to give back to the community.
I hope you liked reading my blog describing the fun, journey and learnings involved in my first open-source contribution. Thank you!