November 10, 2024
5 min read

How Microsoft worked around a Git limitation to shrink a repository by 94%

Daniel Cranney

Imagine being responsible for a Git repository with 1,000 contributors and 20 million lines of code. Keeping up with the constant pull requests is hard enough, but the biggest problem is that the repository has mushroomed to over 170GB, making it close to impossible for people to contribute. Jonathan Creamer of Microsoft was in that position and discovered, with help from Git experts, that there is a weird limit in Git itself. By patching the code of Git, they managed to shrink the repository by 94%!

Microsoft’s Massive Monorepo

Managing a monorepo is no simple task. The one in question, colloquially known as 1JS, is a JavaScript code base where engineers maintain and improve features included in the Office online suite (Word, Excel and more).

With around 1,000 engineers contributing to 20 million lines of code across well over 2,000 packages, the scale is difficult to truly comprehend. A daily git pull often includes hundreds - or even thousands - of commits from the previous day.

As you can imagine, in this kind of environment, a minor mistake or issue can grow out of control and compound into something far more major in a relatively short period of time.

Finding a Fix

After noticing the repository's git size had slowly grown from an acceptable 2GB to over 170GB, Jonathan began investigating with the help of GitHub specialist Derrick Stolee, determined to get to the bottom of the issue.
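If you want to keep an eye on this kind of growth in your own repository, one option is to read the packed size that git count-objects -v reports. The Python sketch below is only an illustration: the helper names parse_size_pack_kib and pack_size_gib are ours, and it assumes the size-pack value is reported in KiB, as Git's documentation describes.

```python
import subprocess


def parse_size_pack_kib(count_objects_output: str) -> int:
    """Pull the size-pack value (in KiB) out of `git count-objects -v` output."""
    for line in count_objects_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "size-pack":
            return int(value.strip())
    raise ValueError("no size-pack line found")


def pack_size_gib(repo_dir: str) -> float:
    """Run git in repo_dir and return the packed object size in GiB."""
    result = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_size_pack_kib(result.stdout) / (1024 * 1024)


# Hypothetical sample output, used so the parser can be tried without a repo.
SAMPLE_OUTPUT = "count: 0\nsize: 0\nin-pack: 10\npacks: 1\nsize-pack: 2097152\n"
sample_gib = parse_size_pack_kib(SAMPLE_OUTPUT) / (1024 * 1024)
```

Logging that number once a day would be enough to spot a repository creeping from 2GB towards something far larger.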

Initial efforts to manage the bloat targeted versioning processes and redundant files, especially automated change files that had piled up by the thousands. Despite these light-touch changes, the repository continued to expand rapidly, indicating a deeper issue that needed investigation.

Discovering the 16-Character Git Limitation

By working with Derrick, Jonathan realised there was a fundamental issue in Git’s file-packing algorithm. When deciding which files were similar enough to compress against each other, the algorithm effectively only compared the last 16 characters of each file's path. This limitation became a problem for monorepos with numerous files sharing long, similar names, such as the CHANGELOG files found in every package in 1JS.

For instance, repo/packages/foo/CHANGELOG.json and repo/packages/bar/CHANGELOG.json end in the same characters, so Git would treat them as if they were the same file. As a result, it could end up comparing one package's changelog against another's and storing what amounted to full copies of files instead of small diffs. Even minor updates to a CHANGELOG file could result in nearly the entire file being stored again, and in a repository like 1JS with thousands of frequently updated CHANGELOGs, the extent of the git size growth was staggering.
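To make the collision concrete, here is a minimal Python sketch of a name hash of this kind. It is modelled on the pack_name_hash function in Git's source as we understand it, so treat the exact bit arithmetic as an assumption rather than a reference implementation. The function name name_hash and the README path are ours. Each new character pushes the older ones two bits to the right inside a 32-bit value, so anything more than roughly 16 characters from the end of the path has essentially no influence on the result.

```python
C_WHITESPACE = b" \t\n\r\x0b\x0c"


def name_hash(path: str) -> int:
    """Illustrative 32-bit name hash: later characters carry the most weight."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in C_WHITESPACE:
            continue  # whitespace does not contribute to the hash
        # Shift the older characters down two bits, then add the new
        # character into the top byte, keeping the value within 32 bits.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


foo = name_hash("repo/packages/foo/CHANGELOG.json")
bar = name_hash("repo/packages/bar/CHANGELOG.json")
readme = name_hash("repo/packages/foo/README.md")

print(f"foo:    {foo:#010x}")
print(f"bar:    {bar:#010x}")
print(f"readme: {readme:#010x}")
print("changelogs collide:", foo == bar)
```

In this sketch the two changelog paths end up with the same hash even though they live in different packages, while the differently named README does not. That is the kind of grouping that leads Git to treat unrelated files as candidates for the same diff.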

Implementing a Path-Based Repacking Solution

To address this, Jonathan employed a new Git feature with a custom repack option (--path-walk), enabling Git to evaluate entire paths rather than just the last 16 characters of filenames. Running git repack -adf --path-walk reduced the repository size from 178GB to 5GB, effectively eliminating file duplication by ensuring Git generated efficient diffs for files with similar names.

They also added a configuration option (pack.usePathWalk) to ensure this compression method would be used by default, preventing future bloat in similarly structured repositories. This was, of course, useful for Microsoft in solving the immediate issue, but also for other organisations who might have a similar problem without even realising it, much less knowing how to fix it.
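If you would rather script those two steps than type them by hand, a rough Python wrapper might look like the sketch below. The function names are hypothetical, the value true for pack.usePathWalk is our assumption about how the setting is switched on, and the commands will only work with a Git build that actually supports --path-walk. The wrapper defaults to a dry run that just prints the commands.

```python
import subprocess


def repack_commands(make_default: bool = True) -> list[list[str]]:
    """Build the git commands described above as argument lists."""
    commands = [["git", "repack", "-adf", "--path-walk"]]
    if make_default:
        # Assumption: the setting is a boolean that is enabled with "true".
        commands.append(["git", "config", "pack.usePathWalk", "true"])
    return commands


def repack_repo(repo_dir: str, dry_run: bool = True) -> None:
    """Print the commands, or run them inside repo_dir when dry_run is False."""
    for command in repack_commands():
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, cwd=repo_dir, check=True)


repack_repo(".")  # dry run: prints the two commands without touching the repo
```

A full repack of a very large repository can take a long time, so it is worth trying this on a fresh clone before pointing it at anything you care about.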

Broader Impact and New Git Tooling

Commands like git survey are currently in the works to help developers identify potential size bottlenecks, especially in files with long, repetitive names. This work not only optimized the 1JS monorepo but also contributed to the Git ecosystem, providing a scalable solution for developers working with large repositories worldwide.

The Role of Remote Work

With the 1JS team distributed all around the globe, you might be forgiven for thinking that this had in some way contributed to the issue occurring in the first place. Perhaps - you might think - issues like this are easier to identify and resolve when team members are in close proximity to one another and can easily ask for help from those around them.

In reality, the situation was quite the opposite. Team members located close to the GitHub servers (GitHub is owned by Microsoft, by the way) were less likely to notice the gradual slowdown caused by interacting with a constantly growing repository.

However, team members located further away from the servers - including Jonathan, who is located in the Southeastern state of Tennessee - were impacted by slow loading times each time they interacted with the repo on the command line.

Had the team been together in one office, the issue might have gone undiscovered for months or even years, so it's one more win for remote work (good job, Microsoft!).

Lessons to be Learned

Since finding the issue (and its solution), Jonathan has been speaking with other organisations like Loops, Shopify and Netflix to see whether their git sizes could be reduced in a similar way.

Now that Jonathan has brought the issue to the attention of the development community, it will no doubt be investigated by those looking to get a deeper understanding of the inner workings of repositories and git in general. As is so often the case, community collaboration will drive improvements that ripple through the development ecosystem.

With Git at the heart of most developers’ workflows, this case got us wondering whether we really understand its inner workings as well as we should, considering how integral it is to what we do. Jonathan’s experience - and support from a Git expert like Derrick - meant that they were able to resolve the issue, but a small team (or an individual) might not manage that without digging deeper into how Git works and the best practices for using it with vast repositories.

What Next?

Coffee with Developers - How a Small Team Shrank a Microsoft Monorepo by 94%

Learn more about this case by heading over to our Watch page, or listen as an audio podcast on Buzzsprout.

Jonathan writes about how to manage and maintain monorepos over on his blog, so be sure to subscribe there to pick up more tips and tricks in future.

Finally, don't forget to subscribe to the Dev Digest so you stay up-to-date with all of the latest tech news, tips and tricks, events, jobs and more.
