How did Kubernetes push git to its limit? Some stories from my time (2016-2019) working on Kubernetes.

No atomicity across subprojects – In 2016, Kubernetes was still a monorepo. Everything was developed in a single repository. This meant that developers could reuse CI infrastructure easily and ensure that changes to the different components (e.g., kube-proxy, kube-apiserver, kubelet) would work together.

However, downstream projects needed to build on the API. That meant vendoring parts of Kubernetes or separating API specifications from the code.

Transitioning to different subproject repositories wasn't easy. It happened gradually and painfully. The plan was to continue developing the libraries in the monorepo (under a special staging subfolder) and sync the changes to new project repositories. But, of course, this led to all sorts of problems – unreliable syncs, missing commit history, different commit SHAs, and more.

The solution might seem simple, but even simple problems become difficult at scale, especially when many different people and organizations are involved.

A system that could record atomic commits across projects or a better submodule experience would have allowed for more flexible developer organization, especially as the project grew to a new scale.
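To see why the synced repositories ended up with different commit SHAs, it helps to remember that git derives a commit's ID from its contents, including the tree and parent IDs. Here is a minimal sketch of that hashing; the tree and parent IDs, the identity line, and the commit message are all made up for illustration.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, parent: str, message: str) -> str:
    # Author and committer are fixed here so that only tree and parent vary.
    ident = "Dev <dev@example.com> 1500000000 +0000"
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n{message}\n"
    )
    return git_object_id("commit", body.encode())


# Hypothetical 40-hex IDs standing in for real trees and parents.
MONOREPO_TREE, MONOREPO_PARENT = "a" * 40, "b" * 40
SYNCED_TREE, SYNCED_PARENT = "c" * 40, "d" * 40

message = "Fix client retry logic"
in_monorepo = commit_id(MONOREPO_TREE, MONOREPO_PARENT, message)
in_synced_repo = commit_id(SYNCED_TREE, SYNCED_PARENT, message)

# Same logical change, different history around it, different SHA.
print(in_monorepo != in_synced_repo)
```

The same change, replayed on top of a different tree and parent, cannot keep its SHA, so any sync that rewrites history has to maintain its own mapping between the two.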

No benevolent dictator to merge patches – While the Linux kernel successfully scaled on git, Kubernetes had a different governing model. There was no Linus Torvalds to collect patches and manually apply them.

The first problem with this model was authorization. Who can merge what code? The team devised its own OWNERS files, a budget version of what GitHub later shipped natively as CODEOWNERS.
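A rough sketch of the idea behind per-directory ownership: each directory can list approvers, and a file is governed by the closest ancestor that has a list. The directory names, user names, and the "closest ancestor wins" rule are assumptions for illustration, not the exact semantics of the Kubernetes tooling.

```python
from pathlib import PurePosixPath

# Hypothetical ownership data: directory -> approvers.
# In a real repository these would live in files inside the tree.
OWNERS = {
    ".": {"root-approver"},
    "pkg/kubelet": {"alice", "bob"},
    "pkg/proxy": {"carol"},
}


def approvers_for(path):
    """Return the approvers of the closest ancestor directory that lists any."""
    directory = PurePosixPath(path).parent
    while True:
        key = str(directory)
        if key in OWNERS:
            return OWNERS[key]
        if key == ".":
            return set()  # reached the repository root without a match
        directory = directory.parent


def can_merge(changed_files, approved_by):
    """A change is mergeable once every touched file has at least one approver."""
    return all(approvers_for(f) & approved_by for f in changed_files)


change = ["pkg/kubelet/config/flags.go", "pkg/proxy/rules.go"]
print(can_merge(change, {"alice"}))           # False: nobody approved pkg/proxy
print(can_merge(change, {"alice", "carol"}))  # True
```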

The second problem was merging itself. Feature branches could pass tests independently but fail once merged one after another, because each had been tested against a base that no longer matched the main branch. The solution was a merge queue (now evolved into Prow, the Kubernetes testing infrastructure).

Native package management – The Kubernetes build system is Bash. The project experimented with Bazel but removed it (too complicated, bad developer experience).

The CI infrastructure does the right thing most of the time by linking artifacts (e.g., binaries, container images, test output) back to the source code (git commit SHA), but much of this work could be generalized. Where the Linux kernel ships relatively few artifacts (a single tar.gz), Kubernetes ships many different products – container images, binaries, API types and client libraries, and holistic system configurations (e.g., which versions of third-party dependencies like etcd to use).
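One way to picture the generalized version is a manifest that ties every shipped artifact, plus the pinned dependency versions, to a single source commit. The sketch below is only an assumption about what such a record could look like; the field names, artifact names, commit ID, and the `x.y.z` version are all placeholders.

```python
import hashlib
import json


def artifact_record(name, kind, content, commit):
    """Describe one artifact by content digest and the commit it was built from."""
    return {
        "name": name,
        "kind": kind,
        "sha256": hashlib.sha256(content).hexdigest(),
        "source_commit": commit,
    }


def build_manifest(commit, artifacts, dependencies):
    """Collect every artifact and pinned dependency for one release."""
    return {
        "source_commit": commit,
        "artifacts": [
            artifact_record(name, kind, content, commit)
            for name, kind, content in artifacts
        ],
        "dependencies": dict(dependencies),
    }


# Hypothetical commit ID and artifact contents (real ones would be build outputs).
commit = "f" * 40
manifest = build_manifest(
    commit,
    artifacts=[
        ("kubelet", "binary", b"placeholder binary bytes"),
        ("kube-apiserver", "container-image", b"placeholder image bytes"),
    ],
    dependencies={"etcd": "x.y.z"},  # placeholder version, not a real pin
)
print(json.dumps(manifest, indent=2))
```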

These features might seem out of scope for a version control system. Are they collaborative features?

Adding authorization gives versioned files ownership; does that make sense (it does for a filesystem)?

Integrating package management extends the scope of versioning past just source code: binaries, images, output, and more – but if that's how we mostly consume software, is that so bad?

Finally, collaborative features can seem like feature creep for a technical product. Yet, git already makes opinionated workflow decisions – merging, the lack of file locking, local-first optimizations, cheap branching over trunk-based development, and more. So why shouldn't a VCS embrace its role as a collaboration tool and explore more generic merge-based optimizations like a queue?