The Case for a Unified Language Tree in Monorepo Codebases

TL;DR

A proposal to structure a polyglot monorepo around a single, unified source tree that merges all languages and code types into one directory hierarchy. This approach optimizes for cohesion in polyglot projects, at the cost of breaking from convention and creating potential IDE friction.

Context

In a typical monorepo, code is sorted into top-level directories by language and purpose (e.g., production vs. test). For example, a project using Java and Python with Bazel would conventionally be laid out as follows:

java/com/starfleet/warpdrive/PlasmaManifold.java

java/com/starfleet/warpdrive/BUILD

javatests/com/starfleet/warpdrive/PlasmaManifoldTest.java

javatests/com/starfleet/warpdrive/BUILD

python/com/starfleet/warpdrive/core.py

python/com/starfleet/warpdrive/BUILD

tests/com/starfleet/warpdrive/core_test.py

tests/com/starfleet/warpdrive/BUILD

As a result, the code for a single logical component (i.e., a feature, library, or application) is fragmented across several directories and several build files, with the division starting at the very root of the repository.

Proposal

I propose that monorepo maintainers eliminate these top-level language and test directories and instead adopt a single merged source tree, where all code for a logical component is co-located, regardless of language or purpose. In this structure, the previous example simplifies to:

com/starfleet/warpdrive/PlasmaManifold.java

com/starfleet/warpdrive/PlasmaManifoldTest.java

com/starfleet/warpdrive/core.py

com/starfleet/warpdrive/core_test.py

com/starfleet/warpdrive/BUILD

This approach shifts the responsibility for separating code. Instead of using the directory structure to define compilation units, we use the build tool. Directories group by feature (the package), and the build tool groups by function (the target). The BUILD file in the example above would contain a target for each source file using the appropriate language-specific build rules.
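As a rough illustration, the merged BUILD file for the example package might look like the sketch below. The target names (plasma_manifold, core, and their tests) are hypothetical, and the load statements assume the rules_java and rules_python rule sets are available in the workspace; older Bazel versions provide these rules natively without them. Because the package no longer sits under a java/ or javatests/ root, the sketch sets test_class explicitly rather than relying on Bazel to infer it from the path.

```starlark
# com/starfleet/warpdrive/BUILD
# Sketch only: target names are illustrative, not from an existing repository.
load("@rules_java//java:defs.bzl", "java_library", "java_test")
load("@rules_python//python:defs.bzl", "py_library", "py_test")

# Production Java code for the component.
java_library(
    name = "plasma_manifold",
    srcs = ["PlasmaManifold.java"],
)

# Java test, co-located with the code it exercises.
java_test(
    name = "PlasmaManifoldTest",
    srcs = ["PlasmaManifoldTest.java"],
    # Set explicitly because the path has no java/ or javatests/ prefix.
    test_class = "com.starfleet.warpdrive.PlasmaManifoldTest",
    # Add test framework deps (e.g., JUnit) here if your setup requires them.
    deps = [":plasma_manifold"],
)

# Production Python code for the same component.
py_library(
    name = "core",
    srcs = ["core.py"],
)

# Python test, in the same directory and the same BUILD file.
py_test(
    name = "core_test",
    srcs = ["core_test.py"],
    deps = [":core"],
)
```

Each source file keeps its own target, so the separation between production and test code, and between Java and Python, is still enforced. It simply lives in the build graph instead of the directory layout.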

Rationale

This proposal is not a revolutionary paradigm but rather the logical extension of a well-established principle: feature-based packaging. This approach organizes code into a tree of highly cohesive components and promotes low coupling between them. It contrasts with layer-based packaging, which groups code into broad architectural layers (like ui, api, data) with low cohesion. The choice is between organizing by vertical feature slices (all code for one feature stays together) or by horizontal architectural layers (all code of one type stays together).

In a scaled monorepo, feature-based packaging provides the following advantages:

Conflict Reduction: Layer-based packaging forces unrelated projects into shared, high-traffic directories, which guarantees interpersonal and technological friction and can lead to a restrictive engineering monoculture as a mitigation strategy. In contrast, feature-based packaging provides separation between packages by keeping unrelated projects isolated, which mitigates conflict and permits the technological diversity required for innovation.

Productivity Improvement: This model improves productivity by co-locating all code related to a single task. Engineers spend less time navigating a sprawling repository and more time improving products. This mirrors workflow optimization in other engineering fields, where workspaces are arranged to minimize the wasteful "travel time" between stations.

Centralized Documentation: A fragmented structure complicates the placement of canonical resources (e.g., READMEs, design docs, guides). When a component is spread across multiple top-level directories, it becomes unclear where those resources belong. This forces a choice between replicating documents, which leads to synchronization failures, and linking between them, which adds unnecessary complexity and breaks down as maintainers add new information in different places. In contrast, the merged tree provides a single, obvious location for all of a component's assets, including documentation and other supplementary resources.

The convention of sorting code into top-level directories by language (java/, python/) is, by definition, a form of layer-based packaging. It forces unrelated components to co-locate simply because they share a language, while simultaneously splitting a single cohesive feature across multiple directories. By eliminating these language layers, we extend the benefits of feature-based packaging to the entire repository. In this model, a language is not a top-level container but simply an attribute of a component within a cohesive, polyglot package.

Side Note: Conceptually, the argument for a merged source tree has parallels with the common arguments for trunk-based development. Both practices aim to minimize divergence in source code, with trunk-based development focused on divergence over time, and this proposal focused on divergence over space.

Tradeoffs

Adopting this approach means accepting four tradeoffs:

1. Violates Conventions: The proposal eschews widely established conventions for project and package layout (e.g., the java/javatests split is ubiquitous in the Java ecosystem).
2. Creates a Learning Curve: Violating convention means engineers unfamiliar with the structure will face a learning curve to understand the unconventional layout.
3. Breaks IDE Integration: Violating convention may break or degrade the functionality of IDEs and their plugins, as they are often built upon conventions.
4. Language Incompatibility: The co-located structure may not be possible for all programming languages, depending on their toolchain constraints.

Tradeoff 1 is acceptable because innovation requires challenging the status quo. Many conventions predate modern polyglot tools, and while they create harmony, accepting them as a hard constraint on innovation is untenable. Principled and purposeful disruption must remain a priority for engineering.

Tradeoff 2 is acceptable because optimizing to avoid learning curves risks engineering becoming stuck in a local maximum. This structure improves long-term cohesion and simplicity at the cost of short-term discomfort, which is a necessary tradeoff in sustainable engineering.

Tradeoff 3 is acceptable because source code must drive tooling evolution, not the other way around. Avoiding the friction between new architectures and existing tools is a decision to prioritize the status quo over innovation, which in extreme cases locks organizations into legacy systems.

Tradeoff 4 is acceptable because perfection is not required, and all major languages in use today are compatible with this approach in Bazel. If a future toolchain requires a separate directory, it can be treated as a deliberate exception, not the default rule.

Conclusion

Extending feature-based packaging to the root of the repository improves cohesion, boosts productivity, and embraces the polyglot-first approach essential for modern software. It requires an engineering culture that values reasoning from first principles over rigid adherence to convention, and while it challenges the status quo, the alternative (a culture that avoids learning and change) is not built for long-term survival. I am adopting this approach today and recommend you do the same.
