Writing Bazel Rules for Interpreted Languages

Anthony A. Vardaro, Apr 2022

One of Bazel's greatest features is that it can support any language, allowing you to consolidate the build system into one tool. For big monorepos, it's incredibly useful for building complex, multilingual projects. This article highlights Bazel's flexibility in extending the build system with custom plugins written in Starlark. Rules are build-time handlers responsible for registering the actions that ultimately produce pre-defined build outputs. They are the functional intermediary between source code and build artifacts. We'll be writing custom rules to extend Bazel to support Python. Not that Python rules don't exist already (they do), but learning how to extend Bazel is both a valuable skill and kind of fun in small doses.

The goal here is to implement the executable py_binary rule, which produces a Python executable from a set of source files and dependencies. We'll also implement the py_library rule, for selecting and preparing source files for consumption by py_binary as a dependency. These rules are already built into Bazel's core, meaning they are written in Java. There's an ongoing effort to rewrite these rules, and other builtin rules, in Starlark in the interest of shrinking Bazel's core and welcoming more open-source collaboration. I'm choosing to emphasize writing rules for interpreted languages, like Python, as the process can be a little different than writing rules for static, compiled languages. We'll be working out of this example repository.

Concepts and Terminology

In the simplest terms, a build target is a mapping between a defined set of inputs and a defined set of outputs. A rule is simply the interface through which inputs are converted into outputs. Rules feel and behave like functions, as you would expect in programming or mathematics. In the implementation of a rule, a series of actions are registered to the action graph and triggered during the execution phase. Actions produce the defined outputs of a rule and are used to do things like writing a file or running a CLI tool: anything that Starlark itself can't do. When evaluating a build, Bazel can make clever assumptions about affected inputs to determine which outputs need to be reconstructed. It's partly what makes Bazel fast.

For interpreted languages, there's nothing to really build per se, as most things are deferred to runtime. It's a little annoying because you won't know if you've goofed the build until after you've built it. The tradeoff is that builds for interpreted languages tend to be faster, as there are fewer things to "build" and thus fewer actions to run. This is different from languages like Go, where a rule implementation would register actions that wrap go tool compile, or something like that. For interpreted languages, the role of a _binary rule is to construct a runfile tree and provide a bash script that launches the binary's entrypoint. A runfile is a file that a binary expects to be available at runtime. For Python, this is pretty much every .py source file.
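As a rough illustration, here's a Python sketch of how workspace-relative source paths land in a runfile tree for a binary named app. The paths and the helper are hypothetical, and real Bazel layouts also handle external repositories and symlinking, but the shape of the mapping is the same:

```python
# Hypothetical sketch: how workspace-relative paths map into a runfile tree.
# Not Bazel's actual implementation; illustrative only.
def runfiles_layout(binary_name, workspace_name, short_paths):
    """Map each runfile's workspace-relative path to its location
    under the <binary>.runfiles directory."""
    root = binary_name + ".runfiles"
    return {
        p: "{}/{}/{}".format(root, workspace_name, p)
        for p in short_paths
    }

layout = runfiles_layout(
    "app",
    "rules_py_simple",
    ["app/app.py", "app/lib/lib.py"],
)
# "app/app.py" maps to "app.runfiles/rules_py_simple/app/app.py"
```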

Writing a Binary Rule

Let's start by giving it a name:

# internal/py_binary.bzl
def _py_binary_impl(ctx):
    pass

py_binary = rule(
    implementation = _py_binary_impl,
    attrs = {},
    executable = True,

    # You pass a toolchain_type target here.
    # In this case, we use a builtin toolchain_type that was registered in a previous
    # blog post: https://www.anthonyvardaro.com/blog/hermetic-python-toolchain-with-bazel
    toolchains = ["@bazel_tools//tools/python:toolchain_type"],
    doc = "Builds a Python executable from source files and dependencies.",
)

Our rule is going to need srcs, deps, and data attributes, all of type label_list. Another useful bit for our use case is a label attribute to disambiguate which file should be the designated entrypoint, so we'll add a main attribute as well. Passing executable = True notifies Bazel that this rule builds an executable and that the rule is compatible with bazel run. The toolchain passed here is the toolchain_type I configured in a previous blog post, Hermetic Python Toolchain with Bazel.

Let's give our rule some attributes:

# internal/py_binary.bzl
def _py_binary_impl(ctx):
    pass

py_binary = rule(
    implementation = _py_binary_impl,
    attrs = {
        "srcs": attr.label_list(
            allow_files = True,
            doc = "Source files to compile",
        ),
        "deps": attr.label_list(
            doc = "Direct dependencies of the binary",
        ),
        "data": attr.label_list(
            allow_files = True,
            doc = "Data files available to binary at runtime",
        ),

        # This doesn't really need to be mandatory, I'm choosing to do so to keep the example simple.
        "main": attr.label(
            allow_single_file = True,
            mandatory = True,
            doc = "Label denoting the entrypoint of the binary",
        ),

        # Our rule is going to register an action to expand whatever template this attribute points to.
        "_bash_runner_tpl": attr.label(
            default = "@rules_py_simple//internal:py_binary_runner.bash.tpl",
            doc = "Label denoting the bash runner template to use for the binary",
            allow_single_file = True,
        ),

        # Bazel ships with a useful bash function for querying the absolute path to runfiles at runtime.
        "_bash_runfile_helper": attr.label(
            default = "@bazel_tools//tools/bash/runfiles",
            doc = "Label pointing to bash runfile helper",
            allow_single_file = True,
        ),
    },
    executable = True,

    # You pass a toolchain_type target here.
    # In this case, we use a builtin toolchain_type that was registered in a previous
    # blog post: https://www.anthonyvardaro.com/blog/hermetic-python-toolchain-with-bazel
    toolchains = ["@bazel_tools//tools/python:toolchain_type"],
    doc = "Builds a Python executable from source files and dependencies.",
)

The _bash_runner_tpl points to a file template that we'll write later in the post. At build time, our rule is going to register an action to expand said template into a default output of the rule. The _bash_runfile_helper is a Bazel-builtin library that provides a useful Bash function, rlocation, for querying runfiles from a Bash script. It's loosely documented here.

Let's declare our executable:

# internal/py_binary.bzl
def _py_binary_impl(ctx):
    """
    py_binary implementation

    Produces a bash script for launching a Python binary using the toolchain
    registered in "@bazel_tools//tools/python:toolchain_type".

    Args:
        ctx: Analysis context
    """
    executable = ctx.actions.declare_file(ctx.label.name)

As far as binary rules go, this is pretty standard. It notifies Bazel that we expect one of the actions registered in this rule to produce a file named after the target. ctx.actions.declare_file returns a File object that we'll eventually pass to ctx.actions.expand_template.

Our rule must be able to aggregate transitive dependencies, not just the direct ones. To do so, we accumulate and merge transitive runfiles objects found in ctx.attr.deps.

 # internal/py_binary.bzl
interpreter = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime.interpreter

# File targets to be included in runfile object
files = [
    ctx.file.main,
    executable,
    interpreter,
    ctx.file._bash_runfile_helper,
]

files.extend(ctx.files.srcs)
files.extend(ctx.files.data)

# Merge the current runfiles objects with all of the
# transitive runfile trees (all of which would return the requested DefaultInfo provider)
runfiles = ctx.runfiles(files = files)
runfiles = runfiles.merge_all([
    dep[DefaultInfo].default_runfiles
    for dep in ctx.attr.deps
])

We're constructing a localized runfiles object composed of the files that are immediately relevant to this target, and merging it with the transitive runfile trees. This merging happens at every step of the build, until the accumulated runfiles object reaches py_binary, its terminal state.

Because we specified @bazel_tools//tools/python:toolchain_type as a compatible toolchain in the rule definition, we're able to query that toolchain in ctx.toolchains and access the PyRuntimeInfo provider that it returns. This provides us with the filegroup target that points to the python3 executable we'll need to launch our binary. This also technically means that the Python interpreter itself must be a runfile, hence it's included in the files list.

The merge_all operation merges the transitive depset objects found in the DefaultInfo providers in ctx.attr.deps. A depset is used to prevent quadratic time and space usage when aggregating transitive closures. It's a special-purpose data structure unique to Bazel, designed for efficiently merging information accumulated across dependent build targets. Its usage is not limited to file targets; you can use a depset to manage any kind of information. The general rule of thumb: for information local to a rule, a list or a dict is safe to use; for information passed to other rules via a provider, a depset is preferred.
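To see why this matters, here's a rough Python sketch (emphatically not Bazel's actual depset implementation) of the idea: each node holds only references to its children instead of copying every transitive element, so each merge is cheap and flattening is deferred until the full list is actually needed:

```python
# Illustrative sketch of depset-style sharing; not Bazel's real depset.
class FakeDepset:
    def __init__(self, direct, transitive=()):
        self.direct = list(direct)          # copies only the direct elements
        self.transitive = list(transitive)  # holds child references, no copies

    def to_list(self):
        # Flatten the whole tree once, deduplicating along the way.
        seen, out, stack = set(), [], [self]
        while stack:
            node = stack.pop()
            for f in node.direct:
                if f not in seen:
                    seen.add(f)
                    out.append(f)
            stack.extend(node.transitive)
        return out

# Each "merge" just wraps child nodes; no per-element copying happens here.
lib_a = FakeDepset(["a.py"])
lib_b = FakeDepset(["b.py"], transitive=[lib_a])
binary = FakeDepset(["main.py"], transitive=[lib_b])
# binary.to_list() flattens once: ["main.py", "b.py", "a.py"]
```

Flat lists, by contrast, would re-copy a.py at every level of the dependency chain, which is where the quadratic blowup comes from.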

All that's left to do is to register an action that creates the executable file we declared earlier. Using string substitution, we'll expand the provided Bash template with information regarding the path to ctx.file.main and the interpreter. The Bash runner will spawn a Python process with the entrypoint, effectively doing something like python3 main.py with the hermetic runtime.

# internal/py_binary.bzl
entrypoint_path = "{workspace_name}/{entrypoint_path}".format(
    workspace_name = ctx.workspace_name,
    entrypoint_path = ctx.file.main.short_path,
)

interpreter_path = interpreter.short_path.replace("../", "")

substitutions = {
    "{entrypoint_path}": entrypoint_path,
    "{interpreter_path}": interpreter_path,
}

ctx.actions.expand_template(
    template = ctx.file._bash_runner_tpl,
    output = executable,
    substitutions = substitutions,
)

File paths passed to the Bash launcher must be in the format workspace_name/path, hence we have to chop the ../ prefix from the interpreter path, as its short_path is relative to the runfiles root.
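With hypothetical values plugged in, the path munging boils down to plain string manipulation (the workspace and repo names below are examples, not required values):

```python
# Sketch of the substitution inputs, using hypothetical paths.
workspace_name = "rules_py_simple"
main_short_path = "tests/py_binary/app/app.py"
# short_path for files from an external repo starts with "../<repo_name>/".
interpreter_short_path = "../py_darwin_x86_64/bin/python3"

entrypoint_path = "{}/{}".format(workspace_name, main_short_path)
interpreter_path = interpreter_short_path.replace("../", "")

# entrypoint_path  -> "rules_py_simple/tests/py_binary/app/app.py"
# interpreter_path -> "py_darwin_x86_64/bin/python3"
```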

As a standard Bazel practice, our py_binary rule should return a DefaultInfo provider for any rules that consume this one. _binary targets don't often become dependencies of other rules, but it does happen.

 # internal/py_binary.bzl
return [
  DefaultInfo(
      executable = executable,
      runfiles = runfiles,
  ),
]

This is pretty much saying: "for any downstream consumers of me, here are the files I've created." In our case, that's a Bash executable and some Python files. Hypothetically, another rule could consume this provider and register an action using the executable created by this py_binary.

And that's pretty much it! You can view the whole thing here.

Writing the Bash Launcher

We need to implement the Bash template specified in the _bash_runner_tpl attribute of our py_binary. Using @bazel_tools//tools/bash/runfiles, the launcher is a pretty easy one-liner built on the rlocation function.

# internal/py_binary_runner.bash.tpl
#!/usr/bin/env bash

# --- begin runfiles.bash initialization v2 ---
# Copy-pasted from the Bazel Bash runfiles library v2.
set -uo pipefail; f=bazel_tools/tools/bash/runfiles/runfiles.bash
source "${RUNFILES_DIR:-/dev/null}/$f" 2>/dev/null || \
source "$(grep -sm1 "^$f " "${RUNFILES_MANIFEST_FILE:-/dev/null}" | cut -f2- -d' ')" 2>/dev/null || \
source "$0.runfiles/$f" 2>/dev/null || \
source "$(grep -sm1 "^$f " "$0.runfiles_manifest" | cut -f2- -d' ')" 2>/dev/null || \
source "$(grep -sm1 "^$f " "$0.exe.runfiles_manifest" | cut -f2- -d' ')" 2>/dev/null || \
  { echo>&2 "ERROR: cannot find $f"; exit 1; }; f=; set -e
# --- end runfiles.bash initialization v2 ---

$(rlocation "{interpreter_path}") $(rlocation "{entrypoint_path}") "$@"

The first bit is boilerplate needed to source the runfiles.bash library, copied from here. Using rlocation we can query the absolute path of interpreter_path and entrypoint_path templated by our py_binary. After string substitution, that line of code will look something like this:

$(rlocation "py_darwin_x86_64/bin/python3") $(rlocation "rules_py_simple/tests/py_binary/src_files_test/src_files_test.py") "$@"

Writing a Library Rule

After writing py_binary, writing py_library becomes trivial:

 # internal/py_library.bzl
def _py_library_impl(ctx):
    files = ctx.files.srcs + ctx.files.data

    runfiles = ctx.runfiles(files = files)
    runfiles = runfiles.merge_all([
        dep[DefaultInfo].default_runfiles
        for dep in ctx.attr.deps
    ])

    return [
        DefaultInfo(
            runfiles = runfiles,
        ),
    ]

py_library = rule(
    implementation = _py_library_impl,
    attrs = {
        "srcs": attr.label_list(
            allow_files = True,
            doc = "Source files to include in the library",
        ),
        "deps": attr.label_list(
            doc = "Direct dependencies of the library",
        ),
        "data": attr.label_list(
            allow_files = True,
            doc = "Data files available to the library at runtime",
        ),
    },
    doc = "Collects Python source files and dependencies for consumption by py_binary.",
)

The py_library rule isn't doing anything py_binary doesn't do already.

It captures the transitive runfiles objects, merges them with a local one, and passes the result onward via the DefaultInfo provider. Technically this rule has no artifacts, because it doesn't register any actions, meaning that if you bazel build a py_library target, nothing would happen. This rule just collects and propagates information to py_binary (or any other downstream consumers; the rule is mostly generic). If you wanted to get fancy, you could write your own custom provider that this rule returns, allowing you to pass custom build metadata across targets.

A benefit of using py_library (as opposed to globbing everything in one target) is that it makes the build graph more granular, allowing Bazel to make more effective caching decisions at build time. For interpreted languages, incremental builds are not particularly useful unless you're building dependencies that require expensive actions.

You can find the full implementation here.

Using Our Rules to Build a Target

To test that everything works together, let's put together a small app that imports a small Python library. Using the Bazel rules we wrote, we can write build targets to articulate the dependency graph of the project and build our app.

Let's put together a small library that exports a single function:

# tests/py_binary/app/lib/lib.py
def foo():
    return "foo"

# tests/py_binary/app/lib/__init__.py
from .lib import foo

And build it with the py_library we wrote:

 # tests/py_binary/app/lib/BUILD.bazel
load("@rules_py_simple//:defs.bzl", "py_library")

py_library(
    name = "lib",
    srcs = [
        "__init__.py",
        "lib.py",
    ],
    visibility = ["//visibility:public"],
)

And a small app that imports the library...

 # tests/py_binary/app/app.py
import sys
import lib

if __name__ == "__main__":
    print("runtime: ", sys.version)
    print("lib.foo(): ",  lib.foo())

And the py_binary target for it:

 # tests/py_binary/app/BUILD.bazel
load("@rules_py_simple//:defs.bzl", "py_binary")

py_binary(
    name = "app",
    main = "app.py",
    visibility = ["//visibility:public"],
    deps = [
        "@rules_py_simple//tests/py_binary/app/lib",
    ],
)

Building our target produces the app artifact, which is the Bash launcher we wrote.

$ bazel build //tests/py_binary/app
INFO: Analyzed target //tests/py_binary/app:app (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //tests/py_binary/app:app up-to-date:
  bazel-bin/tests/py_binary/app/app
INFO: Elapsed time: 0.088s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action

Running the binary, things work like expected:

$ bazel-bin/tests/py_binary/app/app
runtime:  3.10.0 (main, Feb 27 2022, 08:13:44) [Clang 13.0.1 ]
lib.foo():  foo