Hermetic Python Toolchain with Bazel

Anthony A. Vardaro, Dec 2021

Motivation

The primary benefit of implementing your own Python toolchain is that it lets you escape the dreaded autodetecting toolchain that ships with Bazel. This toolchain is not hermetic: it depends on the host machine's installation of Python, whatever and wherever that might be. The autodetecting toolchain consults the host machine's $PATH, which differs between execution platforms and causes flakiness in which version of Python runs your code.

The goal here is to implement our own hermetic Python toolchain, so that everyone is using the same version of Python for building, testing, and running their code. We'll do this by writing a repository rule that downloads a pinned Python distribution and generates a BUILD.bazel that declares a constrained toolchain target. We'll be working out of this example repository.

Concepts and Terminology

If you're not familiar with Bazel's toolchain ecosystem, the official documentation on Toolchains and Platforms is worth your time.

A platform is a named collection of system constraints that describe where code is intended to run. This is done by creating a platform rule. At runtime, Bazel identifies three platform targets, which serve different purposes. The host platform is the platform on which Bazel is running. For me, it'd be my laptop. The execution platform is the platform on which Bazel executes actions. For me, it's still my laptop. If I were to use remote execution, the execution platform would be the platform of the execution environment, while the host platform would still be my laptop. The target platform is the platform on which the final build is intended to run. If I were cross-compiling a target from a darwin host machine to a linux container, the host platform would be my laptop and the target platform would be the linux container.

A constraint is an individual criterion that a platform must satisfy. Remember, a platform is simply a named collection of these constraints. Constraints allow you to add granularity to your platform definitions. Bazel ships with predefined constraint values under @platforms, which is what we'll be using here.
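For example, a custom platform is just a platform rule grouping constraint values. Here's an illustrative sketch using Bazel's built-in @platforms constraint values (we won't actually need to define our own platform in this post):

```python
# BUILD.bazel (illustrative sketch)
platform(
    name = "linux_x86_64",
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
)
```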

A toolchain is a target that bundles a toolchain implementation, a toolchain type, and a set of supported platform constraints.

A repository rule is a Bazel rule that registers an external repository in the Bazel workspace. It can only be called from the WORKSPACE file, and it provides access to non-hermetic actions in the loading phase, which are sometimes necessary when building workspace dependencies. The hermetic Python toolchain we're building here is implemented as a repository rule, and it will generate a unique Bazel repository for each toolchain.

How does this fit together? At runtime, Bazel identifies the registered toolchains, the target platform, and the available execution platforms. Using the constraints declared on the toolchains and on the platforms, Bazel attempts to select a suitable toolchain for each required toolchain type. I recommend building something with --toolchain_resolution_debug=True set, as it will help you conceptualize Bazel's fascinating toolchain resolution process.

Writing our BUILD File Template

Our repository rule is going to download a Python distribution archive and generate a BUILD.bazel file that gets injected at the workspace root of the Python repository. Before implementing our repository rule, we first need to create the framework of the build file that's dropped in the Python workspace root. This is where we also specify our py_runtime, py_runtime_pair, and toolchain targets.

Let's write some filegroup targets for clustering all the files we find in the Python distribution.

# internal/BUILD.dist.bazel.tpl
filegroup(
    name = "files",
    srcs = glob(["install/**"], exclude = ["**/* *"]),
    visibility = ["//visibility:public"],
)

filegroup(
    name = "interpreter",
    srcs = ["{interpreter_path}"],
    visibility = ["//visibility:public"],
)

We'll pass these filegroup targets to the py_runtime target later. You'll notice the use of string substitution templates, namely {interpreter_path}. When our repository rule implementation runs during the loading phase, our named substitutions get injected into the BUILD.bazel file that this template generates.
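The substitution itself is simple string replacement. Here's a quick Python sketch of the same idea (the template text below is a hypothetical fragment, not the full file):

```python
# Minimal sketch of the string substitution ctx.template() performs.
# The template text is an illustrative fragment of BUILD.dist.bazel.tpl.
template = """filegroup(
    name = "interpreter",
    srcs = ["{interpreter_path}"],
)
"""

substitutions = {"{interpreter_path}": "bin/python3"}

rendered = template
for placeholder, value in substitutions.items():
    rendered = rendered.replace(placeholder, value)

print(rendered)
```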

Let's define our runtime and toolchain targets.

# internal/BUILD.dist.bazel.tpl
load("@bazel_tools//tools/python:toolchain.bzl", "py_runtime_pair")

filegroup(
    name = "files",
    srcs = glob(["install/**"], exclude = ["**/* *"]),
    visibility = ["//visibility:public"],
)

filegroup(
    name = "interpreter",
    srcs = ["{interpreter_path}"],
    visibility = ["//visibility:public"],
)

# The py_runtime target denotes a platform runtime or a hermetic runtime.
# The platform runtime (system runtime) by its nature is non-hermetic.
# This py_runtime target is for our hermetic Python.
py_runtime(
    name = "py_runtime",
    files = [":files"],
    interpreter = ":interpreter",
    python_version = "PY3",
    visibility = ["//visibility:public"],
)

# A py_runtime_pair is used to couple hermetic Python2 and Python3 runtimes into a toolchain.
# We're not supporting py2, hence we pass None.
py_runtime_pair(
    name = "py_runtime_pair",
    py2_runtime = None,
    py3_runtime = ":py_runtime",
)

toolchain(
    name = "toolchain",
    exec_compatible_with = [
        {constraints},
    ],
    target_compatible_with = [
        {constraints},
    ],
    toolchain = ":py_runtime_pair",

    # We're just using the builtin Python toolchain type. 
    # A toolchain_type is simply a name that describes the type of the toolchain.
    # We could define our own toolchain_type but there is no need to for this use case.
    toolchain_type = "@bazel_tools//tools/python:toolchain_type",
)

Here we create a py_runtime target that accepts the filegroup targets described above. The py_runtime is a special target for describing a Python runtime, regardless of whether it's a system runtime or a hermetic runtime. In our case, we're describing a hermetic runtime (it's pretty much the whole point of this article).

We then take our py_runtime and pass it to a py_runtime_pair target. The py_runtime_pair rule is a rule for coupling a Python2 and Python3 runtime into the same toolchain. It returns a ToolchainInfo provider, which is an object propagated to rules that consume this toolchain. Because this is a hermetic Python3 implementation, we can pass None to the py2_runtime.
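Conceptually, py_runtime_pair boils down to an ordinary Starlark rule that wraps the runtimes in a ToolchainInfo provider. A simplified, hypothetical sketch of the idea (the real implementation lives in @bazel_tools and does more validation and provider plumbing):

```python
# Illustrative sketch only -- not the actual py_runtime_pair implementation.
def _py_runtime_pair_impl(ctx):
    # Bundle the runtimes into a ToolchainInfo so that rules consuming
    # this toolchain (e.g. py_binary) can retrieve them during analysis.
    return [platform_common.ToolchainInfo(
        py2_runtime = None,
        py3_runtime = ctx.attr.py3_runtime,
    )]
```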

Now we can pass our runtime pair to a toolchain target. This target describes a toolchain's compatible execution and target platforms, and its toolchain type. The toolchain type is simply an alias for naming certain kinds of toolchains. You can implement your own toolchain_type, but the one that ships with Bazel is sufficient for us. Again, we are taking advantage of string substitution to inject the compatible OS and CPU constraints.

That's pretty much it for our BUILD.bazel file template. Now we can implement our supporting repository_rule.

Writing our Repository Rule

For our toolchain, we're going to implement a repository_rule that downloads a Python distribution and generates a BUILD.bazel file that describes a toolchain.

Let's describe our repository rule, starting by just giving it a name.

# internal/python_interpreter.bzl
def _py_download(ctx):
    """
    Downloads a Python distribution and registers a toolchain target.
  
    Args:
        ctx: Repository context.
    """
    pass

py_download = repository_rule(
    implementation = _py_download,
    attrs = {},
)

The repository_rule function is a special builtin function for declaring a new repository rule. We're describing a new rule, named py_download, which will generate Python toolchain targets from a Python distribution. The implementation argument passes a callback function, _py_download, which implements the rule. The repository context gets passed to the implementation function, and it provides an API for the non-hermetic operations a repository rule is allowed to do. Unlike conventional Bazel rules, repository rules do not have a return value; their purpose is typically some variation of downloading an archive and generating a BUILD.bazel for it. The attrs argument takes a dictionary describing the attributes of the rule. Right now, it doesn't have any.

We can now call this from WORKSPACE.

# WORKSPACE
workspace(
    name = "rules_py_simple",
)

load("@rules_py_simple//internal:python_interpreter.bzl", "py_download")

py_download(
    name = "my_py_download",
)

Our rule doesn't do anything yet, let's give it some attributes.

# internal/python_interpreter.bzl
def _py_download(ctx):
    """
    Downloads a Python distribution and registers a toolchain target.
  
    Args:
        ctx: Repository context.
    """
    pass

py_download = repository_rule(
    implementation = _py_download,
    attrs = {
        "urls": attr.string_list(
            mandatory = True,
            doc = "String list of mirror URLs where the Python distribution can be downloaded.",
        ),
        "sha256": attr.string(
            mandatory = True,
            doc = "Expected SHA-256 sum of the archive.",
        ),
        "os": attr.string(
            mandatory = True,
            values = ["darwin", "linux", "windows"],
            doc = "Host operating system.",
        ),
        "arch": attr.string(
            mandatory = True,
            values = ["x86_64"],
            doc = "Host architecture.",
        ),
        "_interpreter_path": attr.string(
            default = "bin/python3",
            doc = "Path you'd expect the python interpreter binary to live.",
        ),
        "_build_tpl": attr.label(
            default = "@rules_py_simple//internal:BUILD.dist.bazel.tpl",
            doc = "Label denoting the BUILD file template that gets installed in the repo.",
        )
    },
)

We've created an attribute set for our users to describe their toolchain. All they need to do is provide our rule with mirror URLs, a digest, an OS, and a CPU architecture, and our rule handles the heavy lifting of generating the necessary toolchain bits.

The compatible OS platforms that this rule can support are restricted to darwin, linux, and windows. The compatible CPU architectures are restricted to x86_64. Our implementation function takes these attributes and generates compatibility constraints for our toolchain.

You'll notice the rule comes with the private attributes _interpreter_path and _build_tpl. The _interpreter_path attribute describes the path to the Python interpreter binary. In most cases, it lives in bin/python3, hence the attribute's privacy and default value. The _build_tpl attribute describes the base template of the injected BUILD.bazel. We wrote our own, but this gives users the option to patch what gets injected in the repository.

Now let's implement our callback function, starting by downloading the provided Python distribution.

# internal/python_interpreter.bzl
def _py_download(ctx):
    """
    Downloads a Python distribution and registers a toolchain target.

    Args:
        ctx: Repository context.
    """
    ctx.report_progress("downloading python")
    ctx.download_and_extract(
        url = ctx.attr.urls,
        sha256 = ctx.attr.sha256,
        stripPrefix = "python",
    )

    return None

Pretty self-explanatory! The repository_ctx object comes pre-baked with a bunch of helper functions that help us achieve what we're trying to do here. In this case, we can just give download_and_extract() the mirror URLs and expected digest, and it will extract the Python distribution into the repository's workspace root.
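The digest check that download_and_extract() performs is conceptually just hashing the downloaded bytes and comparing them against the expected SHA-256 sum. A small Python illustration of the idea (the payload here is a stand-in, not a real archive):

```python
import hashlib

# Sketch of the integrity check download_and_extract() performs:
# hash the downloaded bytes and compare against the expected digest.
payload = b"cpython-3.10.0 archive bytes (stand-in)"
expected = hashlib.sha256(payload).hexdigest()

def verify(data: bytes, sha256: str) -> None:
    actual = hashlib.sha256(data).hexdigest()
    if actual != sha256:
        raise ValueError("checksum mismatch: got %s, want %s" % (actual, sha256))

verify(payload, expected)  # passes silently on a match
```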

We need to figure out a way to map the values passed in the os and arch attributes to something Bazel can understand. Let's create a simple mapping from an attribute value to a Bazel label pointing to a @platforms// target.

# internal/python_interpreter.bzl
_OS_MAP = {
    "darwin": "@platforms//os:osx",
    "linux": "@platforms//os:linux",
    "windows": "@platforms//os:windows",
}

_ARCH_MAP = {
    "x86_64": "@platforms//cpu:x86_64",
}

Where do these labels come from? They come shipped with Bazel, but you can find their implementation here. The target_compatible_with and exec_compatible_with attributes of the toolchain will accept these labels, as they return the appropriate providers.

Let's go ahead and generate some Starlark code for a label list of platform targets to get injected in the generated build file.

ctx.report_progress("generating build file")
os_constraint = _OS_MAP[ctx.attr.os]
arch_constraint = _ARCH_MAP[ctx.attr.arch]

constraints = [os_constraint, arch_constraint]
    
# So Starlark doesn't throw an indentation error when this gets injected.
constraints_str = ",\n        ".join(['"%s"' % c for c in constraints])

We're taking our platform labels and generating Starlark code for a list literal. Since Starlark relies on indentation for scoping blocks of code, we need to prefix the strings with whitespace, ugh. constraints_str will get propagated to the toolchain in the build file, which will end up looking something like this.

# Generated BUILD.bazel file
toolchain(
    name = "toolchain",
    exec_compatible_with = [
        "@platforms//os:osx",
        "@platforms//cpu:x86_64",
    ],
    target_compatible_with = [
        "@platforms//os:osx",
        "@platforms//cpu:x86_64",
    ],
    toolchain = ":py_runtime_pair",
    toolchain_type = "@bazel_tools//tools/python:toolchain_type",
)
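Because the join is plain string manipulation, it's easy to sanity-check outside Bazel with ordinary Python:

```python
# Mirror of the rule's join logic, runnable outside Bazel.
_OS_MAP = {"darwin": "@platforms//os:osx"}
_ARCH_MAP = {"x86_64": "@platforms//cpu:x86_64"}

constraints = [_OS_MAP["darwin"], _ARCH_MAP["x86_64"]]
constraints_str = ",\n        ".join(['"%s"' % c for c in constraints])

print(constraints_str)
# "@platforms//os:osx",
#         "@platforms//cpu:x86_64"
```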

Now all that's left is templating the BUILD.bazel, which is again made easy by the repository_ctx.

# Inject our string substitutions into the BUILD file template, and drop said BUILD file in the WORKSPACE root of the repository.
substitutions = {
    "{constraints}": constraints_str,
    "{interpreter_path}": ctx.attr._interpreter_path,
}

ctx.template(
    "BUILD.bazel",
    ctx.attr._build_tpl,
    substitutions = substitutions,
)

return None

That's it! The implementation is very straightforward. I believe the concepts are more difficult to grasp than the implementation, but they are important to know.

Here's the entire rule. Let's take a moment to marvel and be amazed, then we'll head over to the WORKSPACE file to test out our new fancy rule.

# internal/python_interpreter.bzl
_OS_MAP = {
    "darwin": "@platforms//os:osx",
    "linux": "@platforms//os:linux",
    "windows": "@platforms//os:windows",
}

_ARCH_MAP = {
    "x86_64": "@platforms//cpu:x86_64",
}

def _py_download(ctx):
    """
    Downloads a Python distribution and registers a toolchain target.

    Args:
        ctx: Repository context.
    """
    ctx.report_progress("downloading python")
    ctx.download_and_extract(
        url = ctx.attr.urls,
        sha256 = ctx.attr.sha256,
        stripPrefix = "python",
    )
    
    ctx.report_progress("generating build file") 
    os_constraint = _OS_MAP[ctx.attr.os]
    arch_constraint = _ARCH_MAP[ctx.attr.arch]

    constraints = [os_constraint, arch_constraint]
    
    # So Starlark doesn't throw an indentation error when this gets injected.
    constraints_str = ",\n        ".join(['"%s"' % c for c in constraints])

    # Inject our string substitutions into the BUILD file template, and drop said BUILD file in the WORKSPACE root of the repository.
    substitutions = {
        "{constraints}": constraints_str,
        "{interpreter_path}": ctx.attr._interpreter_path,
    }

    ctx.template(
        "BUILD.bazel",
        ctx.attr._build_tpl,
        substitutions = substitutions,
    )

    return None

py_download = repository_rule(
    implementation = _py_download,
    attrs = {
        "urls": attr.string_list(
            mandatory = True,
            doc = "String list of mirror URLs where the Python distribution can be downloaded.",
        ),
        "sha256": attr.string(
            mandatory = True,
            doc = "Expected SHA-256 sum of the archive.",
        ),
        "os": attr.string(
            mandatory = True,
            values = ["darwin", "linux", "windows"],
            doc = "Host operating system.",
        ),
        "arch": attr.string(
            mandatory = True,
            values = ["x86_64"],
            doc = "Host architecture.",
        ),
        "_interpreter_path": attr.string(
            default = "bin/python3",
            doc = "Path you'd expect the python interpreter binary to live.",
        ),
        "_build_tpl": attr.label(
            default = "@rules_py_simple//internal:BUILD.dist.bazel.tpl",
            doc = "Label denoting the BUILD file template that gets installed in the repo.",
        )
    },
)

Using our Repository Rule and Registering the Toolchain

Now that we have a rule that downloads a Python archive and produces toolchain targets, we just need to specify our toolchains in the WORKSPACE file and register them. Before diving in, let's first think about what kind of Python distribution we want to use.

The Python interpreter by its nature has runtime dependencies on the host that do not get shipped with the distribution. This is annoying. It introduces unnecessary flakiness in our hermetic toolchain, as different host machines will have different versions of ssl, zlib, ncurses, etc. These installations can come from Homebrew, Apt, or god knows what else. For this reason, I do not recommend using the official distributions from python.org; you will be unhappy. Fortunately for us, there is an awesome open-source project, python-build-standalone, which aims to produce a "self-contained, highly portable" distribution of Python. We'll use this.

Let's define our toolchain repositories and then register them with the register_toolchains rule.

# WORKSPACE
workspace(
    name = "rules_py_simple",
)

load("@rules_py_simple//internal:python_interpreter.bzl", "py_download")

py_download(
    name = "py_darwin_x86_64",
    arch = "x86_64",
    os = "darwin",
    sha256 = "fc0d184feb6db61f410871d0660d2d560e0397c36d08b086dfe115264d1963f4",
    urls = ["https://github.com/indygreg/python-build-standalone/releases/download/20211017/cpython-3.10.0-x86_64-apple-darwin-install_only-20211017T1616.tar.gz"],
)

py_download(
    name = "py_linux_x86_64",
    arch = "x86_64",
    os = "linux",
    sha256 = "eada875c9b39cc4bf4a055dd8f5188e99c0c90dd5deb05b6c213f49482fe20a6",
    urls = ["https://github.com/indygreg/python-build-standalone/releases/download/20211017/cpython-3.10.0-x86_64-unknown-linux-gnu-install_only-20211017T1616.tar.gz"],
)

# The //:toolchain target points to the toolchain target we wrote in the BUILD file template.
register_toolchains(
    "@py_darwin_x86_64//:toolchain",
    "@py_linux_x86_64//:toolchain",
)

Our py_download rule generates Bazel repositories for each toolchain, which we can then pass to register_toolchains. At runtime, Bazel will evaluate the target and execution platforms, and select the correct toolchain for the target you're trying to build.

Let's do a quick test to see if everything works properly. Let's write a program that simply prints the version of its runtime.

# bin.py
import sys

print(sys.version)

Write a py_binary target for it.

# BUILD.bazel
py_binary(
    name = "bin",
    srcs = ["bin.py"],
    visibility = ["//visibility:public"],
)

Running our binary, it should print 3.10.0 as the version (which is the one we pinned in the WORKSPACE file). And it does! The reason py_binary selects our toolchain instead of the builtin autodetecting one is that we registered our toolchain under @bazel_tools//tools/python:toolchain_type, the toolchain type that py_binary natively expects.

$ bazel run bin
INFO: Analyzed target //:bin (15 packages loaded, 92 targets configured).
INFO: Found 1 target...
Target //:bin up-to-date:
  bazel-bin/bin
INFO: Elapsed time: 0.531s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
3.10.0 (default, Oct 18 2021, 00:33:47) [Clang 13.0.0 ]

The learning curve for extending Bazel is steeper than that of traditional build systems like Make, and learning how to write effective rules will take some time. As I said earlier in the post, the implementation of this stuff tends to be very straightforward, but the surrounding concepts are complex. I recommend you look around the example repo to help you conceptualize how everything fits together.
