Recently I've been thinking about operational risk (op. risk). Operational risks arise from failures of processes, for instance a missing email, or an automated software system not running properly. Many commercial institutions are interested in minimizing op. risk, since it is risk that produces no value, as opposed to risks associated with investing. This is also something I think about in my job at Lab49, where I'm a software engineering consultant focusing on financial institutions. I think there is also a good analogy for Conda-Forge, even though we are not a commercial outfit. In this case the risk we incur isn't the potential for lost earnings but frustration for our users and maintainers in the form of bugs and lackluster user experience. In this post I explore three main sources of operational risk for Conda-Forge: Automation, Top-Down Control, and Self-Service Structure.
A brief conda-forge primer
Conda-Forge is an ecosystem and community that grew around building packages for the conda package manager. Conda-Forge uses continuous integration services to build packages from GitHub repos called feedstocks. This structure enables teams of contributors to maintain packages via a pull request based workflow. At time of writing Conda-Forge has over 10000 feedstocks and ships more than 120 million packages a month.
Conda-Forge is built around a self-service structure for each stage in a feedstock's lifecyle. The creation of new feedstocks relies on would be maintainers to submit PRs to staged-recipes. Although language specific help teams and staged-recipes reviewers provide some assistance and oversight, the PR submitter plays the most important role in proposing the package and shepherding it to acceptance. Once the feedstock is accepted the maintenance is federated with most upkeep being performed by the maintainers, who have extensive permissions and control over the feedstock. If fixes or updates are needed for a package, maintainers and users are encouraged to open their own pull requests.
This structure can present a few challenges for minimizing op. risk. The most important challenge is the disconnect between feedstock maintainers and users. While most maintainers are package users, most of our users are not maintainers, and are unlikely to become maintainers. The disparity between maintainers and users can come from a few sources, some under our control and others not. For instance we can write better documentation, lowering the barrier to entry, but we don't have control over how our user's incentive structures value Conda-Forge contributions. This produces a gap in representation in the Conda-Forge organizational structure, where non-maintainer users' issues and desires are not communicated to maintainers and Core.
For instance, are we servicing the needs of developers using our binaries as dependencies to code they are compiling locally. As another example, are there support gaps for developers and scientists using Conda-Forge in academic and government laboratories, who might not have the skills or capacity to fix feedstocks. Our reliance on the public GitHub platform may prevent some users without access from raising their concerns. Since these users may be under-represented we don't even know if we are meeting their needs and how best to help.
While the majority of Conda-Forge's permissions structure is federated, certain important parts are centralized, with the Core developers making key decisions. Often these decisions are focused on stability of the ecosystem, for instance what versions of languages to support. Additionally, maintenance and enhancements to the Conda-Forge infrastructure are mostly performed by Core developers.
However, the Core developers are usually experienced feedstock maintainers, expert conda users, and have bought into the Conda-Forge ecosystem and mission. This means that decisions can be made without the perspective of new users or maintainers, or from potential users that are skeptical of the Conda-Forge approach.
For instance, decisions about application binary interface pins are usually made by core, although these changes have impacts on downstream maintainers. It is possible that most maintainers don't know about what these pins are, how they are changed and how that affects their feedstocks.
Automation has been used to great effect to make Conda-Forge possible. The various bots and web services enable Conda-Forge's current scale, providing help and support from running builds, bumping versions, and checking feedstock quality. However, this automation presents its own operational risks and magnifies existing operational risks.
Automation has a tendency to fail when we least expect it and often we lack the ability to fix it. The January 2018 Travis-CI outage is a great example of this, where the CI service we were using for macOS builds experienced reduced capacity and then a complete outage, causing builds to queue for days. Recently there was a sudden decrease in the number of parallel builds on Azure causing a similar queue of builds. Automation can cause issues by enabling users to make decisions without all the needed information. While many feedstocks have effective smoke tests for their packages the autotick bot doesn't currently check for new dependencies, potentially leading to missing or incorrect package metadata.
Overall Conda-Forge has managed its operational risk well. Most importantly Conda-Forge's transparent open source nature allows us to address these issues head on by engaging with the community.