tl;dr Depending on specific version numbers of underlying libraries may be too inaccurate and cause headaches as upstream libraries evolve and change. A more detailed approach is needed. In this post I outline current and potential work on a path towards a more complete inspection of requirements based on APIs and dynamic pinning of libraries.
What Constitutes a Good Version Number
Version numbers should constitute a set that has the following properties
- The set must be unbounded
- The set must be orderable (maybe)
Of course sets that meet these requirements might not convey a lot of information about the software they represent other than if two things are equivalent and their comparative ages. Note that the requirement to be orderable may not be needed, but is generally useful when considering the idea of an "upgrade" since it provides a clear delineation between older and newer packages. In many cases, the structure of the version number provides additional information. For some projects the version number includes the date of the release, often using cal ver. Many projects use semantic versioning, which attempts to encode information about the underlying source code's API in the version number.
Version Numbers and API Pinning
One of the most important places where version numbers are specified is during the pinning of APIs. Source code often requires specific APIs from the libraries it uses. This requires a pin specifying which versions of the underlying libraries can be used. The package manager then uses these pins to make certain a compatible environment is created.
However, these pins (or even the lack of pins) produce problems.
Firstly, the pins are a one-time, local statement about the current and
future, global ecosystem of packages. For instance a pin of
the current major version number may not hold up over time, newer
scipy may break the API while not changing the major
version number. Similarly the lack of pin for
scipy could be false as
the API breaks. Even pins that establish firm upper and lower bounds may
be false as new versions of the pinned library restore the missing API.
These issues are particularly problematic for dependency systems that
tie the pins to a particular version of the source code, requiring a new
version to be created to update the pins. Conda-Forge is able to avoid
some of these issues via repodata
dynamically updating a package's stated requirements. Overall this
process is fraught, as each package depends on different portions of a
library's API, a version bump that breaks one package may leave others
A Potential Path Forward
All of the above issues are caused by the confusion of the map for the territory. The map, in this case the version number of a library, can not accurately represent the territory, the API itself. To fix this issue we need a more accurate description of the territory. Achieving this will not be easy, but I think there is an approach that gets close enough to limit the number of errors.
We need a programmatic way to check if a particular library, for a particular version, provides the required API. I think this can be achieved iteratively, with each step providing additional clarity and difficulty of implementation. Note that in the steps below, I'm using python packaging as an example, but I imagine that these steps are general enough to apply to other languages and ecosystems.
- Determine which libraries are requirements of the code, this is provided by tools like depfinder and are starting to be integrated into the Conda-Forge bot systems (although they are still highly experimental and being worked on).
- Determine if the a version of the library provides the needed modules. This could be accomplished by using depfinder to find the imports and use the mapping provided by libcfgraph between the import names and the versions of packages that ship those imports.
- Determine if an imported module provides the symbols being imported. This would require a listing of all the symbols in a given python module, including top level scoped variables, function names, class names, methods, etc.
- For callables determine if the used call signature matches the method or function definition.
The depfinder project has made
significant advances along this path, providing an easy to use tool to
extract accurate import and package requirement data from source code.
Depfinder even has cases to handle imports that are within code blocks
that might make the requirement optional or use the python standard
library. Future work on depfinder, including using more accurate maps
between imports and package names and providing metadata on package
requirements that are collectively exhaustive (for instance imports of
pyqt5 in a
try: except: block), will provide even more
accurate information on requirements.
At each one of the above stages we can provide significant value to users, maintainers and source code authors by helping them to keep their requirements consistent and warning when there are conflicts. Conda-Forge can update its repodata as new versions of imported libraries are created, to properly represent if that version is API compatible with it's downstream consumers. Additionally the tables that list all the symbols and call signatures can be provided to 3rd party consumers that may want to patch their own metadata or check if a piece of source code is self consistent in its requirements. This will also help with the loosening of pins, creating more solvable environments for Conda-Forge and other packaging ecosystems. Furthermore, as this tooling matures and becomes more accurate it can be incorporated into the Conda-Forge bot systems to automatically update dependencies during version bumps and repodata patches, helping reduce maintenance burden.
Tools built from the symbol table can also have impacts far beyond Conda-Forge. For instance, the symbol tables could allow source code authors to have a line by line inspection of their code, revealing which lines force the use of older or newer versions of dependencies. This could enable large scale migrations of source code with surgical precision, enabling developers to extract and re-write the few lines of code preventing the use of a new version of a library.
There are some important caveats to this approach that need to be kept in mind.
- All of this work is aimed at understanding the API of a given library, this approach can not provide insight into the code inside of the API, or if changes there impact downstream consumers. For instance, version updates that fix bugs and security flaws in library code may not change the API at all. From this tooling's perspective there is no reason to upgrade since the API is not different. Of course there is a strong reason to upgrade in this case, since buggy or vulnerable libraries could be a huge headache and liability for downstream code and should be removed as quickly as possible.
- Some features may depend on broader adoption by the community. For instance, this approach would benefit greatly from python type hints, since the API could be constrained down to the expected types. Such type constraints would provide much more accuracy to the API version range as any changes could be detected. However, type hints may not be adopted in the python community at a high enough rate to truly be useful for this application.
- Source code is fundamentally flexible. There may be knots of code that even this approach could not cut through, especially as multiple languages and runtime module loading come into the picture. My personal hope would be that the code recognizes when these situations occur, provides its best guess of what is going on, and provides sufficient metadata to users so that they understand the decreased accuracy of the results. Fundamentally the tooling can only provide very educated guesses and context to users, who then need to go figure out what is actually going on inside the code.
Version number based pins are imprecise representations of API compatibility. More accurate representations based on source code inspection would make the Conda-Forge ecosystem more robust and flexible while reducing maintenance burden. Some of the path to achieving this is built, and near future steps can be achieved with current tooling and databases.