
Compilation concepts

There are two main models for executing source code on a particular system: it can either be compiled into machine code, or interpreted. Programming languages such as C, C++ and Rust usually follow the former path: the source code is compiled into machine code and then linked into binaries that can be used directly, irrespective of the original sources. On the other hand, languages such as Python or shell script tend to be run via an interpreter that processes and runs them directly from the source code. Both approaches have their advantages: compiling into machine code can produce efficient stand-alone executables, at the cost of portability; an interpreter, on the other hand, can run programs straight from their sources and handle code changes immediately, at the cost of worse performance.

Strictly speaking, programming languages aren't bound to the compiler-interpreter dichotomy. There do exist interpreters for languages like C, and there do exist compilers that can turn Python code into stand-alone native executables. Nevertheless, most languages primarily follow a single model, and packagers rarely need to be concerned about the alternative possibilities.

Many programming languages fall somewhere along the compiler-interpreter spectrum. Some interpreted languages use Just-in-Time (JIT) compilation: rather than interpreting the input directly, they compile it into machine code at runtime, achieving better performance at a one-time startup cost. Python interpreters commonly compile source code into more efficient bytecode upon loading. On the other hand, Java sources need to be compiled ahead of time into bytecode that is afterwards executed by a Java Virtual Machine.

This document specifically focuses on the concepts related to languages compiled into machine code, such as C, C++ and Rust.

The compilation pipeline

A typical pipeline in a C-style language consists of three elements:

  1. A preprocessor, which is responsible for the initial processing of the source code, handling #include directives and substituting macros. Preprocessing is usually done as part of the compilation and is not visible to the user.
  2. A compiler backend, which is responsible for processing the source code and outputting object files (.o on Unix, .obj on Windows) containing compiled machine code or bytecode.
  3. A linker (or link editor), which is responsible for combining multiple object files together, along with the program's dependent libraries, and outputting the actual binary.

Typically, these programs are not run separately, but rather invoked automatically via the compiler frontend. Compiler frontends are in turn usually invoked by build systems, from simple Makefiles to complex solutions such as CMake or Meson. These often split the compilation and linking steps, compiling every source file into an object file separately, so that multiple compiler processes can run simultaneously and benefit from multiple logical CPUs.

Build systems

While compiling trivial projects manually may be feasible, most software involves a degree of complexity that justifies using a build system. A build system provides a layer of automation that takes care of compiling, testing and installing the software, while providing a degree of configurability for end users and support for multiple platforms and toolchains.

The build system pipeline typically involves four stages:

  1. Configuration -- which covers accepting user input as to how the package should be built, discovering the necessary tooling and dependencies, and preparing the actual build.
  2. Compilation -- which covers compiling all source files into binaries, as well as any other necessary processing. Many software packages can be run from the build directory after this stage.
  3. Testing -- which covers running the package's test suite to verify that it is working correctly. In some build systems, tests are compiled in this stage; in others, they need to be enabled at the configuration stage and built at the compilation stage.
  4. Installation -- which covers installing the compiled files and the package's data files to the actual system, or to a staging directory.

Projects usually adopt one of the existing build systems, such as GNU autotools, CMake or Meson. These are often not entirely standalone tools, but rather serve the purpose of configuring the build and generating build scripts for other tools, such as make, ninja or various Integrated Development Environments. These tools in turn consume the build scripts to perform the actual build.

Symbols

Code written in C-style languages contains functions, global variables and so on, which are compiled into what are collectively called symbols. The compiled binaries contain symbol tables that map symbol names to their compiled counterparts, and therefore enable programs to reference them across code units and across different binaries. This enables not only splitting individual programs into separate code files, but also creating libraries of code that can be used across different projects.

These programming languages split interface and implementation. The compiled objects and binaries contain the implementation, which is sufficient for already compiled programs to use them. However, in order to compile a new program, the compiler needs to additionally know the interface; for example, the function prototype that provides the function name, arguments and return type.

Typically, this split is accomplished through splitting the program code across two types of files: header files specifying the interface, and source files containing the implementation. The source files are compiled into the actual binary, while header files are used during the compilation but also distributed along with the compiled library afterwards.

Note that in some cases the header files may contain implementation in addition to interface. For example, the implementation of inline functions is provided in header files, so that the compiler can use it while compiling other programs.

API and ABI

A C-style library essentially defines two related interfaces: an Application Programming Interface (API) that is used by the compiler, and an Application Binary Interface (ABI) that is used at runtime. Generally, the API is defined by the header files, and ABI is inferred from it. Both concepts are critical for compatibility between libraries and the programs using them.

The API defines the interface that is used in the source code of programs. Conversely, the ABI is used by compiled binaries. For example, consider a library with the following prototype:

void foo(int32_t a);

Such a function accepts a single int32_t parameter. From the programmer's perspective, it can accept any argument that can be converted into an int32_t. However, from the binary perspective, the library has a strict contract with the program that an int32_t value must be passed.

Now consider that the library changes the prototype to:

void foo(int64_t a);

From the programmer's perspective, this can be fine, as long as the previous int32_t input can be converted into an int64_t. However, the binary contract changes -- a previously compiled program passes an int32_t value where a wider int64_t type is now expected. This is a trivial example of an ABI breakage. If a program compiled against the old prototype were run against the new library, the results would be arbitrary, ranging from crashes to hard-to-debug bugs affecting other code (so-called Heisenbugs).

Systems often feature mechanisms to protect against this class of issues. For example, shared libraries often use various versioning schemes to ensure that the programs remain linked to a single compatible version, and need to be recompiled to use the ABI from a new version.

Note that ABI incompatibilities are not limited to deliberate changes of program interface. They can also be caused by using different compilation parameters, different compilers or even compiler versions.

Linking to libraries

In order to use a library, the program needs to link to it. It can link either statically or dynamically. Static linking means that the library code is embedded into the program directly, and the library file is no longer needed at runtime. Dynamic linking means that the program merely carries a reference to an external library, and the library file is loaded when the program starts. Both approaches to linking have their use cases, and their proponents.

Static linking creates a standalone program that is easier to distribute, and may benefit from additional optimization as the optimizer is able to determine how the library is used exactly. However, statically linked programs are less space efficient, especially if the same library is used across multiple programs. Since a specific library version is embedded into the program, the risk of breakage on updates is minimized. However, this means that in order to update the library, the whole package needs to be rebuilt, which may negatively impact security response time if the library turns out to be vulnerable.

Dynamic linking creates programs that reuse a shared copy of the library. As such, the library either needs to be installed separately or distributed along with the program. The main advantage is that the same library is shared across multiple packages, and can be quickly swapped for another version as necessary. However, this requires taking special care that different library versions remain compatible in the ABI exposed to programs.

In conda-forge, dynamic linking to libraries provided by conda-forge packages is strongly preferred. Many of the concerns related to dynamic linking do not apply here, as proper packaging practices ensure that library dependencies are annotated and installed in compatible versions.

Development files

Building against libraries requires additional development files to be available. In conda-forge, these files may be distributed as part of the package installing the library itself; or split into separate "development" packages, depending on criteria such as their complexity, popularity and size.

The necessary development files include:

  • header files and include files, providing function prototypes and inline code from the library,
  • static libraries, needed for static linking,
  • shared libraries or import libraries, needed for dynamic linking,
  • pkg-config files or build system-specific files, used to indicate how to build against the library.

Include files are installed into the include directory or its subdirectories. For C, they usually use a .h or .inc suffix. For C++, a .hpp suffix is sometimes used, or names without any suffix, following the standard library #include scheme.

Static libraries are installed into the lib directory and carry a .a suffix on Unix, or .lib on Windows. On Unix, shared libraries are used directly for linking; they are described in the binaries section. On Windows, import libraries are used instead; they use the .lib suffix, like static libraries.

Finally, packages often provide additional files that are used at build time to determine how to compile against the library in question. For example, these can include pkg-config files, CMake files, autotools macros. These files are usually used by the build systems.
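For illustration, a hypothetical pkg-config file for a library libfoo might look as follows (all paths, names and versions here are invented):

```
prefix=/opt/env
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: libfoo
Description: An example library
Version: 1.2.3
Libs: -L${libdir} -lfoo
Cflags: -I${includedir}
```

A build system would then run `pkg-config --cflags --libs libfoo` to obtain the compiler and linker flags needed to build against the library.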

Binaries

The primary kind of artifacts produced by compiled programming languages are binaries. In this context, binaries are files containing machine code. There are three main kinds of binaries:

  1. Executables: programs that can be run directly by the user.
  2. Shared libraries: collections of compiled code that programs link to at build time, and that are therefore loaded when the program starts.
  3. Loadable modules: collections of compiled code that are loaded by programs at runtime (e.g. plugins).

On Unix platforms, executables don't have any suffix, and common shells only start executables when their filename matches the specified command exactly. On Windows, executables commonly use the .exe suffix, and shells account for that. For example, the Python executable is named python on Unixes and python.exe on Windows; in both cases, typing python will execute it. Executables are usually installed into the bin directory, except on Windows, where they may be installed into a variety of directories, including the top-level prefix directory and the Scripts tree.

On Unixes, shared libraries use filenames with a lib prefix and a .so suffix, except on macOS, where a .dylib suffix is used instead. They are installed into the lib directory. They often include a version string to indicate ABI compatibility between different library versions, as explained in shared library versioning.

On Windows, shared libraries use .dll suffix, and no obligatory prefix. They are installed along with the executable programs (usually under bin directory or equivalent). There is also no standard filename versioning scheme, though many libraries include a version in the filename. The .dll files are only used at runtime. To build programs against a shared library, an additional import library of .lib format must be used, which essentially describes the (visible) content of a .dll file to use.

On most systems, loadable modules are the same file type as shared libraries. macOS is an exception to that: the binaries explicitly distinguish between shared libraries and "bundles", as loadable modules are called. The recommended suffix for these files is .bundle, though much software (including Python) uses .so instead. They are usually installed into tool-specific directories.

note

The term "bundle" can be used both to refer to loadable binary files discussed here, and bundle directories used to encapsulate code and resources. These are distinct concepts, though there can be some confusing overlap, as plugins may be distributed as bundle directories.

macOS Frameworks

In addition to the traditional Unix filesystem hierarchy where binaries and development files from different packages are installed into shared bin, include, lib, etc. directories, macOS features a concept known as "frameworks". Frameworks constitute integrated packages combining shared libraries, development files and other resources in a single .framework directory.

Frameworks are installed into /Library/Frameworks and ~/Library/Frameworks directories. Multiple versions of the same framework can be installed simultaneously. They need to be explicitly included in projects (for example, using the -framework compiler option).

Conda-forge packages do not install frameworks. However, individual software may include system frameworks when built on macOS.

Shared library versioning

Shared libraries are often versioned to indicate ABI compatibility. Typically, at least two version components are used: a minor version that is incremented whenever backwards-compatible ABI changes occur (e.g. new interfaces are added), and a major version that is incremented whenever backwards-incompatible changes happen. Often additional version components are used to indicate library updates without ABI changes.

When such a scheme is used, the installed library usually consists of three files:

  • the actual library with a full version string, such as lib{name}.so.1.2.3 or lib{name}.1.2.3.dylib,
  • a symbolic link including only the major version, such as lib{name}.so.1 or lib{name}.1.dylib,
  • an unversioned symbolic link, such as lib{name}.so or lib{name}.dylib.

When building a new program, the linker -- if passed -l{name} -- uses the unversioned library name. If it is a symbolic link, it is resolved to the actual library. That library is used during the linking process, and its contents (not the filename pointed to by the symbolic link) are used to determine the library version to be used at runtime.

On Linux, an entry in the library file, called DT_SONAME, specifies the filename that should be used to load the library at runtime. Typically, it corresponds to the filename with the major version, though it can be any filename; for example, libraries that do not provide cross-version compatibility at all often use the full version. This name is commonly called the "soname", and the version part itself the "soversion".

On macOS, version information embedded in the library is used instead. It consists of a major version number, a minor (current) version number and a compatibility version number. The major version number functions much like the "soversion" -- it is used to construct the "major version" symbolic link and the install name, it can be any string, and it needs to change whenever backwards-incompatible changes occur. The minor version number consists of one to three version components and indicates the current library version; it usually starts with the major number. The compatibility version number indicates the earliest version of the library that remains compatible with programs compiled against the current version.

When a program is started, the library is loaded based on the major version number. Then, the loader compares the current version number of the loaded library against the compatibility version of the library used at link time (stored in the program). If the current version is older than the compatibility version, the program refuses to start.

For example, a GNU-style lib{name}.so.1.2.3 would correspond to a major version of 1, current version of 1.2.3 and compatibility version of 1.2.0. Programs compiled against that version would be compatible with >=1.2.0,<2, but the library would also remain compatible with programs compiled against earlier versions. While Linux technically encodes the equivalent of compatibility versions in the filename, they aren't strictly enforced.

Finding shared libraries at runtime

Typically, a binary linked to a shared library does not embed the complete path to the library, but only the library name. When a program is started, the dynamic loader is responsible for finding all the needed libraries, recursively, and loading them. The exact behavior differs from platform to platform, and the way conda-forge builds binaries accounts for these differences.

The behavior of the GNU/Linux dynamic loader is documented in the ld.so(8) manpage. The following directories are searched for dependent libraries (provided the dependency entries do not specify a full path):

  1. The directories specified in the DT_RPATH entry of the program, provided DT_RUNPATH is not present. Specifying DT_RPATH is discouraged, since it takes precedence over LD_LIBRARY_PATH and therefore cannot be overridden locally.
  2. The directories specified in the LD_LIBRARY_PATH environment variable. This variable is typically set locally when library search paths need to be overridden.
  3. The directories specified in the DT_RUNPATH entry of the binary. Note that these paths do not apply recursively -- the program's DT_RUNPATH is used for the libraries used directly by the program, these libraries' entries are used for their own dependent libraries, and so on.
  4. The standard system search paths.

Furthermore, the paths in DT_RPATH and DT_RUNPATH can use the $ORIGIN placeholder to reference the directory containing the binary. Conda-forge packages typically ensure that the correct libraries are used by embedding a DT_RUNPATH pointing to the appropriate directory within the conda-forge environment, usually $ORIGIN/../lib. This can be done e.g. by linking with -Wl,-rpath,\$ORIGIN/../lib flag.

On macOS, the equivalent behavior is achieved using Run-Path Dependent Libraries. Libraries are created with install names containing a @rpath/ prefix, e.g. @rpath/libpython3.15.dylib, and such names are therefore embedded in the binaries linking to them. As described in the library search process, the dynamic loader searches the following directories (since the name contains a slash):

  1. The directories specified in the DYLD_LIBRARY_PATH environment variable.
  2. The specified path, with @rpath substituted with each of the binary's run paths.
  3. The directories specified in the DYLD_FALLBACK_LIBRARY_PATH environment variable, that defaults to system library directories.

The run paths in binaries specify the appropriate conda-forge environment directory using a @loader_path placeholder, such as @loader_path/../lib.

The Windows Dynamic-link library search order is quite complex. However, for our purposes it suffices to list the following variants:

  1. A number of overrides such as "DLL Redirection" and "Known DLLs".
  2. The directory containing the application.
  3. A number of system directories.
  4. The current directory.
  5. The directories listed in the PATH environment variable.

Note that steps 2 and 5 search program directories rather than dedicated library directories. To account for this, conda-forge generally installs .dll libraries into program directories, such as the bin directory, rather than the lib directory used on Unixes.

Architecture-dependent and architecture-independent packages

Conda-forge packages can be built for specific architectures, or made architecture-independent, also known as noarch.

When C code is compiled into binaries containing machine code, the resulting package needs to be built separately for every supported platform. Therefore, it is distributed as architecture-dependent packages. The Python language interpreter is an example of such software.

A package that installs only data files and interpreted scripts can be made architecture-independent. It can be built on any supported platform, and the build should always result in the same files being installed, irrespective of platform used to perform it. An example of this is a so-called pure Python package, i.e. a distribution that installs .py modules and no compiled extensions. The equivalent Python packaging concept is a *-none-any.whl package.

A special case of this are pure Python packages with entry points. Installed entry points are platform-specific: on Unixes they are pure Python scripts, but on Windows they are compiled executables. In order to facilitate noarch: python packaging for them, entry points are stored not as final executables, but as their original specifications. When the package is installed, the executables are recreated in the appropriate platform-specific format.

Conversely, a Python distribution that installs compiled extension modules in addition to .py modules needs to be built for every platform separately, and requires using arch-dependent packages. Furthermore, since Python extensions interface with the Python interpreter, they also need to be concerned about ABI compatibility with it. Python exposes two kinds of ABI: the regular extension ABI and the stable ABI.

The regular extension ABI preserves compatibility across patch releases of the Python interpreter, but is not compatible across major or minor versions. For example, an extension compiled against Python 3.12 cannot be used on 3.11 or 3.13. Appropriately, the respective conda-forge package needs to be built not only for all supported platforms, but also as separate variants for all supported Python versions. At the time of writing, NumPy is an example of such a package.

The stable ABI, on the other hand, guarantees forward compatibility with future minor releases of Python. Therefore, an extension built for the stable ABI of Python 3.12 can be successfully used on Python 3.13 and 3.14 (but not 3.11). This is represented by *-abi3-*.whl Python packages. In conda-forge, such packages are built for the oldest supported Python version, and therefore are independent of Python version. However, they are still architecture-dependent. An example of such a package is rustworkx.