
Tutorial on Compilation, Linking, and Dependency Management in C++

Note: the following is an in-progress draft. I recently ran across https://fabiensanglard.net/dc/ which does a good job explaining some of this.


Effective use of external (often third-party) libraries in C++ projects can accelerate the development process. These libraries are often called the dependencies of the project. Integrating libraries into a project is one of the more confusing/confounding things for students. This essay is an introductory-level description of the background, problems, and potential solutions [1]. It entirely ignores the decision-making process on whether to use a third-party library in the first place.

Background on Compilation, Linking, and Running

What follows is a brief, high-level description of how we go from source code (what is in your editor or IDE) to an executing program. This is broadly referred to as the build process. I ignore many details just to set the stage for talking about libraries.

C++ can be hosted (running on top of an OS) or non-hosted, so-called bare metal. This distinction is not important for our purposes here, so I will assume hosted C++, since that is what you are likely more familiar with and where you are more likely to encounter dependency issues. You likely have experience with non-hosted code from embedded system design, and dependency problems can occur there as well (remember DriverLib?), but the solutions are largely the same. I will point out differences as needed.

To be sure we are speaking the same jargon, here is a glossary of terms:

  • header : a file containing C++ declarations only
  • implementation file : a file containing C++ definitions (implementations)
  • translation unit : the collection of code formed by processing all preprocessor directives in an implementation file or header file
  • object file : a file containing the machine code resulting from compilation of a single translation unit
  • executable file : a file containing a program that can be loaded and run on a host
  • shell : a program that can be used to invoke or run executable files

Note the distinction between header and implementation files is not required; headers can include definitions. But it simplifies the presentation and is usually how code is organized in C++.

Some notation (for reference below):

  • *.hpp are C++ header files
  • *.cpp are C++ implementation files
  • *.bin are object files (storing machine code)
  • *.exe are executables on a given host
  • *.archive are static library files
  • *.shared are shared library files
  • > is a generic command-line shell prompt

Note, I am purposely not using the actual program names or file extensions, which vary by OS and compiler. This means you can't copy-paste this code directly and get it to work. This is on purpose: the goal is to get you to internalize, in an abstract sense, what these programs do.

Compilation, translation units, object files, and linking

The raw material of a C++ program is source code, typically stored in a plain text file. For example mycode.cpp might contain:

int myfunc(int x){
    return x+1;
}

int main(){
    int x = 3;
    return myfunc(x);
}

The compiler is a program, running on the host, that takes source code as input (after preprocessing, a translation unit) and produces (roughly speaking) machine code corresponding to the target hardware architecture, stored in an object file. The host and target architecture do not have to be the same -- this is called cross-compiling and is very common when programming embedded systems.

Let's suppose that this machine code is written to an object file called mycode.bin, and let's use cpp.exe for the compiler executable. Running

> cpp.exe mycode.cpp mycode.bin

produces the object file mycode.bin. If the source code is not valid C++, this step fails with a compiler error and produces no output. We are all familiar with those.

The file mycode.bin contains machine code, but it is not (yet) an executable. To create one, I need to use the linker to add code specific to the operating system so that the OS can place the machine code into memory and start its execution. The linker is just another program (which may be integrated with your compiler driver); let's call it link.exe. Given our mycode.bin and the previously compiled OS object code, os.bin, we might invoke it like so:

> link.exe mycode.bin os.bin mycode.exe

which produces the file mycode.exe.

At this point we have an executable that we can (almost) run. But it is worth stopping to see how this step might fail. Suppose in the source file mycode.cpp I had named the function start instead of main (not unreasonable). This would compile just fine. However, the linking would fail, because the os.bin machine code expects to be able to call a compiled function named main, but there is no such function in our modified source (start does not match main). This is a linker error, specifically an unresolved symbol error. I imagine you have encountered this error many times as well.
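Here is a sketch of what that failure might look like with our abstract tools (the exact error message varies by linker):

> cpp.exe mycode.cpp mycode.bin
> link.exe mycode.bin os.bin mycode.exe
error: unresolved symbol 'main' referenced by os.bin

The compile step succeeds because the renamed source is still valid C++; only the link step can detect the missing symbol.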

The linker must ensure that all called functions and objects (collectively symbols) are present and are copied to the resulting executable. This is called static linking.

One complication is that some symbols may be resolved later, when the program is actually run. This is called dynamic linking. The code to be resolved at run-time is called a dynamic shared object. When linking this way, the shared object file must still be provided to the linker; however, it is not actually copied into the output executable.
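In our abstract notation, linking against a shared object might look like this (mylib.shared is a hypothetical shared library providing symbols our code calls):

> link.exe mycode.bin os.bin mylib.shared mycode.exe

The linker consults mylib.shared to resolve symbols, but its machine code stays out of mycode.exe; the shared object must be available again when the program runs.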

Note that dynamic linking happens at run-time, so it pertains to hosted C++, or to non-hosted code only when a boot loader is used at power-on.

Loading and running a program

So, given we have an executable, how do we run/execute the program? The shell invokes a function of the OS kernel called the loader. The loader performs the dynamic linking, resolving any last-minute missing symbols by either locating them in already loaded memory or loading them from a file before continuing. If this succeeds, the kernel starts the executable, resulting in a running process. All of this is done transparently by the shell when you just enter the name of the executable at the prompt and press enter:

> mycode.exe

On a non-hosted system, this process may be handled by a boot loader, which might or might not support dynamic linking. Without a boot loader, the machine code is flashed to a specific location, where it starts at power-on. In that case only static linking is used.

We all know the running process (our code) can fail in many ways (segmentation fault, anyone?), but how could the program itself fail to start running?

One obvious way is that the executable came from a different OS or target architecture. For example, you cannot directly run a macOS executable on Windows. Similarly, you can't run a Windows ARM executable on x64 Windows. If you try, it will fail to start.

Another way is that the loader could fail to dynamically link all the code in. This is typically because the dynamic shared object has not been previously loaded, is incompatible, or cannot be found on the file system to be loaded.

This frustrating state of affairs is sometimes called DLL hell, after the filename extension of shared library files on Windows, *.dll. However, it occurs on all operating systems that support dynamic linking.

As an aside, how does the loader know where to find the dynamic library files to load? It typically uses a combination of hard-coded file-system paths and an environment variable, for example PATH on Windows and LD_LIBRARY_PATH on Linux.
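For example, on Linux you might point the loader at an extra directory for a single run like this (the directory is purely illustrative):

> LD_LIBRARY_PATH=/opt/mylibs mycode.exe

This adds /opt/mylibs to the loader's search path for that invocation only.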

Building code with multiple translation units

In the example above there was just one C++ source file and two object files: one resulting from compiling mycode.cpp (mycode.bin) and one that came from the OS (os.bin).

It is common to organize C++ code into multiple files. For example, a pair of a header file and an implementation file typically defines a module of C++ code, which corresponds to a translation unit. Suppose we reorganize our code from above so that mycode.cpp now contains:

#include "mymodule.hpp"
int main(){
    int x = 3;
    return myfunc(x);
}

mymodule.hpp contains:

int myfunc(int);

and mymodule.cpp contains:

int myfunc(int x){
    return x+1;
}

all in the current directory. To build this new code we first compile both translation units:

> cpp.exe mycode.cpp mycode.bin
> cpp.exe mymodule.cpp mymodule.bin

and then link them together with the OS object code:

> link.exe mycode.bin mymodule.bin os.bin mycode.exe

to create the final executable. Note, I am ignoring precompiled headers here to keep things simple.

Why complicate things with multiple translation units? Why not just put all the code into one file?

One reason is that multiple files allow us to organize code so that related objects and functions (modules) are together and not mixed up with code from different modules. This reduces cognitive overhead in larger projects. We could just #include all these files into one and compile that. If the project were small enough, this might be a viable approach.

However, another reason is that using multiple translation units allows us to improve build times. If code in only one translation unit changes, then only it needs to be recompiled before relinking. For example, if we modify mymodule.cpp we can skip recompiling mycode.cpp, since its object code will not change. While this is a trivial saving in this example, in larger projects it can have a large effect on build times.

Keeping build times short is important for a good development workflow. We want to automate this so we don't have to manually decide which files to recompile. While this could be part of the compiler (oh how I wish it were), in C++ it is delegated to another program, generically called a build tool. The build tool keeps track of which source files are needed to compile a given object file, and which object files are needed to link a given executable, a relationship called the dependency graph. When run, a build tool looks for changes in the source files (e.g. using file-system timestamps or hashes) and uses the dependency graph to decide what to recompile and relink, minimizing the time required.
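A sketch of the dependency graph for our two-module example, written as make-style rules in our abstract notation (real build tools each have their own syntax):

mycode.exe: mycode.bin mymodule.bin
    > link.exe mycode.bin mymodule.bin os.bin mycode.exe

mycode.bin: mycode.cpp mymodule.hpp
    > cpp.exe mycode.cpp mycode.bin

mymodule.bin: mymodule.cpp
    > cpp.exe mymodule.cpp mymodule.bin

Each rule names a target, the files it depends on, and the command that rebuilds it. Note that mycode.bin depends on mymodule.hpp because mycode.cpp #includes it: editing the header forces mycode.cpp to be recompiled, while editing mymodule.cpp alone recompiles only mymodule.bin.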

Building and Using Libraries

As described above, code in different translation units can be combined to create the final executable using the linker, and combined into the running process using the OS loader.

We can also combine object files without creating an executable. The result is called a library, and libraries come in two flavors depending on when they are intended to be used:

  • Static libraries are object files bundled together into one file and are intended to be copied into the executable at link time.
  • Dynamic or shared libraries are dynamic shared object files bundled together into one file and are intended to be loaded at run-time by the OS or boot-loader.

This combining is done using a tool (sometimes built into the compiler) called an archiver. This tool generally takes flags telling it whether to generate a static or shared library.
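In our abstract notation, building each flavor from our module's object code might look like this (archive.exe and its flags are made up for illustration):

> archive.exe static mymodule.bin mymodule.archive
> archive.exe shared mymodule.bin mymodule.shared

The static library would then be handed to the linker (> link.exe mycode.bin mymodule.archive os.bin mycode.exe), while the shared one is linked as shown earlier and loaded at run-time.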

Although static and shared libraries can be used within a project, a more common use is as a way of sharing code among projects. Libraries very closely tied to the language itself, implementing functionality many programmers might want, are called standard libraries. The C++ standard library is one you should be familiar with. Others offer alternatives to, or extensions of, the standard library (e.g. Boost) or provide functionality specific to various problem domains, such as graphical interfaces, networking, and numerical programming. These are called external libraries or third-party libraries. Libraries enable developers to solve problems much faster, leveraging abstractions and code already written and tested by others. However, they come at a cost, as we shall see.

One question you might have is: why is there a distinction between static and dynamic libraries at all? Would it not be simpler to just statically link everything, and simplify the OS loader so that missing shared libraries are not a problem?

The answer is that it is possible to do this, and it might even be a good idea in a non-hosted setting, or even a hosted one running a dedicated program. Some languages, like Go, take this route to avoid exactly this issue.

The benefit of dynamic/shared libraries is hinted at in their second common name: shared. On a system consisting of many concurrently running processes, each possibly using the same libraries, every process would otherwise have its own duplicate of the library code loaded, increasing the amount of memory used. Significant memory savings can occur if instead multiple running processes, each using the same library, share the same (read-only) memory holding that library's object code.

This was more of an issue when memory was very limited, but it still has relevance for system performance, and in any case we are stuck with it in hosted C++.

ABI

The output of compiling is machine/object code. For a given translation unit, the object code is specific to:

  • the target hardware architecture
  • if hosted, the operating system and possibly its version
  • the compiler used and possibly its version and the options used

This combination is called the application binary interface, or ABI. Changing any of the above, with no changes to the original source, leads to a possibly incompatible change in the ABI for a specific translation unit's machine code, meaning that the linker or loader will fail. This is called breaking the ABI, and it is a particular problem for C++ since:

  • C++ code targets a large number of hardware architectures
  • C++ programs run on many different operating systems
  • there are multiple C++ compilers (MSVC, gcc, Clang, Intel, etc.) and versions supporting different versions of the language: C++98, C++11, C++17, C++20, etc.

As of 2023, C++ does not have a standard ABI defined, so each combination of these leads to a different, possibly incompatible, ABI. For example, mycode.cpp might compile to machine code:

  • mycode-x86-msvc-win32.bin
  • mycode-amd64-msvc-win32.bin
  • mycode-amd64-gcc_11-win32.bin
  • mycode-amd64-gcc_11-linux_5.3.1.bin
  • and on and on ...

In practice, not every combination is incompatible, as OS and compiler vendors go to great lengths (sometimes at the cost of improvements) to preserve backward compatibility and not break the ABI. However, libraries often change, adding features and fixing bugs, and depending on how these changes are integrated into your project, they increase the chance of ABI incompatibilities.
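One visible symptom of ABI differences is name mangling: compilers encode a function's signature into the symbol name stored in the object code, and different compilers encode it differently. For the myfunc declaration from earlier:

int myfunc(int);

// gcc/Clang (Itanium C++ ABI) mangle this to:  _Z6myfunci
// MSVC mangles it to something like:           ?myfunc@@YAHH@Z

Object code produced with one mangling scheme cannot resolve symbols emitted with another, so linking fails even though the source code is identical.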

Note that interpreted or JIT-compiled code suffers much less from these issues, as the variations are handled by the run-time itself. This is part of the popularity of languages like JavaScript, Python, and Java. It is also less of a problem for languages with a single compiler implementation, like Rust, since one need only worry about architecture and OS differences.

Integrating External Libraries

The cost associated with external libraries that concerns us here is that they complicate the build process significantly. The root cause of this depends on how the library is distributed.

Ignoring licensing and other concerns, external libraries come in two major kinds:

  • Open (proprietary or non-proprietary) source: all code, both declarations and definitions, is available for the library. Note I am using open source here in the sense of available, not necessarily free to share or reuse.
  • Closed source: only declarations are available (headers), with an already compiled library file (static or shared) provided for each compatible OS-compiler-version (ABI) supported.

Note this distinction is not always clear. Open source libraries may also be distributed in closed form.

Open Source Libraries

Adding open source libraries to your project could be as simple as copying them into your source tree and adding them to your build system. This complicates tracking and updating files from the original project (called the upstream); however, this can be handled using version control tools like git submodules.

A bigger issue is that most libraries themselves have many translation units compiled from many different files with specific compiler settings. Handling this is the job of the build system as described above. However, C++ does not have a single build system. If library alpha uses build tool B1 and library beta uses build tool B2, how do you get them to work with your build tool (B1, B2, or some other) to orchestrate the overall build? This is the central problem when integrating open source libraries into your project.

In some cases, for example libraries composed entirely of templates, the library source consists only of header files. This kind of library is particularly easy to integrate. So much so that many vendors have started providing "header-only" versions of their open source libraries by concatenating the source into a single header, often several thousand lines long. The advantage of this is ease of integration, since one need only add this file or set of files to the include path of their build tool. The disadvantage is that it leads to longer build times, larger object files, and virtually unreadable/unmodifiable source code. A good example of this is the popular Catch test framework for C++.
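To see why template libraries end up header-only, consider a generic version of our module. The compiler must see the full definition in every translation unit that instantiates the template, so the definition lives in the header (an illustrative sketch):

// mymodule.hpp -- header-only: the definition itself is here
template <typename T>
T myfunc(T x) {
    return x + 1;   // instantiated separately for each T used
}

Any implementation file can now just #include "mymodule.hpp" and call myfunc(3) or myfunc(3.0); there is no mymodule.bin to compile or link.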

Closed Source Libraries

Closed source distribution is often used for proprietary code libraries, as the source itself is not visible (just the API, via the headers). This offers more intellectual-property protection than just the license or non-disclosure agreement that might come with a proprietary open source library.

However, another reason, and the reason some open source libraries are also distributed in closed form, is that it removes the build tool compatibility problem. You just add the headers and library files (static and/or dynamic) to your build tool. Unfortunately, though, the ABI problem rears its head: you have to use a library compiled with a compatible ABI, or it will not link with your code, or it will fail to load at run-time. You also need a way to store the library as part of your project repository, but as these are (often large) binary files, this is not ideal. It might not be a big deal for single-architecture, single-OS projects using one or two external libraries, but for cross-platform development and projects using many external libraries, it quickly becomes a pain.

Diamond Problem

If the above were not bad enough, a further complication is that libraries themselves often depend on other libraries, and sometimes on the same library. The relationship of libraries to one another is also called the dependency graph. For example, your project needs library A and library B, both of which need library C, but A needs version 1.2.3 of C and B needs version 2.1.0. This is called the "diamond problem", since the dependency graph forms a diamond shape (sketched below), but version 1.2.3 and version 2.1.0 cannot both be in the same executable (you would have multiple definitions).
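The shape that gives the problem its name:

        your project
         /        \
    library A   library B
         \        /
         library C
  (A wants 1.2.3, B wants 2.1.0)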

Note this is just the simplest case; dependency graphs can get complicated, with many conflicts like the diamond problem.

Summary

The set of libraries your project needs to build is called its dependencies.

Now we come to the central point of this essay, namely how do we manage all this?

Problems Building Code

  1. Complexity within the project

  2. Build time: how long does it take to build, and why this matters

  3. Cross-platform support

  4. Integrating open source libraries

  5. Integrating closed source libraries: ABI compatibility

Best Practices for Library Integration

Use a build tool

make, nmake, MSBuild, CMake, premake, Buck, Bazel, ....

This solves Problems 1 and 2, and to some extent Problem 3.

Problem 3 remains if your project is cross-platform but your build tool is not.

Problem 4 remains if a library uses a different build tool than the one you are using; if you have two libraries, each using a different one, then you can't just switch to one or the other.

Problem 5 still suffers from the ABI problem.

With Open Source, "Build the World"

Use the build tool of your choice and integrate the library source directly into your build system, rather than using the library's preferred build tool. Two ways to manage this:

  • manually manage the integration
  • use a package manager (vcpkg, Conan)

Detailed example

Use a specific project, OS set, and compiler set to illustrate the above.

Footnotes


  1. This is not specific to C++; C has the same issues, as do all compiled languages to some extent.