BPF CO-RE reference guide

The missing manual

BPF CO-RE (Compile Once – Run Everywhere) is a modern approach to writing portable BPF applications that can run on multiple kernel versions and configurations without modification and without compiling source code at runtime on the target machine. This is in contrast to the more traditional approach provided by the BCC framework, in which BPF application source code is compiled at runtime on the target host, which requires carrying a heavyweight compiler toolchain along with the application. Please see the blog post that introduced the concept of BPF CO-RE. It explains the problem statement, why all this is important and necessary for a lot of real-world BPF applications, and what makes it hard without kernel BTF.

As BPF CO-RE matured as an approach, practical guidance on all its features and how to use them was sorely missing. In this blog post, I'll try to fill that gap and go over all the different features that BPF CO-RE (and libbpf as its canonical implementation) provides. If you've written a BPF CO-RE application before, you have most probably used at least some of the described features. But some of them are still little known, unfortunately. Yet those little-known BPF CO-RE secrets are exactly what sometimes makes real-world BPF applications feasible, simple, and easy to implement and support, avoiding the complexities of on-the-host compilation or of pre-compiling multiple variants (flavors) of the same BPF application, each targeted at a different kernel.

This post is large, but it's also intended as a reference guide of sorts, so keeping it as one big piece felt better than splitting it into chunks and posting them gradually over a few weeks. It is divided into three sections, going from the most commonly used features towards more advanced and less frequently needed ones, hopefully following the natural progression of someone who is just getting started with writing BPF applications in the BPF CO-RE paradigm.

Throughout this article I'll assume that you are using vmlinux.h, which provides CO-RE-relocatable type definitions for the kernel and can be generated with bpftool. See the libbpf-bootstrap blog post if you are not familiar with vmlinux.h. I'll also go over how to use BPF CO-RE without vmlinux.h in the section on more advanced uses closer to the end of the post.

As we go along I'll try to keep things pretty high-level without going into nitty-gritty implementation details, unless absolutely necessary. If you'd like to learn a bit more, I suggest taking a look at the bpf_core_read.h header and asking questions on the BPF mailing list.

Reading kernel data

By far the most common BPF CO-RE operation is reading the value of a field from some kernel structure. libbpf provides a whole family of helpers to make reading a field easy and CO-RE-relocatable. CO-RE-relocatable means that regardless of the actual memory layout of a struct (which can change depending on actual kernel version and kernel configuration used), BPF program will be adjusted to read the field at the correct actual offset relative to the start of a struct.

bpf_core_read()

The most basic helper to read a field in a CO-RE-relocatable manner is bpf_core_read(dst, sz, src), which will read sz bytes from the field referenced by src into the memory pointed to by dst:

struct task_struct *task = (void *)bpf_get_current_task();
struct task_struct *parent_task;
int err;

err = bpf_core_read(&parent_task, sizeof(void *), &task->parent);
if (err) {
    /* handle error */
}

/* parent_task contains the value of task->parent pointer */

bpf_core_read() is just like bpf_probe_read_kernel() BPF helper, except it records information about the field that should be relocated on the target kernel. I.e., if the parent field gets shifted to a different offset within struct task_struct due to some new field added in front of it, libbpf will automatically adjust the actual offset to the proper value.
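To make "adjusting the offset" a bit more tangible, here is a small user-space analogy in plain C (it is not BPF code and not how libbpf is implemented). The two struct layouts and the helpers read_at() and parent_offset() are hypothetical names invented for this sketch. The idea is only that the same logical field lives at a different byte offset in each layout, and that the reader has to use the offset that is valid for the layout it actually runs against:

```c
#include <stddef.h>
#include <string.h>

/* Two hypothetical layouts of the "same" kernel struct. */
struct task_v1 {            /* layout the program was compiled against */
    int pid;
    void *parent;
};

struct task_v2 {            /* layout found on the target kernel */
    int pid;
    long new_field;         /* added in front of parent, shifting it */
    void *parent;
};

/* Stand-in for a probe read: copy sz bytes found at base + off. */
static void read_at(void *dst, size_t sz, const void *base, size_t off)
{
    memcpy(dst, (const char *)base + off, sz);
}

/* Stand-in for the relocation step: pick the offset of `parent`
 * that is valid for the target layout. */
static size_t parent_offset(int target_is_v2)
{
    return target_is_v2 ? offsetof(struct task_v2, parent)
                        : offsetof(struct task_v1, parent);
}
```

With CO-RE, you never write anything like parent_offset() yourself: the compiler records which field is accessed, and libbpf patches the right offset into the BPF instructions at load time.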

One important thing to keep in mind is that the size of the field is not automatically relocated, only its offset. So if the field you are reading is, say, a struct and its size changes, you can run into trouble. See "Sizing kernel types and fields" section on ways to deal with that. The general recommendation is to not read struct fields as a whole, if at all possible. Prefer reading just the primitive fields you are ultimately interested in.

bpf_core_read_str()

Just like there is the bpf_probe_read_kernel() and bpf_probe_read_kernel_str() duo of BPF helpers, the former reading a specified number of bytes and the latter reading a variable-length zero-terminated C string, there is also a bpf_core_read_str() counterpart to bpf_core_read(). It works just like bpf_probe_read_kernel_str(), except it records CO-RE relocation information for the source character array field, which contains a zero-terminated C string. So bpf_core_read_str() is a CO-RE-relocatable version of bpf_probe_read_kernel_str().

Note the important but subtle difference between character array field and character pointer field. In C they can be used interchangeably when reading the string value, because an array is automatically treated as a pointer by the compiler. This distinction is very important in the context of CO-RE, though.

Let's look at the hypothetical kernel type we'd like to read:

struct my_kernel_type {
    const char *name;
    char type[32];
};

name field points to where a string is stored, but type field actually is the memory that contains a string. If you need to read a string pointed to by name with CO-RE, the proper way to handle that would be to read the value of the pointer in CO-RE-relocatable way first, and then do a plain (non-CO-RE) bpf_probe_read_kernel_str() read (examples below ignore error handling for brevity):

struct my_kernel_type *t = ...;
const char *p;
char str[32];

/* get string pointer, CO-RE-relocatable */
bpf_core_read(&p, sizeof(p), &t->name);
/* read the string, non-CO-RE-relocatable, pointer is valid regardless */
bpf_probe_read_kernel_str(str, sizeof(str), p);

The equivalent example if we need to read type string would be:

struct my_kernel_type *t = ...;
char str[32];

/* read string as CO-RE-relocatable */
bpf_core_read_str(str, sizeof(str), &t->type);

Take a second to think why the first example wouldn't work with bpf_core_read_str() (hint: you'd be interpreting pointer value as a C string itself) and why the second one can't be done as a pointer read and then string read (hint: the string itself is part of a struct, so there is no dedicated pointer, it's located at an offset relative to where t pointer points to). It's subtle and thankfully doesn't come up often, but it's extremely confusing in practice if one doesn't recognize the difference.
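If the hints above feel abstract, the following user-space sketch in plain C (not BPF code) spells out where each string actually lives. The helper names name_str_addr() and type_str_addr() are made up for illustration. For name, the pointer value stored in the struct has to be loaded first. For type, the string address is simply the struct address plus the field offset, which is exactly the part CO-RE can relocate:

```c
#include <stddef.h>
#include <string.h>

struct my_kernel_type {
    const char *name;   /* pointer: the string lives elsewhere */
    char type[32];      /* array: the string lives inside the struct */
};

/* Address of the string behind `name`: load the pointer value stored
 * in the struct first, then use that value as the string address. */
static const char *name_str_addr(const struct my_kernel_type *t)
{
    const char *p;

    memcpy(&p, (const char *)t + offsetof(struct my_kernel_type, name),
           sizeof(p));
    return p;
}

/* Address of the `type` string: the struct address plus the field
 * offset, with no intermediate pointer load. */
static const char *type_str_addr(const struct my_kernel_type *t)
{
    return (const char *)t + offsetof(struct my_kernel_type, type);
}
```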

BPF_CORE_READ()

bpf_core_read(), while allowing for a lot of control and careful error handling, is quite a burden to use directly, especially when reading fields that need to be accessed through longer chains of pointer dereferences.

Let's take a look at an example of reading the name of a running process's main executable. If you were writing plain kernel code in C and wanted to do that, you'd have to do something like this:

struct task_struct *t = ...;
const char *name;

name = t->mm->exe_file->f_path.dentry->d_name.name;

/* now read string contents with bpf_probe_read_kernel_str() */

Note the sequence of pointer dereferences, mixed in with some sub-struct accesses (i.e., f_path.dentry and d_name.name). Doing something like that with bpf_core_read() quickly turns into a mess:

struct task_struct *t = ...;
struct mm_struct *mm;
struct file *exe_file;
struct dentry *dentry;
const char *name;

bpf_core_read(&mm, 8, &t->mm);
bpf_core_read(&exe_file, 8, &mm->exe_file);
bpf_core_read(&dentry, 8, &exe_file->f_path.dentry);
bpf_core_read(&name, 8, &dentry->d_name.name);

/* now read string contents with bpf_probe_read_kernel_str() */

Granted, this is a pretty extreme example and usually the pointer dereference chain won't be as long, but the point stands: it is painful to do using this approach. And all that despite completely ignoring error handling in the example above.

To make such multi-step reads easier to write, libbpf provides the BPF_CORE_READ() macro. Let's take a look at how the above code is simplified with the use of BPF_CORE_READ():

struct task_struct *t = ...;
const char *name;

name = BPF_CORE_READ(t, mm, exe_file, f_path.dentry, d_name.name);

/* now read string contents with bpf_probe_read_kernel_str() */

Compare a "native C" example vs the one with BPF_CORE_READ():

/* direct pointer dereference */
name = t->mm->exe_file->f_path.dentry->d_name.name;

/* using BPF_CORE_READ() helper */
name = BPF_CORE_READ(t, mm, exe_file, f_path.dentry, d_name.name);

Basically, each pointer dereference turns into a comma in the macro invocation. Each sub-struct access is kept as is. Pretty straightforward.

You've probably noticed that BPF_CORE_READ() returns the read value directly and doesn't propagate errors back. If any of the pointers is NULL or points to invalid memory, you'll get 0 (or NULL) back. If you do need error propagation and handling, you have to use the low-level bpf_core_read() primitive and handle errors explicitly. In practice this is rarely a problem or a necessity.

BPF_CORE_READ_INTO()

In some cases it's necessary or more convenient to read the result into a destination memory instead of directly returning it as BPF_CORE_READ() does. E.g., a common case where direct value return won't work is when you are reading a C array (e.g., IPv4 address from the socket struct), because C language doesn't allow returning arrays directly from expressions. For such cases, libbpf provides the BPF_CORE_READ_INTO() macro, which behaves similarly to BPF_CORE_READ() but reads the value of a final field into a destination memory. Converting last example to BPF_CORE_READ_INTO() we get:

struct task_struct *t = ...;
const char *name;
int err;

err = BPF_CORE_READ_INTO(&name, t, mm, exe_file, f_path.dentry, d_name.name);
if (err) { /* handle errors */ }
/* now `name` contains the pointer to the string */

Note the additional &name argument into BPF_CORE_READ_INTO(), as well as the fact that it's possible to get error code of the last operation (i.e., reading d_name.name). Overall, BPF_CORE_READ() is much more convenient in practice and easier to read, though.

BPF_CORE_READ_STR_INTO()

For cases when the last field is a character array field (like type in the hypothetical name vs type example above), there is the BPF_CORE_READ_STR_INTO() macro. By now you can probably guess how it works: it follows the chain of pointers just like BPF_CORE_READ_INTO(), but reads the final field as a zero-terminated C string. If that's not clear, please revisit the bpf_core_read_str() section.
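As a minimal sketch of BPF_CORE_READ_STR_INTO() in use, here is a small BPF program that reads the comm of the current task's group leader. The program name handle_exec, the choice of tracepoint, and the leader_comm buffer are arbitrary choices for this illustration, and it assumes a vmlinux.h generated with bpftool as discussed earlier. Note that, like bpf_core_read_str(), the macro reports a negative value on error rather than a plain 0/non-0 status:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    char leader_comm[16];
    long ret;

    /* follow task->group_leader (pointer), then read the comm
     * character array of that task as a C string */
    ret = BPF_CORE_READ_STR_INTO(&leader_comm, task, group_leader, comm);
    if (ret < 0)
        return 0; /* read failed, nothing to report */

    bpf_printk("group leader comm: %s", leader_comm);
    return 0;
}
```

The destination size is taken from the type of the first argument, which is why the sketch passes &leader_comm (a pointer to the whole array) rather than a bare character pointer.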

BTF-enabled BPF program types with direct memory reads

Having talked about the BPF_CORE_READ() family of macros above, it's important to note that you don't always need to use them to do CO-RE-relocatable reads. Or, rather, you don't always have to "probe read" (i.e., use BPF helper to read) the memory. Sometimes you can just directly access kernel memory.

Some BPF program types are "BTF-enabled", which means that BPF verifier in the kernel knows type information associated with input arguments passed into a BPF program. This allows BPF verifier to know which memory can be safely read directly from the kernel without bpf_core_read() or bpf_probe_read_kernel() invocations. Among such BTF-enabled BPF program types are:

  • BTF-enabled raw tracepoint (SEC("tp_btf/...") in libbpf lingo);
  • fentry/fexit/fmod_ret BPF programs;
  • BPF LSM programs;
  • probably some more but I'm too lazy to go and check.

With such programs, if they are getting a pointer to some kernel type (e.g., struct task_struct *), BPF program code can do direct memory dereference and even follow the pointers. So, for the elaborate example we've been using above to demonstrate BPF_CORE_READ() usage, when using, say, fentry BPF program, all you'd have to do would be:

struct task_struct *t = ...;
const char *name;

name = t->mm->exe_file->f_path.dentry->d_name.name;

And yes, it is exactly identical to the "native C" hypothetical example. Keep in mind, though, that to get the contents of the string itself, you'd still need to use bpf_probe_read_kernel_str().

Such direct memory access is fast, convenient and simple, and you should definitely use it when possible. Unfortunately, there are still a lot of real-world cases in which you have to rely on "probe reading" explicitly. BPF_CORE_READ() is going to be your friend for the foreseeable future, so definitely familiarize yourself with it.

Reading bitfields and integers of varying sizes

Reading bitfields from BPF has always been a challenge. BPF application developers have to go to great lengths and write pretty unmaintainable and painful code to be able to extract bitfield values from kernel types. Take struct tcp_sock as an example. It has lots of useful information encoded as bitfields. Even when using BCC with its source code compilation approach, it's a major hassle and maintenance burden to extract those bitfields.

Luckily, libbpf provides two easy-to-use macros for reading bitfields in a CO-RE-relocatable way: BPF_CORE_READ_BITFIELD() and BPF_CORE_READ_BITFIELD_PROBED(). The _PROBED variant has to be used when the data has to be "probe read", just like with BPF_CORE_READ(). BPF_CORE_READ_BITFIELD() should be used only when direct memory access is available (e.g., from fentry/fexit BPF programs, see the "BTF-enabled BPF program types with direct memory reads" section above). Both macros return the value of a bitfield as a u64 integer. Here's an example of reading one of the bitfields from struct tcp_sock:

static u64 sk_get_syn_data(const struct tcp_sock* tp)
{
    /* extract tp->syn_data bitfield value */
    return BPF_CORE_READ_BITFIELD_PROBED(tp, syn_data);
}

As simple as that. With BCC, achieving the same would result in something like this (it is left as an exercise to the reader on why this works and when it would break):

static u64 sk_get_syn_data(const struct tcp_sock* tp)
{
    u8 s;
    /* get byte before tlp_high_seq */
    bpf_probe_read(&s, 1, &(tp->tlp_high_seq) - 1);
    /* syn_data is the third bit of that byte in little-endian */
    return (s >> 2) & 0x1;
}

Horror to write, read, and maintain as struct tcp_sock changes across kernel versions. With BPF_CORE_READ_BITFIELD_PROBED() this becomes a no-brainer.

It's worth noting another important property of BPF_CORE_READ_BITFIELD() and BPF_CORE_READ_BITFIELD_PROBED(). They can read not just bitfields, but any integer field as well. Whatever the actual nature of a field (bitfield or integer of any size up to 8 bytes), the macros return properly sign-extended 8-byte integers. This keeps working even if a field changes from integer to bitfield and vice versa, and it keeps working if the field changes from int to u8. As such, the BPF_CORE_READ_BITFIELD() macros are a universal way to read any integer field regardless of its nature or size.
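For the curious, the extraction itself boils down to two shifts once the bytes containing the field are loaded into a 64-bit word. The sketch below is a user-space illustration of that idea in plain C, not libbpf's actual code. The function name extract_bitfield() and its parameters are invented for this example, and in the real macros the shift amounts and signedness come from CO-RE relocations rather than from function arguments:

```c
#include <stdint.h>

/*
 * Extract a field from a 64-bit word that already holds the loaded bytes.
 * lshift pushes out the bits above the field, rshift then pushes out the
 * bits below it. Both shifts must be in the 0..63 range.
 *
 * For signed fields the right shift is done on a signed value, so the
 * field's top bit gets replicated (sign extension). This relies on the
 * compiler shifting negative values arithmetically, which GCC and Clang do.
 */
static uint64_t extract_bitfield(uint64_t word, unsigned int lshift,
                                 unsigned int rshift, int is_signed)
{
    word <<= lshift;
    if (is_signed)
        return (uint64_t)((int64_t)word >> rshift);
    return word >> rshift;
}
```

For example, a 3-bit field occupying bits 2 through 4 of the word needs lshift = 64 - 5 = 59 and rshift = 64 - 3 = 61. A full 8-byte integer is just the degenerate case with both shifts equal to 0, which is why the same code path can serve bitfields and ordinary integers alike.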

Sizing kernel types and fields

As mentioned in one of the previous sections, BPF_CORE_READ() doesn't automatically make reading fields of varying size (e.g., entire structs or arrays) CO-RE-relocatable, as it is generally pretty hard to preallocate the right amount of destination memory to accommodate any possible change in size in the kernel.

Nevertheless, there are situations in which knowing the size of a field or a type is important. To accommodate such needs, BPF CO-RE provides two helpers: bpf_core_type_size() and bpf_core_field_size(). Their use is similar to bpf_core_type_exists() and bpf_core_field_exists() (described in the next section), but instead of returning 0 or 1, they return the size of a field or type in bytes.

What you do with the value is up to you. You can pass it to bpf_core_read() as the second argument to make the read completely CO-RE-relocatable. If you are dealing with an array of structs and need to skip the first few instances, you can use bpf_core_type_size() to calculate the right byte offset to the beginning of the N-th element. Or you can use it just for debugging and reporting purposes. BPF CO-RE doesn't prescribe how you use its features.
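Here is a minimal sketch of the first use case: passing a relocated field size to bpf_core_read(). The program name handle_exec, the tracepoint, and the 32-byte destination buffer are assumptions made for this illustration. The buffer is deliberately larger than today's 16-byte comm field, and the size is clamped so the read can never overflow it:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tp/sched/sched_process_exec")
int handle_exec(void *ctx)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    char comm[32] = {};
    __u32 sz;

    /* size of task->comm on the running kernel, not the size that
     * was known at compile time */
    sz = bpf_core_field_size(task->comm);
    if (sz > sizeof(comm))
        sz = sizeof(comm);

    /* both the offset and the size of this read are relocated */
    bpf_core_read(comm, sz, &task->comm);

    bpf_printk("comm is %u bytes, task_struct is %u bytes",
               sz, (__u32)bpf_core_type_size(struct task_struct));
    return 0;
}
```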

Dealing with kernel changes and feature detection

BPF_CORE_READ() family of macros is the workhorse of BPF CO-RE, but there is more to building practical BPF applications with BPF CO-RE.

One of the very common problems BPF applications have to deal with is the need to perform feature detection. I.e., detecting if a particular host kernel supports some new and optional feature, which BPF application can use to get more information or improve the efficiency. If not, though, BPF application would rather fall back to the code that supports older kernels, instead of just failing.

BPF CO-RE provides a bunch of different mechanisms to accommodate such needs. Of course, there is nothing that prevents the use of the below described mechanisms for use cases other than feature detection, but I'll describe everything with the feature detection as a primary use case.

bpf_core_field_exists()

bpf_core_field_exists() lets you check whether a given kernel type contains a specified field. In the context of kernel feature detection, if some desired kernel feature was added together with a specific field in one of the kernel types, it's possible to detect that feature with a straightforward bpf_core_field_exists() check.

As a specific example, one way to detect whether kernel supports BPF cookie for perf-based BPF program types (tracepoints, kprobes, uprobes) (added in this commit) would be:

union bpf_attr *attr = ... /* could be NULL */;

if (bpf_core_field_exists(attr->link_create.perf_event.bpf_cookie)) {
    /* bpf_cookie is supported */
} else {
    /* bpf_cookie is NOT supported */
}

Example above assumes that BPF program has a variable of union bpf_attr * type. It can be just NULL, it doesn't really matter because the pointer itself is never read, it is only necessary for conveying type information to the compiler. For cases where there is no readily available variable of desired type, you can write equivalent check as (using C type system features):

if (bpf_core_field_exists(
        ((union bpf_attr *)0)->link_create.perf_event.bpf_cookie)) {
    /* bpf_cookie is supported */
} else {
    /* bpf_cookie is NOT supported */
}

Here, the code in the first branch of if/else would never be executed (and neither will it be verified), if there is no link_create.perf_event.bpf_cookie in union bpf_attr in the host kernel.

It's worth reiterating that such code is correctly detected by BPF verifier as a dead code, and so is never validated. This means that such code can use kernel and BPF functionality (e.g., new BPF helpers) that do not exist on the host kernel without worrying about BPF verification failures. E.g., if the first branch above were to use bpf_get_attach_cookie() helper for the BPF cookie feature, the program would be validated properly on older kernels that don't yet have that helper.

bpf_core_type_exists()

For cases where type existence itself is what matters, BPF CO-RE provides the bpf_core_type_exists() helper. Here's an example of detecting whether the kernel supports the BPF ring buffer:

if (bpf_core_type_exists(struct bpf_ringbuf)) {
    /* BPF ringbuf helpers (e.g., bpf_ringbuf_reserve()) exist */
}

Be careful to ensure that you have a struct bpf_ringbuf definition (even if empty) somewhere, otherwise you'll be checking that a bpf_ringbuf forward declaration exists, which is almost certainly not what you wanted. This shouldn't be a problem with a recent enough vmlinux.h, but be aware.

bpf_core_enum_value_exists()

It's quite useful to be able to detect the existence of a given enumerator value. One important practical application of such check is to detect support for a BPF helper.

Each BPF helper has a corresponding enum value in enum bpf_func_id:

enum bpf_func_id {
    ...
    BPF_FUNC_ringbuf_output = 130,
    BPF_FUNC_ringbuf_reserve = 131,
    ...
};

As such, the most straightforward way to check if a BPF helper bpf_xxx() exists is to check that BPF_FUNC_xxx exists in enum bpf_func_id. So instead of doing a type check in previous example with bpf_core_type_exists(struct bpf_ringbuf), we can more explicitly state our intent:

if (bpf_core_enum_value_exists(enum bpf_func_id, BPF_FUNC_ringbuf_reserve)) {
    /* use bpf_ringbuf_reserve() safely */
} else {
    /* fall back to using bpf_perf_event_output() */
}

A lot of other BPF functionality can be detected similarly. Support for specific BPF program types and BPF map types is just one other example.
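As a rough sketch of what such checks can look like, the program below probes for one map type and one program type through their enumerators in enum bpf_map_type and enum bpf_prog_type. The program name detect_features and the tracepoint are arbitrary choices for this example, and it assumes that vmlinux.h was generated from a kernel recent enough to define both enumerators (otherwise the code won't compile in the first place):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tp/sched/sched_process_exec")
int detect_features(void *ctx)
{
    /* map type support: is the ring buffer map type known? */
    if (bpf_core_enum_value_exists(enum bpf_map_type, BPF_MAP_TYPE_RINGBUF))
        bpf_printk("BPF_MAP_TYPE_RINGBUF is known to this kernel");
    else
        bpf_printk("no ringbuf, fall back to perf event array");

    /* program type support: are BPF LSM programs known? */
    if (bpf_core_enum_value_exists(enum bpf_prog_type, BPF_PROG_TYPE_LSM))
        bpf_printk("BPF_PROG_TYPE_LSM is known to this kernel");

    return 0;
}
```

Keep in mind that the presence of an enumerator only tells you the kernel sources knew about the feature. Whether it is actually enabled can additionally depend on kernel configuration.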

This functionality isn't limited to BPF-related features, of course. Any kernel feature that can be detected through the existence of a field, type, or enumerator value can be easily detected with BPF CO-RE.

Feature detection also doesn't stop at type system-based checks. In the next few sections we'll look at some other BPF CO-RE mechanisms that can be used to perform kernel feature detection. They are not just for feature detection, either: they also let you extract kernel-specific information at runtime (like Kconfig values), which often can't be known ahead of time.

LINUX_KERNEL_VERSION

Sometimes the only way to detect the presence of a necessary feature is to check the Linux kernel version. Libbpf lets you do that from BPF program code using a special extern variable:

extern int LINUX_KERNEL_VERSION __kconfig;

Once declared, LINUX_KERNEL_VERSION encodes the running kernel version in exactly the same way as it's done by the kernel itself. Such a variable can be used just like any other variable: you can compare against it, print it, record and send it to user-space, etc. In all such cases, BPF verifier knows its exact value and thus it can detect dead code, just like with type system-based checks described above.

Libbpf also provides a convenient KERNEL_VERSION(major, minor, patch) macro to be used for comparisons against LINUX_KERNEL_VERSION:

#include <bpf/bpf_helpers.h>

extern int LINUX_KERNEL_VERSION __kconfig;

...

if (LINUX_KERNEL_VERSION >= KERNEL_VERSION(5, 15, 0)) {
    /* we are on v5.15+ */
}
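To see why plain integer comparisons against LINUX_KERNEL_VERSION work, it helps to look at the encoding. The user-space sketch below mirrors the usual KERNEL_VERSION() arithmetic as I understand it (major shifted left by 16 bits, minor by 8, patch level clamped to 255). The function names encode_kernel_version() and kernel_at_least() are hypothetical and exist only for this illustration. In BPF code you'd just use the macro from bpf_helpers.h:

```c
#include <stdint.h>

/* Hypothetical helper mirroring the usual KERNEL_VERSION() encoding:
 * major in bits 16 and up, minor in bits 8-15, patch in bits 0-7
 * (patch levels above 255 are clamped to 255). */
static uint32_t encode_kernel_version(uint32_t major, uint32_t minor,
                                      uint32_t patch)
{
    return (major << 16) + (minor << 8) + (patch > 255 ? 255 : patch);
}

/* Returns 1 if the encoded `running` version is at least
 * major.minor.patch, 0 otherwise. */
static int kernel_at_least(uint32_t running, uint32_t major,
                           uint32_t minor, uint32_t patch)
{
    return running >= encode_kernel_version(major, minor, patch);
}
```

Because more significant version components occupy more significant bits, comparing two encoded values orders them the same way as comparing (major, minor, patch) tuples. Note that a strict greater-than comparison against KERNEL_VERSION(5, 15, 0) excludes v5.15.0 itself, so pick the comparison operator deliberately.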

Kconfig extern variables

In fact, libbpf lets you declare special extern variables for any kernel config (Kconfig) value. Keep in mind that this is only supported if the kernel exposes its config through /proc/config.gz, which fortunately is a very common case in modern Linux distros. There are a few different types of variables that are supported. Their use depends on the actual Kconfig value type:

  • for y/n/m tri-state Kconfig values, you can use extern enum libbpf_tristate variables which have three possible values defined: TRI_YES, TRI_NO, TRI_MODULE, respectively. Alternatively, declaring an extern char variable would capture the character value as is (i.e., you literally will have a variable with one of 'y', 'n', 'm' character values).
  • for y/n two-state (boolean) Kconfig values, you can also use bool type (in addition to already described char and enum libbpf_tristate types). In such a case, y maps to true and n will be turned into false.
  • for integer Kconfig values, use one of the C integer types: all 1-, 2-, 4-, and 8-byte signed and unsigned integers are supported. If the actual Kconfig value doesn't fit into a declared integer type, libbpf will emit an error instead of truncating the value.
  • for string Kconfig values, use const char[N] array variable. If the actual value doesn't fit, it will be truncated and zero-terminated, but libbpf will emit a warning.

Keep in mind that if the requested Kconfig value is missing from /proc/config.gz, libbpf will abort program loading with an error. To handle that gracefully, declare such a Kconfig extern variable as a weak one with the __weak attribute. In that case, if the value is missing, it will be assumed to be false, TRI_NO, '\0' (zero character), 0, or "" (empty string), depending on the type used.

Here's a quick example of declaring and using different types of Kconfig extern variables:

extern int LINUX_KERNEL_VERSION __kconfig;

extern enum libbpf_tristate CONFIG_BPF_PRELOAD __kconfig __weak;
extern bool CONFIG_BPF_JIT_ALWAYS_ON __kconfig __weak;
extern char CONFIG_BPF_JIT_DEFAULT_ON __kconfig __weak;
extern int CONFIG_HZ __kconfig;
extern const char CONFIG_MODPROBE_PATH[256] __kconfig __weak;

...

if (LINUX_KERNEL_VERSION > KERNEL_VERSION(5, 15, 0)) { ... }

switch (CONFIG_BPF_PRELOAD) {
    case TRI_NO: ...; break;
    case TRI_YES: ...; break;
    case TRI_MODULE: ...; break;
}

if (!CONFIG_BPF_JIT_ALWAYS_ON)
    bpf_printk("BPF_JIT_DEFAULT_ON: %c\n", CONFIG_BPF_JIT_DEFAULT_ON ?: 'n');

bpf_printk("HZ is %d, MODPROBE_PATH: %s\n", CONFIG_HZ, CONFIG_MODPROBE_PATH);

Relocatable enums

One interesting challenge that some BPF applications run into is the need to work with "unstable" internal kernel enums. That is, enums which don't have a fixed set of constants and/or integer values assigned to them. One good example of this is enum cgroup_subsys_id, defined in include/linux/cgroup-defs.h, definition of which can differ depending on which cgroup features are enabled during kernel compilation (see include/linux/cgroup_subsys.h for details). So, if you need to know the actual integer value of, say, cgroup_subsys_id::cpu_cgrp_id, it can be a big problem, as this enum is internal to the kernel and dynamically generated.

And again, BPF CO-RE to the rescue. It lets you capture the actual value with the help of the bpf_core_enum_value() macro:

int id = bpf_core_enum_value(enum cgroup_subsys_id, cpu_cgrp_id);

/* id will contain the actual integer value in the host kernel */

Guarding potentially failing relocations

It's not unusual for some fields to be missing on some kernels. If a BPF program attempts a read of a missing field with BPF_CORE_READ(), it will result in an error during BPF verification. Similarly, CO-RE relocations will fail when getting enum value (or type size) of an enumerator (or a type) that doesn't exist in the host kernel.

Currently the error is quite cryptic, unfortunately (but will be improved by libbpf soon), so it's good to be aware of it, just in case you run into it accidentally. If you encounter an error similar to the one below, know that it's because a CO-RE relocation failed to find a corresponding field/type/enum:

1: (85) call unknown#195896080
invalid func unknown#195896080

That 195896080 is 0xbad2310 in hex (for "bad relo") and is a constant that libbpf uses to mark instructions that failed CO-RE relocation. The reason libbpf doesn't just report such errors immediately is because missing field/type/enum and corresponding failing CO-RE relocation can be handled by the BPF application gracefully, if desired. This makes it possible to accommodate very drastic changes in kernel types with just a single BPF application (which is a crucial goal of "Compile Once – Run Everywhere" philosophy).

When it is possible for some field/type/enum to be missing, you can guard such code paths with one of the checks described in the section on dealing with kernel changes. If properly guarded, the BPF verifier will know that such a code path is impossible to hit on that particular kernel, and thus will eliminate it as dead code.

Such an approach lets you capture pieces of kernel information opportunistically, if the actual running kernel does have those pieces. Otherwise, the BPF application can cleanly fall back to alternative logic and gracefully handle the missing feature or data. All this works great as long as potentially failing CO-RE relocations are guarded properly. CO-RE relocations here mean any use of the BPF_CORE_READ() family of macros, type/field size relocations, or enumerator value capturing, i.e., anything that has no meaning if the target field/type/enum doesn't exist or has some incompatible definition.

Continuing the previous example of the cpu_cgrp_id enum value, to handle kernels that might not define such an enumerator (e.g., because the CONFIG_CGROUP_SCHED Kconfig option, which controls the cpu cgroup subsystem, is not set), it's possible to use a bpf_core_enum_value_exists() check (existence checks never fail!), which returns true/false (strictly speaking, 0 or 1 in C):

int id;

if (bpf_core_enum_value_exists(enum cgroup_subsys_id, cpu_cgrp_id))
    id = bpf_core_enum_value(enum cgroup_subsys_id, cpu_cgrp_id);
else
    id = -1; /* fallback value */

/* use id even if cpu_cgrp_id isn't defined */

Above example will work just fine on any kernel, regardless of cpu_cgrp_id enumerator existence, even though bpf_core_enum_value() fails on kernels without cpu_cgrp_id enumerator. All because of the properly guarded code paths.

Advanced topics

The previous parts covered the most common CO-RE functionality. This section covers some more advanced topics that you might need to deal with, depending on how complex the internal kernel state your BPF application works with is, and on how much that state varies across kernel versions.

Defining own CO-RE-relocatable type definitions

Up until now we've assumed that kernel types used in all the above examples were coming from vmlinux.h header, generated from a recent and full enough kernel BTF. But using vmlinux.h isn't a requirement of the BPF CO-RE. It's mostly just a convenience for BPF application developers.

Also, there are sometimes more advanced situations in which vmlinux.h might not be enough. Either because the desired type isn't in the kernel BTF (yet) or because something in the kernel changed in an incompatible way (e.g., field got renamed) and now you need to deal with two incompatible definitions of the same kernel type (we'll get to dealing with this unfortunate situation below).

Whatever the reason, it's very easy to define your own expectation of the kernel type and make it CO-RE-relocatable. Let's take struct task_struct as a typical example. It is a huge and complicated struct, but usually you'd need only a few simple fields out of its entire definition. With BPF CO-RE it's enough to declare only the fields you are going to need, skipping all the rest, keeping the type definition simple and succinct.

Let's say, you only care about pid, group_leader, and comm fields. Declaring struct task_struct as below is enough to get everything working:

struct task_struct {
    int pid;
    char comm[16];
    struct task_struct *group_leader;
} __attribute__((preserve_access_index));

First, the order of fields doesn't matter. At all.

Second, __attribute__((preserve_access_index)) is necessary for BPF programs that allow direct memory reads, e.g., BTF-enabled raw tracepoints (SEC("tp_btf/...")) and fentry/fexit BPF programs. With this attribute, any direct memory reads using this struct definition will be automatically CO-RE-relocatable.

When using the explicit BPF_CORE_READ() family of macros, __attribute__((preserve_access_index)) isn't required because those macros enforce it automatically. But if you were to use the plain old bpf_probe_read_kernel() helper directly, then as long as the struct has the preserve_access_index attribute, such probe reads become CO-RE-relocated as well. So, in short, it's always a good idea to specify this attribute.

That's pretty much it. You can use such a type for any CO-RE read or check. As you can see, it doesn't have to match the real struct task_struct definition exactly. Only the necessary subset of fields has to be present and compatible. Everything else that your BPF program doesn't need out of struct task_struct is irrelevant for BPF CO-RE.

Handling incompatible field and type changes

As alluded to in previous sections, there are cases when kernel types and fields change in a way that makes type definitions from two different kernels incompatible. E.g., think about a field rename within a struct. As a very real and specific example, let's take a recent rename of task_struct's state field into __state in this commit. If you were to write a BPF application that needed to read a task's state, then, depending on the kernel version, you'd need to fetch the same field by two different names. Let's look at how BPF CO-RE lets you deal with this.

BPF CO-RE has one important naming convention (I'll call it an "ignored suffix rule"). It's a relatively little known feature, but it is a crucial mechanism for dealing with situations described above. For any type, field, enum, or enumerator, if the entity's name ends with a suffix of the form ___something (three underscores plus some text after them), that suffix is ignored for the purposes of CO-RE relocation as if it was never there.

This means that if you were to define a struct task_struct___my_own_copy and use it in your BPF application, as far as BPF CO-RE is concerned, that struct is equivalent to the kernel struct task_struct and will be matched and relocated accordingly. The same applies to field names (so state or state___custom are effectively the same) and enums (both the enum type name itself and enumerator names within that enum). It actually works both ways, so if the kernel has struct task_struct and struct task_struct___2, as an example (and it does sometimes, due to interactions between the C type system and header includes in the kernel source code), both structs will be candidates for matching struct task_struct___my defined in BPF program source code.

What this means in practice is that you can now have multiple independent and conflicting definitions of the same kernel type/field/enum, and yet be able both to compile the code as valid C and to pick the right definition at runtime based on whatever feature detection approach you are using.
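The ignored suffix rule itself boils down to a small string operation. Below is a minimal sketch of it in plain user-space C, loosely modeled on my recollection of how libbpf treats such names; the function names here are my own and should be taken as hypothetical, not as libbpf API. The sketch assumes a suffix separator is three underscores with a non-underscore character on each side, and that everything from the last such separator onward is dropped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if s starts an "X___Y" pattern, where X and Y are not underscores. */
bool is_flavor_sep(const char *s)
{
    return s[0] != '_' &&
           s[1] == '_' && s[2] == '_' && s[3] == '_' &&
           s[4] != '_';
}

/*
 * Length of the name with its ___suffix (if any) dropped.
 * Scans from the end, so only the last suffix is stripped.
 */
size_t essential_name_len(const char *name)
{
    size_t n = strlen(name);

    for (int i = (int)n - 5; i >= 0; i--) {
        if (is_flavor_sep(name + i))
            return (size_t)i + 1;
    }
    return n;
}

/* Two names match if they are equal once suffixes are ignored. */
bool same_essential_name(const char *a, const char *b)
{
    size_t la = essential_name_len(a);
    size_t lb = essential_name_len(b);

    return la == lb && strncmp(a, b, la) == 0;
}
```

Under these assumptions, task_struct___my_own_copy and task_struct___2 both reduce to task_struct, and state___custom reduces to state. A name like __state is left alone, because its underscores are a prefix and not a three-underscore suffix separator.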

Let's look at an example of how to deal with the mentioned task_struct->state rename into task_struct->__state:

/* latest kernel task_struct definition, which can also come from vmlinux.h */
struct task_struct {
    int __state;
} __attribute__((preserve_access_index));

struct task_struct___old {
    long state;
} __attribute__((preserve_access_index));

...

struct task_struct *t = (void *)bpf_get_current_task();
int state;

if (bpf_core_field_exists(t->__state)) {
    state = BPF_CORE_READ(t, __state);
} else {
    /* recast pointer to capture task_struct___old type for compiler */
    struct task_struct___old *t_old = (void *)t;

    /* now use old "state" name of the field */
    state = BPF_CORE_READ(t_old, state);
}

...

There are two most crucial pieces in the above example.

First, the field existence check, based on the latest struct task_struct definition. If the running kernel is older and doesn't yet have a __state field, bpf_core_field_exists(t->__state) will return 0, and the BPF verifier will skip the first branch of the if statement and eliminate it as dead code, so no read of t->__state is ever attempted.

Second, recasting of a struct task_struct * pointer into a struct task_struct___old * pointer. This is necessary to allow the C compiler to keep track of the type information of an "alternative definition" of struct task_struct (that is, struct task_struct___old in this case). The compiler will recognize and compile the t_old->state field reference (hidden inside the BPF_CORE_READ() implementation) as a valid C expression and will record the corresponding CO-RE relocation information to let libbpf know which type and field the BPF program expects to read.

With the ___suffix rule this all works correctly. When libbpf prepares a BPF program to be sent to the kernel for verification, it performs CO-RE relocations and adjusts the offsets properly. One of the CO-RE relocations won't be resolved (because __state and state can't both exist in the kernel at the same time) and will result in "poisoning" of the corresponding BPF instruction (recall 0xbad2310 described previously), but that instruction will be guarded by the field existence logic and eliminated by the verifier during the program load.
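The interplay between relocation, poisoning, and the existence guard can be imitated with a toy user-space model in plain C. This is only a sketch under heavy simplifying assumptions: the real libbpf and verifier logic is far more involved, real poisoning replaces the whole instruction instead of storing a marker in an offset slot, and every type and function name below (kfield, field_load, guarded_prog, relocate(), verify()) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Marker for a load whose relocation failed (the value named in the text). */
#define POISON_OFFSET 0xbad2310

/* Toy kernel "BTF": field names and byte offsets of struct task_struct. */
struct kfield {
    const char *name;
    int offset;
};

/* A field load that carries a CO-RE relocation for the named field. */
struct field_load {
    const char *field;
    int offset; /* filled in by relocate() */
};

/* Models: if (field_exists(guard)) { then_load } else { else_load } */
struct guarded_prog {
    const char *guard;
    bool guard_result; /* filled in by relocate() */
    struct field_load then_load;
    struct field_load else_load;
};

/* Returns the field's offset in the toy kernel, or -1 if it is absent. */
int kernel_offset(const struct kfield *k, size_t cnt, const char *name)
{
    for (size_t i = 0; i < cnt; i++)
        if (strcmp(k[i].name, name) == 0)
            return k[i].offset;
    return -1;
}

/* "libbpf" step: patch resolvable loads, poison the rest, resolve the guard. */
void relocate(struct guarded_prog *p, const struct kfield *k, size_t cnt)
{
    struct field_load *loads[] = { &p->then_load, &p->else_load };

    for (size_t i = 0; i < 2; i++) {
        int off = kernel_offset(k, cnt, loads[i]->field);

        loads[i]->offset = off >= 0 ? off : POISON_OFFSET;
    }
    p->guard_result = kernel_offset(k, cnt, p->guard) >= 0;
}

/*
 * "Verifier" step: the guard is a known constant by now, so only one
 * branch is reachable. The program is fine if that branch isn't poisoned.
 */
bool verify(const struct guarded_prog *p)
{
    const struct field_load *live =
        p->guard_result ? &p->then_load : &p->else_load;

    return live->offset != POISON_OFFSET;
}
```

With a guard on __state, a then-branch that loads __state, and an else-branch that loads state, this model accepts the program on a toy kernel that has only one of the two fields, whichever one it is. Point the guard at a field that doesn't select the right branch and the poisoned load becomes reachable, so verify() rejects the program.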

As BPF CO-RE applications grow in number and complexity, and as the Linux kernel evolves and inevitably goes through internal changes and refactorings, having the ability to deal with incompatible kernel changes will only grow in importance, so please take note of this technique. The above description glosses over a bunch of implementation details, but hopefully it helps to understand how to use the feature in practice.

Reading kernel data structures from user-space memory

One (admittedly unusual) need that might come up in some applications is the need to read kernel types from user-space memory. Most probably it will be one of the kernel UAPI types, passed in as a syscall input argument. To accommodate such cases (as well as for completeness), libbpf provides user-space equivalents of its BPF_CORE_READ() family of macros:

  • bpf_core_read_user();
  • bpf_core_read_user_str();
  • BPF_CORE_READ_USER_STR_INTO();
  • BPF_CORE_READ_USER_INTO();
  • BPF_CORE_READ_USER().

They function and behave exactly like their non-user variants, with the sole distinction that all the memory reads are done with bpf_probe_read_user() and bpf_probe_read_user_str() BPF helpers and thus should be passed a user-space pointer.

Capturing BTF type IDs

If you are familiar with BTF, you know that any type definition in BTF has a corresponding BTF type ID. Whether it's for debugging and logging purposes, or as part of some BPF API, knowing BTF type IDs for the types/fields/enums that a BPF program is working with might be important. BPF CO-RE provides a way to capture these BTF type IDs as integer values from inside the BPF program code. Actually, it provides a way to capture two different BTF type IDs. One for the target kernel BTF (kernel type ID) and another for BPF program's own BTF (local type ID):

  • bpf_core_type_id_kernel() returns resolved type ID from running kernel's BTF;
  • bpf_core_type_id_local() returns the type ID as recorded by the compiler during BPF program compilation.

Note, with BPF CO-RE relocations there are always two BTF types involved. One is the BPF program's local expectation of the type definition (e.g., vmlinux.h types or types defined manually with preserve_access_index attribute). This local BTF type provides the means for libbpf to know what to search for in the kernel BTF. As such, it can be a minimal definition of the type/field/enum with only a necessary subset of fields and enumerators.

Libbpf can then use the local BTF type definition to find the matching, complete kernel BTF type. The above helpers allow capturing BTF type IDs for both of the types involved in a CO-RE relocation. They could be useful for distinguishing different kernel or local types at runtime, for debugging and logging purposes, or potentially for future BPF APIs that would accept BTF type IDs as input arguments. Such APIs don't exist yet, but they are coming for sure in the near future.
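Why the two IDs differ is easiest to see with a toy model, again in plain user-space C and not real BPF code. The sketch assumes a drastically simplified BTF in which a type's ID is just its index in a table of names, with ID 0 reserved for void (as in real BTF). The toy_btf struct, the two tables, and type_id_by_name() are all made up for illustration, and matching is by exact name only.

```c
#include <stddef.h>
#include <string.h>

/* Toy BTF: a type's ID is its index in the table; ID 0 is void. */
struct toy_btf {
    const char *const *names;
    unsigned int cnt;
};

/* Types the BPF program itself describes (its local BTF). */
const char *const local_names[] = { "void", "int", "task_struct" };
const struct toy_btf local_btf = {
    local_names, sizeof(local_names) / sizeof(local_names[0]),
};

/* Types of an imaginary running kernel (many more in reality). */
const char *const kernel_names[] = {
    "void", "long", "int", "list_head", "task_struct",
};
const struct toy_btf kernel_btf = {
    kernel_names, sizeof(kernel_names) / sizeof(kernel_names[0]),
};

/* Returns the ID of the named type, or 0 if the table has no such type. */
unsigned int type_id_by_name(const struct toy_btf *btf, const char *name)
{
    for (unsigned int id = 1; id < btf->cnt; id++)
        if (strcmp(btf->names[id], name) == 0)
            return id;
    return 0;
}
```

In this model task_struct has ID 2 in the program's own table and ID 4 in the kernel's table, which mirrors the local versus kernel distinction: the same type, two unrelated numbering spaces. The 0 result for a missing type matches how bpf_core_type_id_kernel() is documented to behave when no matching kernel type is found.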

Conclusion

I hope that this post gives enough information and practical guidance to make effective use of BPF CO-RE technology. Feel free to use it creatively for your BPF needs. If anything doesn't seem right or doesn't work, please report issues to the BPF mailing list.