BPF tips & tricks: the guide to bpf_trace_printk() and bpf_printk()
Any non-trivial BPF program always needs some amount of debugging to get it
working correctly. Unfortunately, there isn't a BPF debugger yet, so the next
best thing is to sprinkle printf()
-like statements around and see what's
going on in the BPF program. BPF equivalent of printf()
is the
bpf_trace_printk()
helper. In this blog post we'll look at how to use it,
what are its limitations, and how to work around them. I'll also describe a few
important changes that happened to bpf_trace_printk()
over last few kernel
releases and how BPF CO-RE can be used to
detect and handle those changes.
Preliminaries
I'll be using libbpf-bootstrap's
minimal
example as a base for all the examples. It has everything wired up
and triggers a simple BPF program that we’ll use to test bpf_trace_printk()
output. If you'd like to follow along, make sure to clone
libbpf-bootstrap and open
minimap.bpf.c
in your editor:
$ # note --recursive to checkout libbpf submodule
$ git clone --recursive https://github.com/libbpf/libbpf-bootstrap
$ cd libbpf-bootstrap/examples/c
$ vim minimal.bpf.c
$ make minimal
$ sudo ./minimal
Intro to bpf_trace_printk()
Linux kernel provides BPF helper, bpf_trace_printk()
, with the following
definition:
long bpf_trace_printk(const char *fmt, __u32 fmt_size, ...);
It's first argument, fmt
, is a pointer to a printf
-compatible format string
(with some kernel-specific extensions and limitations). fmt_size
is the size
of that string, including terminating \0
. The varargs are arguments
referenced from format string.
bpf_trace_printk()
supports a limited subset of what you can expect from
libc's implementation of printf()
. Basic things like %s
, %d
, and %c
work, but, say, positional arguments (%1$s
) don't. Argument width specifiers
(%10d
, %-20s
, etc) works only on very recent kernels, but won't work on
earlier ones. Additionally, a bunch of kernel-specific modifiers (like %pi6
to print out IPv6 addresses or %pks
for kernel strings) are supported as
well.
If a format string is invalid or is using unsupported features,
bpf_trace_printk()
will return negative error code.
There are few more important restrictions on usage of bpf_trace_printk()
helper, unfortunately.
First, a BPF program using bpf_trace_printk()
has to have a GPL-compatible
license. For libbpf-based BPF application
that means specifying license with a special variable:
char LICENSE[] SEC("license") = "GPL";
For completeness, here are all the GPL-compatible licenses that kernel recognizes:
- "GPL";
- "GPL v2";
- "GPL and additional rights";
- "Dual BSD/GPL";
- "Dual MIT/GPL";
- "Dual MPL/GPL".
Another hard limitation is that bpf_trace_printk()
can accept only up to 3
input arguments (in addition to fmt
and fmt_size
). This is quite often
pretty limiting and you might need to use multiple bpf_trace_printk()
invocations to log all the data. This limitation stems from the BPF helpers
ability to accept only up to 5 input arguments in total.
Once you get past these limitations, though, you'll find out that
bpf_trace_printk()
dutifully emits data according to your format string to a
special file at /sys/kernel/debug/tracing/trace_pipe
. You need to be root to
read it, so use sudo cat
to watch your debug logs:
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
<...>-2328034 [007] d... 5344927.816042: bpf_trace_printk: Hello, world, from BPF! My PID is 2328034
<...>-2328034 [007] d... 5344928.816147: bpf_trace_printk: Hello, world, from BPF! My PID is 2328034
^C
Let's dissect this. <...>-2328034 [007] d... 5344927.816042: bpf_trace_printk:
part is emitted by the kernel automatically for each bpf_trace_printk()
invocation. It contains information like process name (sometimes shortened as
<...>
), PID (2328034
), timestamp since system boot (5344927.816042
), etc.
But Hello, world, from BPF! My PID is 2328034
is the part controlled by a BPF
program and emitted via a simple code like this:
int pid = bpf_get_current_pid_tgid() >> 32;
const char fmt_str[] = "Hello, world, from BPF! My PID is %d\n";
bpf_trace_printk(fmt_str, sizeof(fmt_str), pid);
Note how fmt_str
is defined as a variable on the stack. Unfortunately,
currently you can't just do something like bpf_trace_printk("Hello, world!", ...);
due to libbpf limitations. But even if it was possible, the need to
specify fmt_size
explicitly is quite inconvenient. Libbpf helpfully provides
a simple wrapper macro, bpf_printk(fmt, ...)
, which takes care of such
details, though. It is currently defined in
<bpf/bpf_helpers.h>
like this:
/* Helper macro to print out debug messages */
#define bpf_printk(fmt, ...) \
({ \
char ____fmt[] = fmt; \
bpf_trace_printk(____fmt, sizeof(____fmt), \
##__VA_ARGS__); \
})
With it, the above "Hello, world!" example becomes more succinct and convenient:
int pid = bpf_get_current_pid_tgid() >> 32;
bpf_printk("Hello, world, from BPF! My PID is %d\n", pid);
Much nicer! Unfortunately, while convenient, this implementation is not ideal,
as it has to initialize char array on the stack with the contents of the format
string every single time bpf_printk()
is called. Due to backwards
compatibility concerns, libbpf is stuck with such a suboptimal implementation
as it is the only one that will keep reliably working on old kernels and thus
won't break any BPF application, which is a high priority for libbpf as a
generic library.
I, on the other hand, am not constrained with backwards compatibility in this blog post. So I can and will show how to improve upon this implementation significantly in the rest of this post.
Improving bpf_printk()
Avoiding format string array on the stack
First thing we are going to address is the need to initialize an array on the
stack for format string. Starting with Linux 5.2, d8eca5bbb2be ("bpf:
implement lookup-free direct value access for
maps") adds support
for BPF global (and static) variables, which we are going to use here to get
rid of on-the-stack array. The change to bpf_printk()
implementation is
deceivingly minimal:
#undef bpf_printk
#define bpf_printk(fmt, ...) \
({ \
static const char ____fmt[] = fmt; \
bpf_trace_printk(____fmt, sizeof(____fmt), \
##__VA_ARGS__); \
})
char[]
becomes static const char[]
. static const
modifiers ensure that
Clang puts ____fmt
variable into read-only .rodata
ELF section and libbpf
will take care of wiring everything up during BPF application loading time.
When bpf_printk()
needs to print something, underlying BPF code will just
need to fetch an address of ____fmt
in .rodata
BPF map. This is fast and
efficient, compared to filling out a potentially big char array every single
time.
That’s all. In short, if your BPF application runs on Linux 5.2 (or newer) you should always prefer this implementation over the one in libbpf.
Newline behavior changes
Up until Linux 5.9, bpf_trace_printk()
would take format string and use it as
is. So if you forgot (or chose not to) add \n
to your format string, you'd
get a mess in trace_pipe
output. bpf_printk("Hello, world!")
executed few
times would result in:
<...>-179528 [065] .... 1863682.484368: 0: Hello, world! <...>-179528 [065] .... 1863682.484381: 0: Hello, world! <...>-179528 [065] .... 1863683.484447: 0: Hello, world!
Starting with ac5a72ea5c89 ("bpf: Use dedicated bpf_trace_printk event instead
of trace_printk()")
(went into upstream Linux 5.9), bpf_trace_printk()
will now always append
newline at the end, so for bpf_printk("Hello, world!");
you'll see a tidy
output:
<...>-200501 [001] .... 1863840.478848: 0: Hello, world!
<...>-200501 [002] .... 1863841.478916: 0: Hello, world!
<...>-200501 [002] .... 1863842.478991: 0: Hello, world!
Which is great, but if you were careful (as you should have) before and added
\n
at the end of your format string, bpf_printk("Hello, world!\n")
on
kernels before Linux 5.9 would result in a nice output like above. But starting
from Linux 5.9, you'll get an annoyingly sparse and wasteful output:
<...>-3658431 [048] d... 5362570.510814: bpf_trace_printk: Hello, world!
<...>-3658431 [048] d... 5362571.510933: bpf_trace_printk: Hello, world!
<...>-3658431 [048] d... 5362572.511048: bpf_trace_printk: Hello, world!
While not the end of the world, it would be great to have consistent behavior
and not care about kernel version differences in handling that pesky \n
,
wouldn’t it?
The good news is that with the help of BPF CO-RE we can transparently detect
and accommodate such kernel differences. If you look at the commit
ac5a72ea5c89
mentioned above you'll see that it adds a new kernel tracepoint
bpf_trace_printk
and cleverly uses it to emit data to
/sys/kernel/debug/tracing/trace_pipe
. Note also that each tracepoint in the
kernel has a corresponding struct trace_event_raw_<tracepointname>
type. We
are going to use the existence of struct trace_event_raw_bpf_trace_printk
to
detect whether a newline is added by bpf_trace_printk()
or not. If not, we'll
make sure to add a newline silently and transparently in our own bpf_printk()
macro. Let’s see how all that is put together:
[1] #include <bpf/bpf_core_read.h>
/* define our own struct definition if our vmlinux.h is outdated */
[2] struct trace_event_raw_bpf_trace_printk___x {};
#undef bpf_printk
#define bpf_printk(fmt, ...) \
({ \
[3] static char ____fmt[] = fmt "\0"; \
[4] if (bpf_core_type_exists(struct trace_event_raw_bpf_trace_printk___x)) {\
[5] bpf_trace_printk(____fmt, sizeof(____fmt) - 1, ##__VA_ARGS__); \
} else { \
[6] ____fmt[sizeof(____fmt) - 2] = '\n'; \
[7] bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
} \
})
Let's break it down a bit.
[1]
includes libbpf's
bpf_core_read.h
header which defines all the BPF CO-RE macros.
[2]
defines our own local minimal (empty) definition of bpf_trace_printk
tracepoint struct to avoid dependency on having latest vmlinux.h
. This is
important for cases when vmlinux.h
header might be slightly out of date being
generated from kernel BTF before Linux 5.9. Adding ___x
suffix makes sure
that it won't conflict with the definition in up-to-date vmlinux.h
. Libbpf
and BPF CO-RE will ignore ___
and everything after it, so this will still
match the actual struct trace_event_raw_bpf_trace_printk
in the kernel. If
you are sure your vmlinux.h
is recent enough, you can just skip this step.
[3]
has two changes. We dropped the const
modifier because we are going to
modify this string at runtime (on older kernels) so it has to be allocated in
the writable .data
ELF section and corresponding BPF map. We also appended
extra \0
at the end to reserve a space for \n
characters, if we happen to
have a need for it. Replacing an existing character is way simpler than
appending one at runtime, so that’s what we are doing here.
[4]
is BPF CO-RE-based detection of tracepoint presence. If a specified
struct exists in the kernel bpf_core_type_exists()
evaluates to 1, otherwise
0 is substituted.
[5]
is the case of Linux 5.9+, so we don't need to add a newline. The only
thing we should be careful about is to not pass two \0
s in a format string,
as some kernels will reject this at runtime (and you won’t see any output in
the trace_pipe
file). That's why sizeof(____fmt) - 1
is specified as the
size of the format string, skipping the implicit '\0' added by the compiler
when allocating a string.
[6]
-[7]
is the case of older Linux, so we'll have to replace explicitly
reserved \0
with \n
to ensure that we'll get properly wrapped output. We
pass full ____fmt
size to bpf_trace_printk()
, including implicit \0
.
With this, bpf_printk("Hello, world!")
will always have a newline emitted at
the end, without callers having to care about kernel version. You just need to
make sure that you always pass format strings without explicit ‘\n’.
Detecting full-powered bpf_trace_printk()
In (upcoming) Linux 5.13 release bpf_trace_printk()
implementation got a
really nice boost in capabilities thanks to Florent
Revest's work in d9c9e4db186a ("bpf:
Factorize bpf_trace_printk and
bpf_seq_printf").
Previously, bpf_trace_printk()
allowed the use of only one string (%s
)
argument, which was quite limiting. Linux 5.13 release lifts this restriction
and allows multiple string arguments, as long as total formatted output doesn't
exceed 512 bytes. Another annoying restriction was the lack of support for
width specifiers, like %10d
or %-20s
. This restriction is gone now as well.
Here's the list of other great improvements (from the above commit’s
description):
- bpf_trace_printk always expected fmt[fmt_size] to be the terminating NULL character, this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formatting fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
- bpf_trace_printk now supports the %ps and %pS specifiers to print symbols.
This means that on recent enough kernels you can do quite a lot more with
bpf_trace_printk()
. But if you want to support older kernels, you'll need to
have a fallback to a simpler logic. The question is whether it's possible to
reliably detect whether more powerful bpf_trace_printk()
behavior can be
expected.
BPF CO-RE and libbpf can actually help with this nicely. One way would be to
check for the upstream Linux version explicitly by using the extern int LINUX_KERNEL_VERSION __kconfig;
variable, but that's not very reliable in the
presence of backports in Linux kernels. For such backported features Linux
kernel version doesn't correspond to the included features in the kernel. So
it’s always better to detect desired functionality support directly, if
possible.
It so happens that bpf_trace_printk()
refactoring coincides with adding a new
BPF helper, bpf_snprintf()
, for which those refactorings and improvements
were done in the first place. So instead of relying on kernel version checks we
are going to detect the support for bpf_snprintf()
helper.
Each BPF helper has a corresponding BPF_FUNC_<helpername>
enum value in enum
bpf_func_id.
So by checking if a given enum value is present in vmlinux BTF, it's possible
to determine the presence of the corresponding BPF helper. Let's see how we can
do this in code:
/* don't rely on up-to-date vmlinux.h */
[1] enum bpf_func_id___x { BPF_FUNC_snprintf___x = 42 /* avoid zero */ };
[2] #define printk_is_powerful \
(bpf_core_enum_value_exists(enum bpf_func_id___x, BPF_FUNC_snprintf___x))
...
const char power[] = "POWER";
int pid = bpf_get_current_pid_tgid() >> 32;
if (printk_is_powerful)
[4] bpf_printk("I've got the =%%= %7s, %s, %-7s =%%=!", power, power, power);
else
[5] bpf_printk("Sorry, NO %s! :( But my PID is %d", power, pid);
[1]
defines our own minimal definition of enum bpf_func_id
and
BPF_FUNC_snprintf
enum value within it. This is, again, to avoid depending on
having the most up-to-date vmlinux.h
, so feel free to skip this if this
doesn't concern you. Notice the use of ___x
suffix both on enum and enum
value, in both cases ___x
suffix is going to be ignored by libbpf. The actual
value, 42
, doesn't matter as well, but it's a good idea to avoid using zero
(default value, unless explicitly specified) due to some older versions of
Clang having problems with it.
[2]
uses bpf_core_enum_value_exists()
to detect the presence of
BPF_FUNC_snprintf
enum value in the running kernel. It's similar to
previously used bpf_core_type_exists()
except applicable for enums. It will
evaluate to 1 if an enum value exists, otherwise 0 will be returned.
[4]
handles the case of having a more feature-rich bpf_trace_printk()
implementation and shows off using 3 string arguments with some fancier
formatting. Also, just for fun, it makes use of %%
escaping.
[5]
is a fallback case using more primitive and restricted formatting.
And that's all. If you are running on Linux 5.13+, you should see:
minimal-2167 [002] d..5 20804.858999: bpf_trace_printk: I've got the =%= POWER, POWER, POWER =%=!
minimal-2167 [002] d..5 20805.859180: bpf_trace_printk: I've got the =%= POWER, POWER, POWER =%=!
On older kernels you'll get:
<...>-3998551 [008] d... 5367146.858854: bpf_trace_printk: Sorry, NO POWER! :( But my PID is 3998551
<...>-3998551 [008] d... 5367147.858990: bpf_trace_printk: Sorry, NO POWER! :( But my PID is 3998551
Conclusion
bpf_trace_printk()
(or rather, practically speaking, bpf_printk()
wrapper)
is an extremely useful instrument that eases debugging BPF applications
immensely. It allows you to dump a lot of useful information from the BPF side
of your BPF application and watch it through the trace_pipe
file. There are,
unfortunately, inconveniences associated with gradual changes in behavior and
capabilities of bpf_trace_printk()
, but hopefully this blog post showed how
that can be abstracted away reasonably well and transparently enough by a
careful use of BPF CO-RE and other libbpf capabilities (e.g., BPF static
variables). Hopefully this information will save you some time in the future
and will let you get more out of your BPF applications.
The logical continuation of bpf_trace_printk()
evolution is the support for
passing in more than 3 input arguments, similarly to how modern printf()
-like
BPF helpers, bpf_seq_printf()
and bpf_snprintf()
, do this. This,
undoubtedly, will be added very soon, so keep an eye out on
bpf@vger.kernel.org
mailing
list.