BPF ring buffer

Categories: BPF

Tags: bpf



There is now a new BPF data structure available: the BPF ring buffer. It solves the memory efficiency and event re-ordering problems of the BPF perf buffer (the de facto standard today for sending data from kernel to user-space) while meeting or beating its performance. It provides both a perfbuf-compatible API for easy migration and a new reserve/submit API with better usability. Both synthetic and real-world benchmarks show that it performs at least as well in almost all cases, so consider making it the default choice for sending data from a BPF program to user-space.

BPF ringbuf vs BPF perfbuf

Today, whenever a BPF program needs to send collected data to user-space for post-processing and logging, it usually uses the BPF perf buffer (perfbuf). Perfbuf is a collection of per-CPU circular buffers, which allow efficient data exchange between kernel and user-space. It works great in practice, but due to its per-CPU design it has two major shortcomings that prove to be inconvenient in practice: inefficient use of memory and event re-ordering.

To address these issues, starting from Linux 5.8, BPF provides a new BPF data structure (BPF map): BPF ring buffer (ringbuf). It is a multi-producer, single-consumer (MPSC) queue and can be safely shared across multiple CPUs simultaneously.

BPF ringbuf supports the familiar functionality of BPF perfbuf:

  • variable-length data records;
  • ability to efficiently read data from user-space through memory-mapped region without extra memory copying and/or syscalls into the kernel;
  • epoll notification support, as well as the ability to busy-loop for absolutely minimal latency.

At the same time, BPF ringbuf solves the following issues with BPF perfbuf:

  • memory overhead;
  • data ordering;
  • wasted work and extra data copying.

Memory overhead

BPF perfbuf allocates a separate buffer for each CPU. This often means that BPF developers have to make a trade-off between allocating per-CPU buffers big enough to accommodate possible spikes of emitted data, or being memory-efficient (not wasting memory on mostly empty buffers in a steady state, but dropping data during spikes). This is especially tricky for applications that are mostly idle but go through periodic big influxes of events produced in a short period of time. It's quite hard to find just the right balance, so BPF applications typically either over-allocate perfbuf memory to be on the safe side, or suffer inevitable data drops from time to time.

Being shared across all CPUs, BPF ringbuf allows using one big common buffer to deal with this. A bigger buffer can absorb bigger spikes, but might also allow using less RAM overall compared to BPF perfbuf. BPF ringbuf memory usage also scales better with an increasing number of CPUs: going from 16 to 32 CPUs doesn't necessarily require a buffer twice as big to accommodate the extra load. With BPF perfbuf you have little choice in this matter, unfortunately, due to its per-CPU buffers.
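
As a purely hypothetical illustration (not a measurement): on a 64-CPU server, sizing BPF perfbuf at 512 KB per CPU commits 32 MB of RAM even when most CPUs sit mostly idle, whereas a single shared BPF ringbuf of, say, 4 MB can often absorb the same localized spikes, because a burst on one CPU can use the headroom left unused by the others. The right numbers depend entirely on your workload and have to be validated experimentally.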

Event ordering

If a BPF application has to track correlated events (e.g., process start and exit, network connection lifetime events, etc), proper ordering of events becomes critical. This is problematic with BPF perfbuf, though. If correlated events happen in rapid succession (within a few milliseconds) on different CPUs, they might get delivered out of order. This is, again, due to the per-CPU nature of BPF perfbuf.

As a real-world example, an application written by me a few years ago had to track process fork/exec/exit events and collect per-process resource usage stats. My application had to emit events into BPF perfbuf for each of those, but quite often they arrived out of order. This is because fork(), exec(), and exit() can happen in very rapid succession on different CPUs for short-lived processes, due to the kernel scheduler migrating them from one CPU to another. Dealing with that required a significant increase in the complexity of the application's handling logic, much more than one would expect given the otherwise pretty straightforward nature of the problem.

This wouldn't be an issue at all with BPF ringbuf, which solves the problem by emitting events into a shared buffer and guaranteeing that if event A was submitted before event B, it will also be consumed before event B. This often noticeably simplifies handling logic.

Wasted work and extra data copying

With BPF perfbuf, a BPF program has to prepare a data sample before copying it into the perf buffer to send it to user-space. This means that the same data has to be copied twice: first into a local variable or a per-CPU array (for big samples that can't fit on the small BPF stack), and then into the perfbuf itself. What's worse, all this work can be wasted if it turns out that the perfbuf doesn't have enough space left in it.

BPF ringbuf supports an alternative reservation/submit API to avoid this. It's possible to first reserve the space for the data. If the reservation succeeds, the BPF program can use that memory directly for preparing the data sample. Once done, submitting the data to user-space is an extremely efficient operation that can't fail and doesn't perform any extra memory copies at all. If the reservation fails due to running out of space in the buffer, at least you know this before you've spent all the work of recording the data, only to have it dropped on the floor. The ringbuf-reserve-submit example below shows what that looks like in practice.

Performance and applicability

BPF ringbuf performance beats BPF perfbuf for all practical purposes (especially given the somewhat suboptimal default settings in BCC/libbpf for consuming perfbuf data). If you like hard numbers, you can find extensive synthetic benchmark results for various scenarios in this patch.

BPF perfbuf can, theoretically, support higher data throughput due to its per-CPU buffers, but this becomes relevant only when we are talking about millions of events per second. Experiments with writing a real-world high-throughput application confirmed that BPF ringbuf is still a more performant replacement for BPF perfbuf if used (similarly to BPF perfbuf) as a per-CPU buffer, especially when employing manual data availability notification. You can check out the basic multi-ringbuf example (BPF side, user-space side) in one of the kernel selftests; a user-space sketch of such a setup follows below. We'll look at an example of manual control over data availability notification a bit later.
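
Here is a minimal user-space sketch of such a multi-ringbuf setup, loosely modeled on the selftest rather than copied from it. It assumes two hypothetical BPF-side ringbuf maps named rb_cpu0 and rb_cpu1, with handle_event being the per-sample callback shown later in this post; it relies on libbpf's ring_buffer__add() API, which attaches additional ringbuf maps to an already created ring_buffer manager so that one poll loop services them all:

	struct ring_buffer *rb;

	/* create the manager with the first ringbuf map */
	rb = ring_buffer__new(bpf_map__fd(skel->maps.rb_cpu0), handle_event, NULL, NULL);
	if (!rb)
		goto cleanup;
	/* attach the second ringbuf map to the same epoll-based manager */
	if (ring_buffer__add(rb, bpf_map__fd(skel->maps.rb_cpu1), handle_event, NULL))
		goto cleanup;

	/* a single poll loop consumes data from all registered ring buffers */
	while (!exiting)
		ring_buffer__poll(rb, 100 /* timeout, ms */);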

The only case where you might need to be careful and experiment first is when a BPF program has to run from NMI (non-maskable interrupt) context (e.g., for handling perf events, such as cpu-cycles). BPF ringbuf internally uses a very lightweight spin-lock, which means that data reservation might fail if the lock is contended in NMI context. So in NMI context, if contention on the lock is high, there might be some data drops even when the ringbuf itself still has space available.

In all other cases it is a pretty clear choice in favor of the new BPF ringbuf: it provides better performance and memory efficiency, better ordering guarantees, and a better API (both kernel-side and in user-space).

Show me the code!

To show the BPF ringbuf APIs, compare them to the BPF perfbuf ones, and see how they would typically be used in practice, I wrote a small bpf-ringbuf-examples project, which we'll go over in the rest of this post.

This repo implements three variants of the same BPF application, which traces all process execs, capturing the spawning of new processes. For each exec(), the process ID (pid), process name (comm), and executable file path (filename) are captured into a sample and sent to user-space for post-processing (in our demo, just printf()'ing it all to stdout). Here's what the output from all three examples should look like (don't forget to run them with sudo):

$ sudo ./ringbuf-reserve-submit    # or ./ringbuf-output, or ./perfbuf-output
TIME     EVENT PID     COMM             FILENAME
19:17:39 EXEC  3232062 sh               /bin/sh
19:17:39 EXEC  3232062 timeout          /usr/bin/timeout
19:17:39 EXEC  3232063 ipmitool         /usr/bin/ipmitool
19:17:39 EXEC  3232065 env              /usr/bin/env
19:17:39 EXEC  3232066 env              /usr/bin/env
19:17:39 EXEC  3232065 timeout          /bin/timeout
19:17:39 EXEC  3232066 timeout          /bin/timeout
19:17:39 EXEC  3232067 sh               /bin/sh
19:17:39 EXEC  3232068 sh               /bin/sh
^C

Here's the C struct definition of the sample data sent from the BPF program and consumed by the user-space part of the application:

#define TASK_COMM_LEN 16
#define MAX_FILENAME_LEN 512

/* definition of a sample sent to user-space from BPF program */
struct event {
	int pid;
	char comm[TASK_COMM_LEN];
	char filename[MAX_FILENAME_LEN];
};

BPF perfbuf: bpf_perf_event_output()

Let's start with the BPF perfbuf use case, the BPF part of which can be found here.

First, we include some basic BPF definitions from kernel's <linux/bpf.h>, BPF helper definitions from libbpf's <bpf/bpf_helpers.h>, and our application's types from "common.h", which is shared between BPF and user-space code. We also specify that our program is under the dual GPL-2.0/BSD-3 license:

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Andrii Nakryiko */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "common.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Next, we define the BPF perfbuf itself as a BPF_MAP_TYPE_PERF_EVENT_ARRAY map. There is no need to define the max_entries property, because libbpf will take care of it and automatically size it to the number of CPUs available on the system. The size of each per-CPU buffer is specified separately from user-space, which we'll get to in a second.

/* BPF perfbuf map */
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
} pb SEC(".maps");

Because the sample (struct event defined in common.h) is pretty big (>512 bytes, as I set the maximum captured size of a filename to 512 bytes), we can't prepare the data on the stack. So we use a single-element per-CPU array as a temporary storage:

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct event);
} heap SEC(".maps");

Now we define the BPF program itself and specify that it should be attached to the sched:sched_process_exec tracepoint, which is triggered on each successful exec() syscall. struct trace_event_raw_sched_process_exec is also defined in common.h and is just copied from the Linux sources. It defines the layout of the input data for that particular tracepoint.

SEC("tp/sched/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
{
	unsigned fname_off = ctx->__data_loc_filename & 0xFFFF;
	struct event *e;
	int zero = 0;

	e = bpf_map_lookup_elem(&heap, &zero);
	if (!e) /* can't happen */
		return 0;

	e->pid = bpf_get_current_pid_tgid() >> 32;
	bpf_get_current_comm(&e->comm, sizeof(e->comm));
	bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);

	bpf_perf_event_output(ctx, &pb, BPF_F_CURRENT_CPU, e, sizeof(*e));
	return 0;
}

The BPF program logic is pretty straightforward. It fetches the temporary storage for our sample and fills it out with data from the tracepoint context. Once done, it sends the sample down to the BPF perfbuf with the bpf_perf_event_output() call. This API reserves space for struct event in the perf buffer on the current CPU, copies sizeof(*e) bytes of data from e into that reserved space, and, when done, signals user-space that new data is available. At that point the epoll subsystem wakes up the user-space handler and passes it a pointer to that copy of the data for processing.

And that's literally the entirety of the BPF side of things. Pretty straightforward, if you have ever seen any other modern BPF application.

Let's skim through the user-space side now. It relies on a BPF skeleton (you can read more about it here), so it's pretty short and straightforward. After doing a minimal initial setup (setting a libbpf logging handler, an interrupt handler, and bumping the RLIMIT_MEMLOCK limit for the BPF system), it just opens and loads the BPF skeleton. If all that succeeds, we then use libbpf's user-space perf_buffer__new() API to create an instance of the perf buffer consumer:

	struct perf_buffer *pb = NULL;
	struct perf_buffer_opts pb_opts = {};
	struct perfbuf_output_bpf *skel;

	...

	/* Set up ring buffer polling */
	pb_opts.sample_cb = handle_event;
	pb = perf_buffer__new(bpf_map__fd(skel->maps.pb), 8 /* 32KB per CPU */, &pb_opts);
	if (libbpf_get_error(pb)) {
		err = -1;
		fprintf(stderr, "Failed to create perf buffer\n");
		goto cleanup;
	}

Here we specify that each per-CPU buffer is 32KB (8 pages x 4096 bytes per page). For each submitted sample, libbpf will call our handle_event() callback, which just outputs the data with printf():

void handle_event(void *ctx, int cpu, void *data, unsigned int data_sz)
{
	const struct event *e = data;
	struct tm *tm;
	char ts[32];
	time_t t;

	time(&t);
	tm = localtime(&t);
	strftime(ts, sizeof(ts), "%H:%M:%S", tm);

	printf("%-8s %-5s %-7d %-16s %s\n", ts, "EXEC", e->pid, e->comm, e->filename);
}

The last step is just to continuously consume data as soon as it becomes available, until it's time to exit (e.g., when the user presses Ctrl-C):

	/* Process events */
	printf("%-8s %-5s %-7s %-16s %s\n",
	       "TIME", "EVENT", "PID", "COMM", "FILENAME");
	while (!exiting) {
		err = perf_buffer__poll(pb, 100 /* timeout, ms */);
		/* Ctrl-C will cause -EINTR */
		if (err == -EINTR) {
			err = 0;
			break;
		}
		if (err < 0) {
			printf("Error polling perf buffer: %d\n", err);
			break;
		}
	}
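
For completeness, the teardown path after the loop typically looks something like the sketch below. This is a sketch rather than a verbatim copy from the repo; the skeleton destroy function name follows libbpf's <name>_bpf__destroy() convention for the perfbuf_output_bpf skeleton used above:

cleanup:
	/* frees the per-CPU buffers and the epoll instance backing them */
	perf_buffer__free(pb);
	/* detaches the BPF program and frees all skeleton-managed objects */
	perfbuf_output_bpf__destroy(skel);
	return err < 0 ? -err : 0;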

BPF ringbuf: bpf_ringbuf_output()

BPF ringbuf's bpf_ringbuf_output() API is designed to follow the semantics of BPF perfbuf's bpf_perf_event_output() to make the migration a complete no-brainer. To show just how similar the usability is, I'll literally show the diff between the perfbuf-output and ringbuf-output examples. You can check out the full BPF-side code and user-space code on GitHub.

Here's the BPF side diff:

--- src/perfbuf-output.bpf.c	2020-10-25 18:52:22.247019800 -0700
+++ src/ringbuf-output.bpf.c	2020-10-25 18:44:14.510630322 -0700
@@ -6,12 +6,11 @@
 
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
 
-/* BPF perfbuf map */
+/* BPF ringbuf map */
 struct {
-	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
-	__uint(key_size, sizeof(int));
-	__uint(value_size, sizeof(int));
-} pb SEC(".maps");
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 256 * 1024 /* 256 KB */);
+} rb SEC(".maps");
 
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
@@ -35,7 +34,7 @@
 	bpf_get_current_comm(&e->comm, sizeof(e->comm));
 	bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
 
-	bpf_perf_event_output(ctx, &pb, BPF_F_CURRENT_CPU, e, sizeof(*e));
+	bpf_ringbuf_output(&rb, e, sizeof(*e), 0);
 	return 0;
 }
 

So just two simple changes:

  1. The BPF ringbuf map is defined slightly differently. Its size (now the size of a single buffer shared across all CPUs) can be defined on the BPF side. Keep in mind that it's still possible to omit it in the BPF-side definition and specify it (or override the BPF-side value) from user-space with the bpf_map__set_max_entries() API, as sketched right after this list. The other difference is that the size (the max_entries property) is specified in bytes, with the only restriction being that it has to be a multiple of the kernel page size (almost always 4096 bytes) and a power of 2 (similar to perfbuf's page count, which also has to be a power of 2). The BPF perfbuf size, in contrast, is specified from the user-space side and in a number of memory pages.

  2. bpf_perf_event_output() is replaced with the very similar bpf_ringbuf_output(), the only difference being that the ringbuf API doesn't need a reference to the BPF program's context.

Those are the only two differences on the BPF side.
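
If you do want to size (or resize) the ring buffer from user-space, here's a minimal sketch of how that could look. It assumes the ringbuf-output example's skeleton is named ringbuf_output_bpf (mirroring the perfbuf_output_bpf naming above); the override has to happen after opening, but before loading, the skeleton:

	struct ringbuf_output_bpf *skel;

	skel = ringbuf_output_bpf__open();
	if (!skel)
		return 1;

	/* override the BPF-side size before the map is actually created;
	 * the value is in bytes: a power of 2 and a multiple of page size */
	if (bpf_map__set_max_entries(skel->maps.rb, 512 * 1024 /* 512 KB */))
		return 1;

	if (ringbuf_output_bpf__load(skel))
		return 1;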

On the user-space side the changes are also minimal. Ignoring the perf_buffer <-> ring_buffer renames, it comes down to just two changes. First, the event handler callback can now return an error (which will terminate the consumer loop) and it doesn't take the index of the CPU that produced the event:

-void handle_event(void *ctx, int cpu, void *data, unsigned int data_sz)
+int handle_event(void *ctx, void *data, size_t data_sz)
{
	const struct event *e = data;
	struct tm *tm;

If knowing the CPU index is important, you'll have to record it in your sample explicitly on the BPF side. Also, the ring_buffer API doesn't provide a callback for lost samples, which perf_buffer does. That too would need to be handled explicitly from the BPF side, if necessary. This was done to minimize lock contention in the shared (across CPUs) ring buffer and to not pay the price when it's not needed. In practice, there is rarely much one can do about lost samples beyond reporting them, which can be done more efficiently and conveniently with explicit BPF code, as sketched below.
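
As a sketch of what such explicit handling could look like (this is not part of the example project), the BPF side can bump a global counter whenever submitting a sample fails, and user-space can read that counter through the skeleton:

/* global drop counter, readable from user-space via the skeleton (e.g. skel->bss) */
__u64 dropped_events = 0;

	/* ... inside handle_exec(), after filling out the sample e ... */

	/* bpf_ringbuf_output() returns 0 on success and a negative error
	 * (e.g. when the ring buffer is full) otherwise */
	if (bpf_ringbuf_output(&rb, e, sizeof(*e), 0))
		__sync_fetch_and_add(&dropped_events, 1);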

The second difference is a slightly simpler ring_buffer__new() API, which allows specifying the callback without resorting to an extra options struct:

 	/* Set up ring buffer polling */
-	pb_opts.sample_cb = handle_event;
-	pb = perf_buffer__new(bpf_map__fd(skel->maps.pb), 8 /* 32KB per CPU */, &pb_opts);
-	if (libbpf_get_error(pb)) {
+	rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
+	if (!rb) {
 		err = -1;
-		fprintf(stderr, "Failed to create perf buffer\n");
+		fprintf(stderr, "Failed to create ring buffer\n");
 		goto cleanup;
 	}

With that, simply replacing perf_buffer__poll() with ring_buffer__poll() lets you consume ring buffer data in exactly the same way:

 	printf("%-8s %-5s %-7s %-16s %s\n",
 	       "TIME", "EVENT", "PID", "COMM", "FILENAME");
 	while (!exiting) {
-		err = perf_buffer__poll(pb, 100 /* timeout, ms */);
+		err = ring_buffer__poll(rb, 100 /* timeout, ms */);
 		/* Ctrl-C will cause -EINTR */
 		if (err == -EINTR) {
 			err = 0;
 			break;
 		}
 		if (err < 0) {
-			printf("Error polling perf buffer: %d\n", err);
+			printf("Error polling ring buffer: %d\n", err);
 			break;
 		}
 	}
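
As a side note, if you are after the busy-looping, minimal-latency mode mentioned earlier, libbpf also provides ring_buffer__consume(), which processes whatever data is already available without ever blocking. A simple (and deliberately CPU-hungry) sketch:

	/* busy-loop variant: trades a fully loaded CPU for minimal latency */
	while (!exiting) {
		err = ring_buffer__consume(rb);
		if (err < 0) {
			printf("Error consuming ring buffer: %d\n", err);
			break;
		}
	}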

BPF ringbuf: reserve/submit API

The goal of the bpf_ringbuf_output() API was to allow a smooth transition from BPF perfbuf to BPF ringbuf without any substantial changes to BPF code. But it also means that it shares some of the shortcomings of the BPF perfbuf API: an extra memory copy and very late data reservation. The former means that you need extra space to construct a sample before copying it into the buffer, which is both less efficient and often requires the extra complexity of single-element per-CPU arrays. The latter means that all the work of constructing a sample can be wasted if there is no space left in the buffer, either because user-space is lagging behind or because a quick burst of incoming events overflowed the buffer. If you knew up front that the data would be dropped anyway, you could skip collecting it in the first place and save resources, letting the consumer side catch up faster. But that's not possible with the xxx_output()-style APIs.

That's where the bpf_ringbuf_reserve()/bpf_ringbuf_submit() APIs come in handy. Reserve lets you do just that: reserve the space early on, or find out that it's not possible (getting NULL back in that case). If we can't get enough space for the sample, we skip spending the resources to capture the data. But if the reservation succeeds, then we have a guarantee that, once we are done with data collection, publishing it to user-space will never fail. I.e., if bpf_ringbuf_reserve() returns a non-NULL pointer, a subsequent bpf_ringbuf_submit() will always succeed.

Further, the reserved space in the ring buffer is not visible to user-space until it is submitted, so it can be freely used to construct the sample, however complicated and multi-step that operation is. This obviates the need for extra memory copying and temporary storage. The only restriction is that the size of the reservation has to be known to the BPF verifier at verification time, so samples with a dynamic size have to go through bpf_ringbuf_output() and pay the price of the extra copy.

But in most cases, reserve/submit is what you should prefer. Here's the diff for BPF program code (full BPF and user-space code is on Github as well):

--- src/ringbuf-output.bpf.c	2020-10-25 18:44:14.510630322 -0700
+++ src/ringbuf-reserve-submit.bpf.c	2020-10-25 18:36:53.409470270 -0700
@@ -12,29 +12,21 @@
 	__uint(max_entries, 256 * 1024 /* 256 KB */);
 } rb SEC(".maps");
 
-struct {
-	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
-	__uint(max_entries, 1);
-	__type(key, int);
-	__type(value, struct event);
-} heap SEC(".maps");
-
 SEC("tp/sched/sched_process_exec")
 int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
 {
 	unsigned fname_off = ctx->__data_loc_filename & 0xFFFF;
 	struct event *e;
-	int zero = 0;
 	
-	e = bpf_map_lookup_elem(&heap, &zero);
-	if (!e) /* can't happen */
+	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
+	if (!e)
 		return 0;
 
 	e->pid = bpf_get_current_pid_tgid() >> 32;
 	bpf_get_current_comm(&e->comm, sizeof(e->comm));
 	bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
 
-	bpf_ringbuf_output(&rb, e, sizeof(*e), 0);
+	bpf_ringbuf_submit(e, 0);
 	return 0;
 }
 

Gone is the per-CPU array; instead we use the result of bpf_ringbuf_reserve() to fill the sample with data directly in the ring buffer.

The user-space part is exactly the same (modulo the name of the BPF skeleton object), which makes sense because in the end you are consuming exactly the same data from the BPF ring buffer.

BPF ringbuf: data notification control

When dealing with high-throughput cases, often the biggest overhead comes from the in-kernel signalling of data availability when a sample is submitted (this is what lets the kernel's poll/epoll system wake up user-space handlers blocked waiting for new data). This is true for both perfbuf and ringbuf.

Perfbuf handles this with the ability to set up sampled notifications, in which case only every Nth sample sends a notification. You set that up when creating the BPF perfbuf map from user-space. The catch is that you won't see the last N-1 samples until the Nth sample arrives; this might or might not be a big deal for your particular case.

BPF ringbuf went a different route. bpf_ringbuf_output() and bpf_ringbuf_submit() accept an extra flags argument, where you can specify either the BPF_RB_NO_WAKEUP or the BPF_RB_FORCE_WAKEUP flag. Specifying BPF_RB_NO_WAKEUP inhibits sending the in-kernel data availability notification, while BPF_RB_FORCE_WAKEUP forces sending one. This allows precise manual control, if necessary. To see how that can be done, please check the BPF ringbuf benchmark, which sends notifications only once a configurable amount of data is enqueued in the ring buffer.
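
As a rough sketch of that idea (modeled loosely on the benchmark, not copied from it), the BPF side can combine bpf_ringbuf_query() with the wakeup flags to notify the consumer only once enough unconsumed data has piled up; the 64 KB threshold here is an arbitrary illustration:

	struct event *e;
	__u64 flags = BPF_RB_NO_WAKEUP;

	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	/* ... fill out the sample as before ... */

	/* only wake up user-space once at least ~64 KB of data is pending;
	 * otherwise submit silently and let a later submission deliver the wakeup */
	if (bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA) >= 64 * 1024)
		flags = BPF_RB_FORCE_WAKEUP;

	bpf_ringbuf_submit(e, flags);
	return 0;

With manual control like this, the user-space side should keep a bounded poll timeout (as in the examples above) so that samples submitted below the threshold are still picked up eventually.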

By default, if no flag is specified, the BPF ringbuf code does adaptive notification depending on whether the user-space consumer is lagging behind or not. As a result, the consumer never misses a single sample notification, but also doesn't pay unnecessary overhead; no flag is a good and safe default. But if you need extra performance, manually controlling data notifications based on your own criteria (e.g., the amount of enqueued data in the buffer) might give you a big boost.

Conclusion

This post explains the problems the BPF ring buffer addresses and the motivation behind its API choices. Hopefully, the code samples and the extra links to kernel selftests and benchmarks will give you a good grasp of the BPF ringbuf APIs and demonstrate both simple and more advanced uses that meet the needs of your application.