Hello eBPF: Auto Layouting Structs (7)

Welcome back to my series on ebpf. In the last blog post, we learned how to use ring buffers with libbpf for efficient communication. This week, we’re looking into the memory layout and alignment of structs transferred between the kernel and user-land.

Alignment is essential; it specifies how the compiler layouts the structs and variables and where to put the data in memory. Take, for example, the struct that we defined in the previous blog post in the RingSample:

#define FILE_NAME_LEN 256
#define TASK_COMM_LEN  16
                
// Structure to store the data that we want to pass to user
struct event {
  u32 e_pid;
  char e_filename[FILE_NAME_LEN];
  char e_comm[TASK_COMM_LEN];
};

Struct Example

Using Pahole in the Compiler Explorer, we can see the memory layout on amd64:

struct event {
	unsigned int               e_pid;                /*     0     4 */
	char                       e_filename[256];      /*     4   256 */
	/* --- cacheline 4 boundary (256 bytes) was 4 bytes ago --- */
	char                       e_comm[16];           /*   260    16 */

	/* size: 276, cachelines: 5, members: 3 */
	/* last cacheline: 20 bytes */
};

This means that the know also knows how to transform member accesses to this struct and can adequately place the event in the allocated memory:

You’ve actually seen the layouting information before, as the hello-ebpf project requires you to hand layout all structs manually:

record Event(@Unsigned int pid,
             @Size(FILE_NAME_LEN) String filename,
             @Size(TASK_COMM_LEN) String comm) {}

// define the event records layout
private static final BPFStructType<Event> eventType =
        new BPFStructType<>("rb", List.of(
        new BPFStructMember<>("e_pid",
                BPFIntType.UINT32, 0, Event::pid),
        new BPFStructMember<>("e_filename",
                new StringType(FILE_NAME_LEN),
                4, Event::filename),
        new BPFStructMember<>("e_comm",
                new StringType(TASK_COMM_LEN),
                4 + FILE_NAME_LEN, Event::comm)
   ), new AnnotatedClass(Event.class, List.of()),
   fields -> new Event((int)fields.get(0),
       (String)fields.get(1), (String)fields.get(2)));

eBPF is agnostic regarding alignment, as the compiler on your system compiles the eBPF and the C code, so the compiler can decide how to align everything.

Alignment Rules

But where do these alignment rules come from? They come from how your CPU works. Your CPU usually only allows/is optimized for certain types of accesses. So, for example, x86 CPUs are optimized for accessing 32-bit integers that lay at addresses in memory that are a multiple of four. The rules are defined in the Application Binary Interface (ABI). The alignment rules for x86 (64-bit) on Linux are specified in the System V ABI Specification:

And more, but in general, scalar types are aligned by their size. Structs, unions, and arrays are, on the other hand, aligned based on their members:

Structures and unions assume the alignment of their most strictly aligned component. Each member is assigned to the lowest available offset with the appropriate alignment. The size of any object is always a multiple of the object‘s alignment.

An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes.

Structure and union objects can require padding to meet size and alignment constraints. The contents of any padding is undefined.

System V Application Binary Interface
AMD64 Architecture Processor Supplement
Draft Version 0.99.6

ARM 64-but has the same scalar alignments and struct alignment rules (see Procedure Call Standard for the Arm® 64-bit Architecture (AArch64)); we can therefore use the same layouting algorithm for both CPU architectures.

We can formulate the algorithm for structs as follows:

struct_alignment = 1
current_position = 0
for member in struct:
  # compute the position of the member
  # that is properly aligned
  # this introduces padding (empty space between members)
  # if there are alignment issues
  current_position = \
    math.ceil(current_position / alignment) * member.alignment
  member.position = current_position
  # the next position has to be after the current member
  current_position += member.size
  # the struct alignment is the maximum of all alignments
  struct_alignment = max(struct_alignment, member.alignment)

With this at hand, we can look at a slightly more complex example:

Struct Example with Padding

The compiler, at times, has to create an unused memory section between two members to satisfy the individual alignments. This can be seen in the following example:

struct padded_event {
  char c;  // single byte char, alignment of 1
  long l;  // alignment of 8
  int i;   // alignment of 4
  void* x; // alignment of 8
};

Using Pahole again in the Compiler Explorer, we see the layout that the compiler generates:

struct padded_event {
	char                       c;                    /*     0     1 */

	/* XXX 7 bytes hole, try to pack */

	long                       l;                    /*     8     8 */
	int                        i;                    /*    16     4 */

	/* XXX 4 bytes hole, try to pack */

	void *                     x;                    /*    24     8 */

	/* size: 32, cachelines: 1, members: 4 */
	/* sum members: 21, holes: 2, sum holes: 11 */
	/* last cacheline: 32 bytes */
};

Pahole tells us that it had to introduce 11 bytes of padding. We can visualize this as follows:

This means that we’re essentially wasting memory. I recommend reading The Lost Art of Structure Packing by Eric S. Raymond to learn more about this. If we really want to save memory, we could reorder the int with the long member, thereby only needing the padding after the char, leading to an object with 24 bytes and only 3 bytes of padding. This is really important when storing many of these structs in arrays, where the wasted memory accumulates.

But what do we do with this knowledge?

Auto-Layouting in hello-ebpf

The record that we defined in Java before contains all the information to auto-generate the BPFStructType for the class; we just need a little bit of annotation processor magic:

@Type
record Event(@Unsigned int pid,
             @Size(FILE_NAME_LEN) String filename,
             @Size(TASK_COMM_LEN) String comm) {}

This record is processed, and out comes the suitable BPFStructType:

We implemented the auto-layouting in the BPFStructType class to reduce the amount of logic in the annotation processor.

This results in a much cleaner RingSample version, named TypeProcessingSample:

@BPF
public abstract class TypeProcessingSample extends BPFProgram {

    static final String EBPF_PROGRAM = """...""";

    private static final int FILE_NAME_LEN = 256;
    private static final int TASK_COMM_LEN = 16;

    @Type
    record Event(@Unsigned int pid, 
                 @Size(FILE_NAME_LEN) String filename, 
                 @Size(TASK_COMM_LEN) String comm) {}


    public static void main(String[] args) {
        try (TypeProcessingSample program = BPFProgram.load(TypeProcessingSample.class)) {
            program.autoAttachProgram(
              program.getProgramByName("kprobe__do_sys_openat2"));

            // get the generated struct type
            var eventType = program.getTypeForClass(Event.class);

            var ringBuffer = program.getRingBufferByName("rb", eventType,
             (buffer, event) -> {
                System.out.printf("do_sys_openat2 called by:%s file:%s pid:%d\n", 
                                  event.comm(), event.filename(), event.pid());
            });
            while (true) {
                ringBuffer.consumeAndThrow();
            }
        }
    }
}

The annotation processor currently supports the following members in records:

  • integer types (int, long, …), optionally annotated with @Unsigned if unsigned
  • String types, annotated with @Size to specify the size
  • Other @Type annotated types in the same scope
  • @Type.Member annotated member to specify the BPFType directly

You can find the up-to-date list in the documentation for the Type annotation.

Conclusion

We have to model all C types that we use in both eBPF and Java in Java, too; this includes placing the different members of structs in memory and keeping them properly aligned. We saw that the general algorithm behind the layouting is straightforward. This algorithm can be used in the hello-ebpf library with an annotation processor to make writing eBPF applications more concise and less error-prone.

I hope you liked this introduction to struct layouts. See you in two weeks when we start supporting more features of libbpf.

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone.

To Brussels, Canada and back

Last year was my first year blogging, speaking at conferences, meeting incredible people, and seeing places I’ve never been before. It was at times quite arduous but at the same time energizing, as you can read in my post Looking back on one year of speaking and blogging. I didn’t want it to be a one-off year, so I dutifully started a new blog series on eBPF and applied for conferences… And I got accepted at a few of them, which was really great because I started missing traveling after almost three months of being home. In this blog post, I’ll cover my first three conferences this year: FOSDEM in Brussels, ConFoo in Montreal, and Voxxed Days Zurich; they all happened between early February and early March.

It was the most travel, distance (and continent) wise, that I ever did before, by quite some margin:

Continue reading

Hello eBPF: Ring buffers in libbpf (6)

Welcome back to my blog series on eBPF. Two weeks ago, I got started in using libbpf instead of libbcc. This week, I show you how to use ring buffers, port the code from Ansil H’s blog post eBPF for Linux Admins: Part IX from C to Java, and add tests to the underlying map implementation.

My libbpf-based implementation advances slower than the bcc-based, as I thoroughly test all added functionality and develop a proper Java API, not just a clone.

But first, what are eBPF ring buffers:

Ring buffers

In Hello eBPF: Recording data in event buffers (3), I showed you how to use perf event buffers, which are the predecessor to ring buffers and allow us to communicate between kernel and user-land using events. But perf buffers have problems:

It works great in practice, but due to its per-CPU design it has two major short-comings that prove to be inconvenient in practice: inefficient use of memory and event re-ordering.

To address these issues, starting from Linux 5.8, BPF provides a new BPF data structure (BPF map): BPF ring buffer (ringbuf). It is a multi-producer, single-consumer (MPSC) queue and can be safely shared across multiple CPUs simultaneously.

BPF ring buffer by Andrii Nakryiko

Ring buffers are still circular buffers:

Their usage is similar to the perf event buffers we’ve seen before. The significant difference is that we implemented the perf event buffers using the libbcc-based eBPF code, which made creating a buffer easy:

BPF_PERF_OUTPUT(rb);

Libbcc compiles the C code with macros. With libbpf, we have to write all that ourselves:

// anonymous struct assigned to rb variable
struct
{
  // specify the type, eBPF specific syntax
  __uint (type, BPF_MAP_TYPE_RINGBUF);
  // specify the size of the buffer
  // has to be a multiple of the page size 
  __uint (max_entries, 256 * 4096);
} rb SEC (".maps") /* placed in maps section */;

More on the specific syntax in the mail for the patch specifying it, more in the ebpf-docs.

On the eBPF side in the kernel, ring buffers have several important helper functions that allow their easy use:

bpf_ringbuf_output

long bpf_ringbuf_output(void *ringbuf, void *data, __u64 size, __u64 flags)

Copy the specified number of bytes of data into the ring buffer and send notifications to user-land. This function returns a negative number on error and zero on success.

bpf_ringbuf_reserve

void* bpf_ringbuf_reserve(void *ringbuf, __u64 size, __u64 flags)

Reserve a specified number of bytes in the ring buffer and return a pointer to the start. This lets us write events directly into the ring buffer’s memory (source).

bpf_ringbuf_submit

void *bpf_ringbuf_submit(void *data, __u64 flags)

Submit the reserved ring buffer event (reserved via bpf_ringbuf_reserve).

You might assume that you can build your own bpf_ringbuf_output with just bpf_ringbuf_reserve and bpf_ringbuf_submit and you’re correct. When we look into the actual implementation of bpf_ringbuf_output, we see that it is not that much more:

BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, 
           void *, data, u64, size,
	   u64, flags)
{
  struct bpf_ringbuf_map *rb_map;
  void *rec;
        
  // check flags
  if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
    return -EINVAL;

  // reserve the memory
  rb_map = container_of(map, struct bpf_ringbuf_map, map);
  rec = __bpf_ringbuf_reserve(rb_map->rb, size);
  if (!rec)
    return -EAGAIN;

  // copy the data into the reserved memory
  memcpy(rec, data, size);

  // equivalent to bpf_ringbuf_submit(rec, flags)
  bpf_ringbuf_commit(rec, flags, false /* discard */);
  return 0;
}

bpf_ringbuf_discard

void bpf_ringbuf_discard(void *data, __u64 flags)

Discard the reserved ring buffer event.

bpf_ringbuf_query

__u64 bpf_ringbuf_query(void *ringbuf, __u64 flags)

Query various characteristics of provided ring buffer. What exactly is queries is determined by flags:

  • BPF_RB_AVAIL_DATA: Amount of data not yet consumed.
  • BPF_RB_RING_SIZE: The size of ring buffer.
  • BPF_RB_CONS_POS: Consumer position (can wrap around).
  • BPF_RB_PROD_POS: Producer(s) position (can wrap around).

Data returned is just a momentary snapshot of actual values and could be inaccurate, so this facility should be used to power heuristics and for reporting, not to make 100% correct calculation.

Return: Requested value, or 0, if flags are not recognized.

bpf-Helpers man-Page

You can find more information in these resources:

Ring Buffer eBPF Example

After I’ve shown you what ring buffers are on the eBPF side, we can look at the eBPF example that writes an event for every openat call, capturing the process id, filename, and process name and comes as an addition from Ansil H’s blog post eBPF for Linux Admins: Part IX:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <string.h>
                
#define TARGET_NAME "sample_write"
#define MAX_ENTRIES 10
#define FILE_NAME_LEN 256
#define TASK_COMM_LEN 256
                
// Structure to store the data that we want to pass to user
struct event
{
  u32 e_pid;
  char e_filename[FILE_NAME_LEN];
  char e_comm[TASK_COMM_LEN];
};
                
// eBPF map reference
struct
{
  __uint (type, BPF_MAP_TYPE_RINGBUF);
  __uint (max_entries, 256 * 4096);
} rb SEC (".maps");
                
// The ebpf auto-attach logic needs the SEC
SEC ("kprobe/do_sys_openat2")
     int kprobe__do_sys_openat2(struct pt_regs *ctx)
{
  char filename[256];
  char comm[TASK_COMM_LEN] = { };
  struct event *evt;
  const char fmt_str[] = "do_sys_openat2 called by:%s file:%s pid:%d";
                
  // Reserve the ring-buffer
  evt = bpf_ringbuf_reserve(&rb, sizeof (struct event), 0);
  if (!evt) {
      return 0;
  }
  // Get the PID of the process.
  evt->e_pid = bpf_get_current_pid_tgid();
                
  // Read the filename from the second argument
  // The x86 arch/ABI have first argument 
  // in di and second in si registers (man syscall)
  bpf_probe_read(evt->e_filename, sizeof(filename), 
        (char *) ctx->si);
                
  // Read the current process name
  bpf_get_current_comm(evt->e_comm, sizeof(comm));
            
  bpf_trace_printk(fmt_str, sizeof(fmt_str), evt->e_comm,
        evt->e_filename, evt->e_pid);
  // Also send the same message to the ring-buffer
  bpf_ringbuf_submit(evt, 0);
  return 0;
}
                
char _license[] SEC ("license") = "GPL";

Ring Buffer Java Example

With this in hand, we can implement the RingSample using the newly added functionality in hello-ebpf:

@BPF
public abstract class RingSample extends BPFProgram {

  static final String EBPF_PROGRAM = """
              // ...
            """;

  private static final int FILE_NAME_LEN = 256;
  private static final int TASK_COMM_LEN = 16;
  
  // event record
  record Event(@Unsigned int pid, 
               String filename, 
               @Size(TASK_COMM_LEN) String comm) {}

  // define the event records layout
  private static final BPFStructType<Event> eventType = 
          new BPFStructType<>("rb", List.of(
          new BPFStructMember<>("e_pid", 
                  BPFIntType.UINT32, 0, Event::pid),
          new BPFStructMember<>("e_filename", 
                  new StringType(FILE_NAME_LEN), 
                  4, Event::filename),
          new BPFStructMember<>("e_comm", 
                  new StringType(TASK_COMM_LEN), 
                  4 + FILE_NAME_LEN, Event::comm)
  ), new AnnotatedClass(Event.class, List.of()), 
  fields -> new Event((int)fields.get(0),
          (String)fields.get(1), (String)fields.get(2)));

  public static void main(String[] args) {
    try (RingSample program = BPFProgram.load(RingSample.class)) {
      // attach the kprobe
      program.autoAttachProgram(
              program.getProgramByName("kprobe__do_sys_openat2"));
      // obtain the ringbuffer
      // and write a message every time a new event is obtained
      var ringBuffer = program.getRingBufferByName("rb", eventType, 
              (buffer, event) -> {
        System.out.printf("do_sys_openat2 called by:%s file:%s pid:%d\n", 
                event.comm(), event.filename(), event.pid());
      });
      while (true) {
        // consume and throw any captured
        // Java exception from the event handler
        ringBuffer.consumeAndThrow();
      }
    }
  }
}

You can run the example via ./run_bpf.sh RingSample:

do_sys_openat2 called by:C1 CompilerThre file:/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.intellij-idea-community.intellij-idea-community-a46a168b-28d0-4bb9-9e15-f3a966353efe.scope/memory.max pid:69817
do_sys_openat2 called by:C1 CompilerThre file:/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.intellij-idea-community.intellij-idea-community-a46a168b-28d0-4bb9-9e15-f3a966353efe.scope/memory.max pid:69812
do_sys_openat2 called by:java file:/home/i560383/.sdkman/candidates/java/21.0.2-sapmchn/lib/libjimage.so pid:69797

Conclusion

The libbpf part of hello-ebpf keeps evolving. With this blog post, I added support for the first kind of eBPF maps and ring buffers, with a simplified Java API and five unit tests. I’ll most likely work on the libbpf part in the future, as it is far easier to work with than with libbcc.

Thanks for joining me on this journey to create a proper Java API for eBPF. Feel free to try the examples for yourself or even write new ones and join the discussions on GitHub. See you in my next blog post about my journey to Canada or in two weeks for the next installment of this series.

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone.