Hello eBPF: Concurrency Testing using Custom Linux Schedulers (19)

Welcome back to my hello-ebpf series. I was at FOSDEM earlier this month and gave three(ish) talks. This blog post covers one of them, which is related to concurrency testing.

Backstory

The whole story started in the summer of last year when I thought about the potential use cases for custom Linux schedulers. The only two prominent use cases at the time were:

  1. Improve performance on servers and heterogeneous systems
  2. Allow people to write their own schedulers for educational purposes

Of course, the latter is also important for the hello-ebpf project, which allows Java developers to create their own schedulers. But I had another idea back then: Why can’t we use custom Linux schedulers for concurrency testing? I presented this idea in my keynote at the eBPF summit in September, which you can watch here.

I ended up giving this talk because of a certain Bill Mulligan, who liked the idea of showing my eBPF SDK in Java and sched-ext in a single talk. For whatever reason, I foolishly agreed to give the talk and spent a whole weekend in Oslo before JavaZone frantically implementing sched-ext support in hello-ebpf.

My idea was that one could use a custom scheduler to test specific scheduling orders to find (and possibly reproduce) erroneous behavior in concurrent code. One of the examples in my talk was deadlocks:

Task-Control

The result of this idea was my blog post Hello eBPF: Control task scheduling with a custom scheduler written in Java (16) and the accompanying code:

The scheduler checks on every scheduling decision to determine whether a task can be scheduled. Coupled with a proper API, this can already be useful.
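What could such an API look like? Here is a minimal plain-Java sketch of the idea (all names are illustrative, not the actual hello-ebpf API): user space toggles a per-thread flag, and the scheduler consults it on every scheduling decision.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskControlSketch {
    // thread id -> may this thread be scheduled onto a CPU?
    // (in the real implementation, this would live in an eBPF map)
    private final Map<Integer, Boolean> schedulable = new ConcurrentHashMap<>();

    void stop(int tid)  { schedulable.put(tid, false); }
    void start(int tid) { schedulable.put(tid, true); }

    /** What the scheduler would check on every scheduling decision. */
    boolean maySchedule(int tid) {
        return schedulable.getOrDefault(tid, true);
    }

    public static void main(String[] args) {
        var control = new TaskControlSketch();
        control.stop(1234);
        System.out.println(control.maySchedule(1234)); // false
        control.start(1234);
        System.out.println(control.maySchedule(1234)); // true
    }
}
```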

Initial Idea

But then FOSDEM came up, and I wanted to submit a talk on this topic to the Testing and Continuous Delivery track. With the help of Jake Hillion, I crafted the following idea (and submitted it): a scheduler that can fuzz concurrent programs. It takes the task-control code one step further, implementing the randomization of scheduling orders directly in a scheduler:

Imagine you are hunting a concurrency bug that requires threads to run in a specific order. Wouldn’t it be great if you could stop and start threads at random? Prevent them from being scheduled onto the CPU? And the best part: without the application being able to prevent this, as it could with the POSIX SIGSTOP and SIGCONT signals? In come the scheduler extensions for the Linux kernel. Introduced in version 6.12, they allow you to quickly write your own schedulers with eBPF, and can be the base for simple libraries that enable you to start and stop threads directly in the Linux kernel. This opens up the possibility of creating complex scheduling scenarios on a whim.

In this talk, we’ll show you a prototypical sched_ext-based library for concurrency testing.

Abstract of Our Talk

As you can read in any operating system textbook (and on Wikipedia), scheduler developers typically have the following goals in mind:

  • maximizing throughput (the total amount of work completed per time unit);
  • minimizing wait time (time from work becoming ready until the first point it begins execution);
  • minimizing latency or response time (time from work becoming ready until it is finished in the case of batch activity, or until the system responds and hands the first output to the user in the case of interactive activity);
  • maximizing fairness (equal CPU time to each process, or more generally appropriate times according to the priority and workload of each process).
WIKIPEDIA

This means that all commonly known schedulers exhibit somewhat predictable behavior. So, the main idea of the proposed talk was to create a scheduler with erratic behavior – a scheduler that ultimately ignores goals like fairness or minimizing wait time. With sched-ext, implementing such a scheduler became possible.

To our surprise, the talk was accepted. Remember, this was the only sched-ext talk at FOSDEM aimed at an audience outside the kernel and eBPF communities.

The Talk

And even more surprising was the full room at FOSDEM, with people apparently unable to get in because there was so much interest. Sadly, despite preparing the talk and writing the code, I couldn’t join, because my train connection to Brussels got derailed (only figuratively). Jake told me later that he was still sleepy when he received my message at 8 am, but this changed rapidly when he realized that he would have to give the 12 pm talk alone. But luckily, as you can see in the recording, the talk went well:

I’m happy that Jake Hillion rose to the challenge and delivered a really good talk (“Even my 17-year-old daughter understood it”, a colleague from SAP later told me).

Even better, it caught the attention of Daroc Alden, who wrote a great article on LWN, covering the talk and the topic better than I ever could:

Hillion and Bechberger’s scheduler does this deliberately: when a thread is ready to run, instead of assigning it to a CPU right away, their scheduler will put it to sleep for a random amount of time based on some configurable parameters. This gives threads that would ordinarily not have any problem keeping up with each other a chance to let one thread race ahead and overwhelm the other. It also ensures that threads run in a random order, which can help expose bugs as well. In a way, it is the fact that Linux’s default scheduler is so good that makes these problems hard to reproduce, Hillion explained. EEVDF is too fair, and only actually lets one thread overwhelm the other when the machine is already overloaded, and even then rarely.

The idea of running threads in a random order is not new. There are plenty of specialized concurrency testing tools that do something similar. But concurrency-fuzz-scheduler is not nearly as fine-grained as such tools usually are. It won’t help deterministically find violations of a particular memory model, for example. It will find the kinds of gross logic errors that are common in real-life applications, however.

Exposing concurrency bugs with a custom scheduler

I recommend reading the article and watching the recording of the talk.

Code

The tool I developed is in an early state; it’s more of a proof of concept that helps to grasp the idea and its possibilities.

You can find the code on GitHub, written, of course, in Java. I’ll cover the eBPF part of the code in the following, skipping over the boilerplate CLI portion.

First, we define a few data structures:

@Type
public record DurationRange(@Unsigned long minNs, @Unsigned long maxNs) {}

@Type
public record SchedulerSetting(int scriptPID, 
    DurationRange sleepRange, 
    DurationRange runRange,
    // slice length of all non-program-under-test tasks
    long systemSliceNs,
    // slice length of all program-under-test tasks
    long sliceNs,
    // scale the slice by the number of already waiting tasks in the global queue
    boolean scaleSlice, 
    // log some information to the trace log
    boolean log, 
    // don't treat GC and JIT related tasks as related to the program-under-test
    boolean focusOnJava) {
}

@Type
enum TaskState implements Enum<TaskState> {
    START, RUNNING, SLEEPING
}

@Type
static class TaskContext {

    TaskState state;

    @Unsigned long timeAllowedInState;

    /** Time the task last has been started on the CPU */
    @Unsigned long lastStartNs;
    /** Total runtime of the task since it last slept */
    @Unsigned long runtimeSinceLastSleepNs;
    /** Time the task last has been moved off a CPU */
    @Unsigned long lastStopNs;
}

Every task that we care about (related to the program under test) can have one of three states:

  • START: the task just came into the system; we have to set everything up for it
  • RUNNING: the task can currently be scheduled and is therefore able to run
  • SLEEPING: the task can’t be scheduled

We switch between RUNNING and SLEEPING whenever the time allowed in the current state (timeAllowedInState, as stored in the TaskContext) has elapsed. So, we essentially have a simple state machine:
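The transitions can be sketched in plain Java (a restatement of the logic described above, not the actual eBPF code):

```java
public class StateMachineSketch {
    enum TaskState { START, RUNNING, SLEEPING }

    /**
     * Returns the next state, given the elapsed time in the current state.
     * In the real scheduler, the START decision is random, and elapsedNs is
     * computed from bpf_ktime_get_ns() and lastStopNs.
     */
    static TaskState nextState(TaskState state, long elapsedNs, long allowedNs,
                               boolean startAsRunning) {
        return switch (state) {
            case START    -> startAsRunning ? TaskState.RUNNING : TaskState.SLEEPING;
            case RUNNING  -> elapsedNs >= allowedNs ? TaskState.SLEEPING : TaskState.RUNNING;
            case SLEEPING -> elapsedNs >= allowedNs ? TaskState.RUNNING : TaskState.SLEEPING;
        };
    }

    public static void main(String[] args) {
        // a RUNNING task whose time is up goes to sleep ...
        System.out.println(nextState(TaskState.RUNNING, 10, 5, true));
        // ... and a SLEEPING task whose time is up may run again
        System.out.println(nextState(TaskState.SLEEPING, 10, 5, true));
    }
}
```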

We have to store the context somewhere, so we also need some maps:

// As always the ID of the global scheduling queue
private static final int SHARED_DSQ_ID = 0;

final GlobalVariable<SchedulerSetting> schedulerSetting = new GlobalVariable<>(new SchedulerSetting(0, new DurationRange(0, 0), new DurationRange(0, 0), 1000000, 1000000, true, false, false));

@BPFMapDefinition(maxEntries = 10000)
BPFLRUHashMap<@Unsigned Integer, TaskContext> taskContexts;

/** Is the task related to the fuzzed script? */
@BPFMapDefinition(maxEntries = 100000)
BPFLRUHashMap<@Unsigned Integer, Boolean> isScriptRelated;

We cache the information on whether a task is related to the program under test in a map, so we don’t always have to recompute it:

@BPFFunction
@AlwaysInline
boolean isTaskScriptRelated(Ptr<TaskDefinitions.task_struct> task) {
    var curPid = task.val().tgid; // this is the process id
    var tgidRel = isScriptRelated.bpf_get(curPid);
    var scriptPid = schedulerSetting.get().scriptPID;
    if (scriptPid == 0) {
        return false;
    }
    if (tgidRel == null) { // recompute if needed
        var isRelated = curPid == scriptPid;
        if (!isRelated) {
            // bpf_trace_printk("Process %s has parent %s", task.val().comm, task.val().real_parent.val().comm);
            // check parent process
            var parentTGid = task.val().real_parent.val().tgid;
            var parentPid = task.val().real_parent.val().pid;
            isRelated = parentPid == scriptPid || parentTGid == scriptPid;
        }
        isScriptRelated.put(task.val().pid, isRelated);
        isScriptRelated.put(curPid, isRelated);
        return isRelated;
    }
    return tgidRel.val();
}

Of course, we then need to define some basic methods to get a random number in a specific range

@BPFFunction
@Unsigned long randomInRange(@Unsigned long min, @Unsigned long max) {
    if (min == max) {
        return min;
    }
    return min + bpf_get_prandom_u32() % (max - min);
}

and to initialize task contexts:

@BPFFunction
@AlwaysInline
void getTaskContext(Ptr<TaskDefinitions.task_struct> task, Ptr<Ptr<TaskContext>> contextPtr) {
    var id = task.val().tgid;
    var ret = taskContexts.bpf_get(id);
    if (ret == null) {
        var context = new TaskContext();
        context.state = TaskState.START;
        context.lastStartNs = 0;
        context.runtimeSinceLastSleepNs = 0;
        context.lastStopNs = 0;
        taskContexts.put(id, context);
    }
    var ret2 = taskContexts.bpf_get(id);
    contextPtr.set(ret2);
}
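As a quick sanity check, the range mapping used by randomInRange can be exercised in plain Java, with java.util.Random standing in for bpf_get_prandom_u32:

```java
import java.util.Random;

public class RandomInRangeCheck {
    // same mapping as the eBPF randomInRange above
    static long randomInRange(long rand, long min, long max) {
        if (min == max) {
            return min;
        }
        return min + rand % (max - min);
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // stands in for bpf_get_prandom_u32
        for (int i = 0; i < 1000; i++) {
            long r = Integer.toUnsignedLong(rng.nextInt()); // a u32 value
            long v = randomInRange(r, 1_000_000, 5_000_000);
            if (v < 1_000_000 || v >= 5_000_000) {
                throw new AssertionError("out of range: " + v);
            }
        }
        // note: like the eBPF version, this mapping has a slight modulo
        // bias, which is perfectly fine for fuzzing purposes
        System.out.println("all in range");
    }
}
```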

The initialization, depending on the randomly chosen start state of relevant new tasks, is done by different methods, which are also used when a task changes its state:

@BPFFunction
@AlwaysInline
void initSleeping(Ptr<TaskContext> context, Ptr<TaskDefinitions.task_struct> p) {
    context.val().state = TaskState.SLEEPING;
    context.val().lastStopNs = bpf_ktime_get_ns();
    // assign a random duration in this state
    context.val().timeAllowedInState = randomInRange(
      schedulerSetting.get().sleepRange.minNs(), 
      schedulerSetting.get().sleepRange.maxNs());
    if (schedulerSetting.get().log()) {
        bpf_trace_printk("%s is sleeping for %dms\n", p.val().comm, context.val().timeAllowedInState / 1_000_000);
    }
}

@BPFFunction
@AlwaysInline
void initRunning(Ptr<TaskContext> context, Ptr<TaskDefinitions.task_struct> p) {
    context.val().state = TaskState.RUNNING;
    context.val().lastStopNs = bpf_ktime_get_ns();
    context.val().timeAllowedInState = randomInRange(
      schedulerSetting.get().runRange.minNs(), 
      schedulerSetting.get().runRange.maxNs());
    if (schedulerSetting.get().log()) {
        bpf_trace_printk("%s is running for %dms\n", p.val().comm, context.val().timeAllowedInState / 1_000_000);
    }
}

The actual state change (and the check whether one is needed) is done in the updateStateIfNeededAndReturnIfSchedulable method:

@BPFFunction
@AlwaysInline
boolean updateStateIfNeededAndReturnIfSchedulable(Ptr<TaskDefinitions.task_struct> p) {
    if (!isTaskScriptRelated(p)) {
        // don't schedule tasks that are not related to the program-under-test
        return true;
    }
    Ptr<TaskContext> context = null;
    getTaskContext(p, Ptr.of(context));
    if (context == null) { // to make the verifier happy
        return true;
    }
    if (schedulerSetting.get().focusOnJava) {
        // basic check to ignore GC and compiler threads of Java applications
        // these are usually not relevant and targeting them would cause
        // irrelevant problems with the execution of the program-under-test 
        if (((p.val().comm[0] == 'C' ||
            p.val().comm[0] == 'G') && (p.val().comm[1] == '1' || 
            p.val().comm[1] == '2')) || 
           (p.val().comm[0] == 'V' && p.val().comm[1] == 'M')) {
            // could I implement this better? Yes, but it works for the demo
            return true;
        }
    }
    if (context.val().state == TaskState.START) {
        // initialize the task, randomly choose if it should sleep or run
        if (randomInRange(0, 2) == 0) {
            initSleeping(context, p);
            return false;
        } else {
            initRunning(context, p);
            return true;
        }
    }

    if (context.val().state == TaskState.RUNNING) { // check if the task has to sleep
        if (bpf_ktime_get_ns() - context.val().lastStopNs 
                >= context.val().timeAllowedInState) { // sleep if the task has run too long
            initSleeping(context, p);
            return false;
        }
        return true;
    } else { // check if the task can be scheduled again
        if (bpf_ktime_get_ns() - context.val().lastStopNs 
                >= context.val().timeAllowedInState) {
            initRunning(context, p);
            return true;
        }
        return false;
    }
}

The actual scheduling is now pretty simple. We start with the init and enqueue methods that are implemented as always for the FIFO scheduler (see Hello eBPF: Writing a Linux scheduler in Java with eBPF (15)):

@Override
public int init() {
    return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}

@Override
public void enqueue(Ptr<TaskDefinitions.task_struct> p, long enq_flags) {
    var isScriptRelated = isTaskScriptRelated(p);
    @Unsigned long sliceLength = isScriptRelated ? 
      schedulerSetting.get().sliceNs() : 
      schedulerSetting.get().systemSliceNs();
    if (schedulerSetting.get().scaleSlice()) {
        sliceLength = sliceLength / scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
    }
    scx_bpf_dispatch(p, SHARED_DSQ_ID, sliceLength, enq_flags);
}

We must note that we schedule all tasks to the global queue, even if we don’t want to schedule them onto a CPU because of their current state. This could, and should, be improved by creating a secondary queue for tasks in the SLEEPING state.

Now to the dispatching, which takes care of only scheduling tasks that are not in the SLEEPING state, while respecting each task’s CPU constraints:

@BPFFunction
@AlwaysInline
public boolean tryDispatching(Ptr<BpfDefinitions.bpf_iter_scx_dsq> iter, 
  Ptr<TaskDefinitions.task_struct> p, int cpu) {
    // check if the CPU is usable by the task
    if (!bpf_cpumask_test_cpu(cpu, p.val().cpus_ptr)) {
        return false;
    }
    return scx_bpf_dispatch_from_dsq(iter, p, SCX_DSQ_LOCAL_ON.value() | cpu, 
                                     SCX_ENQ_PREEMPT.value());
}

@BPFFunction
@AlwaysInline
public boolean hasConstraints(Ptr<TaskDefinitions.task_struct> p) {
    return ((p.val().flags & PF_KTHREAD) != 0) || 
      (p.val().nr_cpus_allowed != scx_bpf_nr_cpu_ids());
}

@BPFFunction
@AlwaysInline
public boolean canScheduleOnCPU(Ptr<TaskDefinitions.task_struct> p, int cpu) {
    return true; // this could be used to implement CPU sets
}

@Override
public void dispatch(int cpu, Ptr<TaskDefinitions.task_struct> prev) {
    Ptr<TaskDefinitions.task_struct> p = null;
    bpf_for_each_dsq(SHARED_DSQ_ID, p, iter -> {
        if (!updateStateIfNeededAndReturnIfSchedulable(p)) { // check state machine
            _continue();
        }
        if ((hasConstraints(p) || canScheduleOnCPU(p, cpu)) && tryDispatching(iter, p, cpu)) {
            return; // has different semantics than continue, return will return from the dispatch function
        }
    });
}

We also need to update the start and stop information in the context of every relevant task:

@Override
public void running(Ptr<TaskDefinitions.task_struct> p) {
    if (!isTaskScriptRelated(p)) {
        return;
    }
    Ptr<TaskContext> context = null;
    getTaskContext(p, Ptr.of(context));
    if (context != null) {
        context.val().lastStartNs = bpf_ktime_get_ns();
    }
}

@Override
public void stopping(Ptr<TaskDefinitions.task_struct> p, boolean runnable) {
    if (!isTaskScriptRelated(p)) {
        return;
    }
    Ptr<TaskContext> context = null;
    getTaskContext(p, Ptr.of(context));
    if (context != null) {
        context.val().runtimeSinceLastSleepNs = 
          context.val().runtimeSinceLastSleepNs + (bpf_ktime_get_ns() - context.val().lastStartNs);
    }
}

And remove tasks from the isScriptRelated map whenever they exit, as task IDs can be reused:

@Override
public void disable(Ptr<TaskDefinitions.task_struct> p) {
    isScriptRelated.bpf_delete(p.val().tgid);
}

This is the relevant part of the code. Yes, it can be improved in several ways, but this is for another day.

Usage

After you build the tool via mvn package, you can configure the SchedulerSetting options via the CLI:

./scheduler.sh --help
Usage: scheduler.sh [-dhV] [--java] [--log] [-e=<errorCommand>]
                    [--error-check-interval=<errorCheckIntervalNs>]
                    [-i=<iterationTimeNs>] [-m=<maxIterations>] [-r=<runRange>]
                    [-s=<sleepRange>] [--slice=<sliceNs>]
                    [--system-slice=<systemSliceNs>] script

More on this in the README, but you essentially run ./scheduler.sh PROGRAM to run a program in the context of the fuzzer.

Example Program

In the talk, I used a toy example that is related to my work on a new profiler for Java:

This example involves a producer that produces items with a best-before date; the items should be consumed within a specific time frame. The consumer regularly consumes the items and crashes if an item is too old. The items are passed via a concurrent queue. This is a simplification of the rather typical scenario in which items are not valid indefinitely but contain invalid information after some state change.
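A stripped-down sketch of this scenario (the names, the 500 ms validity, and the item count here are assumptions, not the actual sample code) could look like this:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueSketch {
    // each item carries a best-before timestamp
    record Item(long bestBeforeNs) {}

    public static void main(String[] args) throws Exception {
        var queue = new ConcurrentLinkedQueue<Item>();
        long validityNs = 500_000_000; // items are valid for 500 ms (assumed)

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                queue.add(new Item(System.nanoTime() + validityNs));
            }
        });
        producer.start();
        producer.join();

        // The consumer crashes on expired items; a fuzzing scheduler that
        // pauses this thread for long enough makes the check below fail.
        Item item;
        while ((item = queue.poll()) != null) {
            if (System.nanoTime() > item.bestBeforeNs()) {
                throw new IllegalStateException("Consumed an expired item");
            }
        }
        System.out.println("all items fresh");
    }
}
```

Run normally, the consumer easily drains the queue in time; the fuzzing scheduler’s random sleeps are what push the consumer past the best-before dates.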

You can find the code of the toy example on GitHub.

To build it, use samples/build_queue.sh. Then you can run it using the tool:

> ./scheduler.sh samples/run_queue.sh --log --java
Iteration
[4.293] java is sleeping for 586ms
[4.886] java is running for 5ms
[4.886] java is sleeping for 2241ms
[7.132] java is running for 10ms
[7.144] java is sleeping for 299ms
[7.445] java is running for 3ms
[7.449] java is sleeping for 1038ms
[8.606] java is running for 6ms
[8.611] java is sleeping for 827ms
[9.435] java is running for 10ms
[9.446] java is sleeping for 1543ms
[10.990] java is running for 16ms
[11.012] java is sleeping for 754ms
[11.767] java is running for 15ms
[11.792] Producer is sleeping for 299ms
[12.092] Producer is running for 14ms
[12.111] Producer is sleeping for 1605ms
[13.719] Producer is running for 5ms
[13.731] Consumer is sleeping for 301ms
[14.034] Consumer is running for 12ms
[14.054] Producer is sleeping for 1783ms
[15.839] Producer is running for 15ms
[15.860] Producer is sleeping for 285ms
[16.146] Producer is running for 5ms
[16.156] Consumer is sleeping for 968ms
# ...
[22.494] Producer is running for 11ms
[22.514] Producer is sleeping for 1815ms
[24.330] Producer is running for 16ms
[24.358] Producer is sleeping for 1174ms
[25.528] Producer is running for 7ms
[25.538] Consumer is sleeping for 1912ms
[27.456] Consumer is running for 3ms
[27.465] Consumer is sleeping for 1177ms
[28.655] Consumer is running for 15ms
[28.662] Consumer is sleeping for 1525ms
[30.187] Consumer is running for 6ms
[30.195] Consumer is sleeping for 475ms
[30.674] Consumer is running for 12ms
Program failed after 30.774

This erratic scheduler caused the consumer thread to be paused randomly and, in one instance, processed an invalid item. Of course, this toy example is constructed to make the error occur quickly so that it can be used in a demo. But the same technique should work with proper applications, too, just a lot slower. Fuzzing takes time.

Improvements

There are two main areas of improvement: making the fuzzing more targeted and improving the scheduler itself. For the latter, we could use multiple queues and potentially a more advanced base scheduler than FIFO. To make the fuzzing more targeted, we could, for example, randomize the length of the sleeps after scheduler yields. These yields can happen when a task doesn’t get the lock it asks for and has to wait, or, of course, when the task sleeps. We could gain even more knowledge about the tasks if we instrumented the application. This is quite easy in Java with bytecode instrumentation and would allow us to push more information dynamically into our fuzzing scheduler.

There are many more ways to improve the fuzzer; the only issue is finding the time to implement and test them.

Conclusion

I want to conclude with the words of Daroc Alden:

Hillion and Bechberger’s concurrency-fuzz-scheduler is a promising example of a sched_ext scheduler that does something more than act as a testbed for scheduling ideas. It is a practical tool for shaking more bugs out of applications — something that is always helpful, in a world where software only grows more complex.

And, of course, it’s a testament to the versatility of hello-ebpf. The tool is in its early stages, but given enough time, it can hopefully mature into a helpful tool for finding bugs in OpenJDK and more.

Thank you for joining me on this journey to bring eBPF and sched-ext to the Java world.

P.S.: This wasn’t my only talk at FOSDEM; I also gave a talk on sched-ext in the eBPF track and participated in a talk on the future of Java profiling in the Free Java track. See my FOSDEM speaker page for the details.

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone. Thanks to Jake for helping with both the talk and this blog post.

A tiny street in Brussels in the early morning

Author

  • Johannes Bechberger

    Johannes Bechberger is a JVM developer working on profilers and their underlying technology in the SapMachine team at SAP. This includes improvements to async-profiler and its ecosystem, a website to view the different JFR event types, and improvements to the FirefoxProfiler, making it usable in the Java world. He started at SAP in 2022 after two years of research studies at the KIT in the field of Java security analyses. His work today is comprised of many open-source contributions and his blog, where he writes regularly on in-depth profiling and debugging topics, and of working on his JEP Candidate 435 to add a new profiling API to the OpenJDK.


