Welcome back to my hello-ebpf series. I was a FOSDEM earlier this month and gave three(ish) talks. This blog post covers one of them, which is related to concurrency testing.
Backstory
The whole story started in the summer of last year when I thought about the potential use cases for custom Linux schedulers. The only two prominent use cases at the time were:
Improve performance on servers and heterogenous systems
Allow people to write their own schedulers for educational purposes
Of course, the latter is also important for the hello-ebpf project, which allows Java developers to create their own schedulers. But I had another idea back then: Why can’t we use custom Linux schedulers for concurrency testing? I presented this idea in my keynote at the eBPF summit in September, which you can watch here.
I ended up giving this talk because of a certain Bill Mulligan, who liked the idea of showing my eBPF SDK in Java and sched-ext in a single talk. For whatever reason, I foolishly agreed to give the talk and spent a whole weekend in Oslo before JavaZone frantically implementing sched-ext support in hello-ebpf.
My idea then was one could use a custom scheduler to test specific scheduling orders to find (and possibly) reproduce erroneous behavior of concurrent code. One of the examples in my talk was dead-locks:
If you’re here for eBPF content, this blog post is not for you. I recommend reading an article on a concurrency fuzzing scheduler at LWN.
Ever wonder how the JDK Flight Recorder (JFR) keeps track of the classes and methods it has collected for stack traces and more? In this short blog post, I’ll explore JFR tagging and how it works in the OpenJDK.
Tags
JFR files consist of self-contained chunks. Every chunk contains:
The maximum chunk size is usually 12MB, but you can configure it:
java -XX:FlightRecorderOptions:maxchunksize=1M
Whenever JFR collects methods or classes, it has to somehow tell the JFR writer which entities have been used so that their mapping can be written out. Each entity also has to have a tracing ID that can be used in the events that reference it.
This is where JFR tags come in. Every class, module, and package entity has a 64-bit value called _trace_id (e.g., classes). Which consists of both the ID and the tag. Every method has an _orig_method_idnum, essentially its ID and a trace flag, which is essentially the tag.
In a world without any concurrency, the tag could just be a single bit, telling us whether an entity is used. But in reality, an entity can be used in the new chunk while we’re writing out the old chunk. So, we need to have two distinctive periods (0 and 1) and toggle between them whenever we write a chunk.
Tagging
We can visualize the whole life cycle of a tag for a given entity:
In this example, the entity, a class, is brought into JFR by the method sampler (link) while walking another thread’s stack. This causes the class to be tagged and enqueued in the internal entity queue (and is therefore known to the JFR writer) if it hasn’t been tagged before (source):
This shows that tagging also prevents entities from being duplicated in a chunk.
Then, when a chunk is written out. First, a safepoint is requested to initialize the next period (the next chunk) and the period to be toggled so that the subsequent use of an entity now belongs to the new period and chunk. Then, the entity is written out, and its tag for the previous period is reset (code). This allows the aforementioned concurrency.
But how does it ensure that the tagged classes aren’t unloaded before they are emitted? By writing out the classes when any class is unloaded. This is simple yet effective and doesn’t need any change in the GC.
Conclusion
Tagging is used in JFR to record classes properly, methods, and other entities while also preventing them from accidentally being garbage collected before they are written out. This is a simple but memory-effective solution. It works well in the context of concurrency but assumes entities are used in the event creation directly when tagging them. It is not supported to tag the entities and then push them into the queue to later create events asynchronously. This would probably require something akin to reference counting.
Thanks for coming this far in a blog post on a profiling-related topic. I chose this topic because I wanted to know more about tagging and plan to do more of these short OpenJDK-specific posts.