JFR and Equality: A tale of many objects

In the last blog post, I showed you how to silence JFR’s startup messages. This week’s blog post is also related to JFR, and no, it’s not about the JFR Events website, which got a simple search bar. It’s a short blog post on comparing objects from JFR recordings in Java and why this is slightly trickier than you might have expected.

Example

Getting a JFR recording is simple; just use the RecordingStream API. We do this in the following to record an execution trace of a tight loop using JFR and store it in a list:

List<RecordedEvent> events = new ArrayList<>();
// Know when to stop the loop
AtomicBoolean running = new AtomicBoolean(true);
// We obtain one hundred execution samples 
// that have all the same stack trace
final long currentThreadId = Thread.currentThread().threadId();
try (RecordingStream rs = new RecordingStream()) {
    rs.enable("jdk.ExecutionSample").with("period", "1ms");
    rs.onEvent("jdk.ExecutionSample", event -> {
        if (event.getThread("sampledThread")
                 .getJavaThreadId() != currentThreadId) {
            return; // don't record other threads
        }
        events.add(event);
        if (events.size() >= 100) {
            // we can signal to stop
            running.set(false);
        }
    });
    rs.startAsync();
    int i = 0;
    while (running.get()) { // some busy loop to produce sample
        for (int j = 0; j < 100000; j++) {
            i += j;
        }
    }
    rs.stop();
}
Continue reading

Silencing JFR’s Startup Message

TL;DR: -Xlog:jfr+startup=error is your friend.

Ever wondered why JFR emits something like

[0.172s][info][jfr,startup] Started recording 1. No limit specified, using maxsize=250MB as default.
[0.172s][info][jfr,startup] 
[0.172s][info][jfr,startup] Use jcmd 29448 JFR.dump name=1 to copy recording data to file.

when starting the Flight Recorder with -XX:StartFlightRecorder? Even though the default logging level is warning, not info?

This is what this week’s blog post is all about. After I showed you last week how to waste CPU like a Professional, this week I’ll show you how to silence JFR. Back to the problem:

Continue reading

How to waste CPU like a Professional

Or: Hey, keeping the CPU busy for a given amount of time should be easy?

Welcome back to my blog. Last week, I showed you how to profile your Cloud Foundry application, and the week before, how I made the CPU-time profiler a tiny bit better by removing redundant synchronization. This week’s blog post will be closer to the latter, trying to properly waste CPU.

As a short backstory, my profiler needed a test to check that the queue size of the sampler really increased dynamically (see Java 25’s new CPU-Time Profiler: Queue Sizing (3)), so I needed a way to let a thread spend a pre-defined number of seconds running natively on the CPU. You can find the test case in its hopefully final form here, but be aware that writing such cases is more complicated than it looks.

So here we are: in need of a way to properly waste CPU time, preferably in user-land, for a fixed amount of time. The problem: there are only scant resources online, so I decided to create my own. I’ll show you seven different ways to implement a simple

void my_wait(int seconds);

method that works on both macOS and Linux, and you’ll learn far more about this topic than you ever wanted to. All the code is MIT licensed; you can find it on GitHub in my waste-cpu-experiments, alongside some profiling results.
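To give a rough feeling for the problem in Java, here is a minimal sketch that spins until the current thread has consumed a given amount of CPU time, as reported by ThreadMXBean. It is not one of the seven variants from the repository; the class name BusyWait and the method wasteCpu are made up for this illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not taken from the waste-cpu-experiments repository.
final class BusyWait {
    // Written to in the loop so that the JIT cannot drop the work as dead code.
    static volatile long sink;

    /** Burns CPU until this thread has used at least cpuMillis of CPU time; returns the nanoseconds used. */
    static long wasteCpu(long cpuMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            throw new UnsupportedOperationException("per-thread CPU time is not available");
        }
        long target = cpuMillis * 1_000_000L; // milliseconds to nanoseconds
        long start = threads.getCurrentThreadCpuTime();
        long used = 0;
        long acc = 0;
        while (used < target) {
            // Do a chunk of work between clock reads to keep the overhead of the reads small.
            for (int i = 0; i < 100_000; i++) {
                acc += i;
            }
            sink = acc;
            used = threads.getCurrentThreadCpuTime() - start;
        }
        return used;
    }

    public static void main(String[] args) {
        long used = wasteCpu(100);
        System.out.println("used " + used / 1_000_000 + " ms of CPU time");
    }
}
```

Note that getCurrentThreadCpuTime counts user and system time together; getCurrentThreadUserTime would be the choice if only user-land time should count.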

As another tangent: Apparently, my Java 25’s new CPU-Time Profiler (1) blog post blew up on Hacker News. Fun times.

Continue reading

Profiling with the Cloud Foundry CLI Java plugin

Welcome back to my blog, this time for a blog post on profiling your Java applications in Cloud Foundry and the tool I helped to develop to make it easier.

Cloud Foundry “is an open source, multi-cloud application platform as a service (PaaS) governed by the Cloud Foundry Foundation, a 501(c)(6) organization” (Wikipedia). It allows you to run your workloads easily in the cloud, including your Java applications. You just need to define a manifest.yml, like for example:

---
applications:
- name: sapmachine21
  random-route: true
  path: test.jar
  memory: 512M
  buildpacks: 
  - sap_java_buildpack
  env:
    TARGET_RUNTIME: tomcat
    JBP_CONFIG_COMPONENTS: "jres: ['com.sap.xs.java.buildpack.jdk.SAPMachineJDK']"
    JBP_CONFIG_SAP_MACHINE_JDK : "{ version: 21.+ }"
    JBP_CONFIG_JAVA_OPTS: "[java_opts: '-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints']"

But how would you profile this application? This and more is the topic of this blog post.

I will not discuss why you might want to use Cloud Foundry or how you can deploy your own applications. I assume you came this far in the blog post because you already have basic Cloud Foundry knowledge and want to learn how to profile your applications easily.

The Java Plugin

Cloud Foundry has a cf CLI with a proper plugin system and lots of plugins. A team at SAP, which included Tim Gerrlach, started to develop the Java plugin many years ago. It’s a plugin offering utilities to gain insights into JVMs running in your Cloud Foundry app.

Continue reading

Java 25’s new CPU-Time Profiler: Removing Redundant Synchronization (4)

The changes I described in this blog post led to segfaults in tests, so I backtracked on them for now. Maybe I made a mistake implementing the changes, or my reasoning in the blog post is incorrect. I don’t know yet.

In the last blog post, I wrote about how to size the request queue properly and proposed the sampler queue’s dynamic sizing. But there are two topics I didn’t talk about in this or the previous blog post, one rather funny and one rather serious:

  1. Is the sampler queue really a queue?
  2. Should the queue implementation use Atomics and acquire-release semantics?

This is what we cover in this short blog post. First, to the rather fun topic:

Is it a Queue?

I always called the primary data structure a queue, but recently, I wondered whether this term is correct. But what is a queue?

Definition: A collection of items in which only the earliest added item may be accessed. Basic operations are add (to the tail) or enqueue and delete (from the head) or dequeue. Delete returns the item removed. Also known as “first-in, first-out” or FIFO.

Dictionary of Algorithms and Data Structures by Paul E. Black
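To make the definition concrete, here is a tiny sketch using the JDK’s ArrayDeque as a FIFO queue. It only illustrates the textbook behavior and says nothing about how the sampler actually uses its data structure; the class FifoDemo is made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class FifoDemo {
    /** Removes all items from the head and concatenates them in removal order. */
    static String drain(Queue<String> queue) {
        StringBuilder order = new StringBuilder();
        while (!queue.isEmpty()) {
            order.append(queue.remove()); // dequeue: always the earliest added item
        }
        return order.toString();
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("a"); // enqueue at the tail
        queue.add("b");
        queue.add("c");
        System.out.println(drain(queue)); // prints "abc"
    }
}
```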

But how does my sampler use the sampler queue?

Continue reading

Java 25’s new CPU-Time Profiler: Queue Sizing (3)

Welcome back to my series on the new CPU-time profiler in Java 25. In the previous blog post, I covered the implementation of the new profiler. In this week’s blog post, I’ll dive deep into the central request queue, focusing on deciding its proper size.

The JfrCPUTimeTraceQueue allows the signal handler to record sample requests that the out-of-thread sampler and the safepoint handler process. So it’s the central data structure of the profiler:

This queue is thread-local and pre-allocated, as it’s used in the signal handler, so the correct sizing is critical:

  • If the size is too small, you’ll lose many samples because the signal handler can’t record sample requests.
  • If you size it too large, you waste lots of memory. A sampling request is 48 bytes, so a queue with 500 elements (currently the default) requires 24kB. This adds up fast if you have more than a few threads.
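The arithmetic behind the second point can be sketched as follows. The 48 bytes per request and the default of 500 elements come from the text above; the thread count of 1000 is just an assumed example value.

```java
final class QueueSizing {
    static final int REQUEST_BYTES = 48; // size of one sampling request, as stated above

    /** Memory needed for one pre-allocated, thread-local queue. */
    static long queueBytes(int elements) {
        return (long) REQUEST_BYTES * elements;
    }

    /** Memory needed when every thread gets its own queue. */
    static long totalBytes(int elements, int threads) {
        return queueBytes(elements) * threads;
    }

    public static void main(String[] args) {
        System.out.println(queueBytes(500));       // prints 24000, i.e. 24 kB per thread
        System.out.println(totalBytes(500, 1000)); // prints 24000000, i.e. 24 MB for 1000 threads
    }
}
```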

So, in this blog post, we’re mainly concerned about setting the correct default size and discussing a potential solution to the whole problem.

Continue reading

Java 25’s new CPU-Time Profiler: The Implementation (2)

I developed, together with others, the new CPU-time profiler for Java, which is now included in JDK 25. A few weeks ago, I covered the profiler’s user-facing aspects, including the event types, configuration, and rationale, alongside the foundations of safepoint-based stack walking in JFR (see Taming the Bias: Unbiased Safepoint-Based Stack Walking). If you haven’t read those yet, I recommend starting there. In this week’s blog post, I’ll dive into the implementation of the new CPU-time profiler.

It was a remarkable coincidence that safepoint-based stack walking made it into JDK 25. Thanks to that, I could build on top of it without needing to re-implement:

  • The actual stack walking given a sampling request
  • Integration with the safepoint handler

Of course, I worked on this before, as described in Taming the Bias: Unbiased Safepoint-Based Stack Walking. But Erik’s solution for JDK 25 was much more complete and profited from his decades of experience with JFR. In March 2025, whether the new stack walker would get into JDK 25 was still unclear. So I came up with other ideas (which I’m glad I didn’t need). You can find that early brain-dump in Profiling idea (unsorted from March 2025).

In this post, I’ll focus on the core components of the new profiler, excluding the stack walking and safepoint handler. Hopefully, this won’t be the last article in the series; I’m already researching the next one.

Main Components

There are a few main components of the implementation that come together to form the profiler:

Continue reading

Profiling idea (unsorted from March 2025)

This is my actual collection of ideas from March 2025, when it was unclear whether the updated JFR sampling at safepoints made it into JDK 25. It eventually did, so I scrapped the ideas. But it offers the reader an interesting, unfiltered look into my ideas and thoughts at the time, probably only useful for people who are really into profiling and the OpenJDK. Just be aware that it is therefore a document of its time (March 2025) and doesn’t reflect the actual current implementation. Also, don’t expect any deeper explanations.

Well, I warned you…

An Experimental Front-End for JFR Queries

Ever wondered how the views of the jfr tool are implemented? There are views like hot-methods, which shows the most used methods, or cpu-load-samples, which shows the system load over time, that you can use directly on the command line:

> jfr view cpu-load-samples recording.jfr

                                     CPU Load

Time                         JVM User           JVM System           Machine Total
------------------ ------------------ -------------------- -----------------------
14:33:29                        8,25%                0,08%                  29,65%
14:33:30                        8,25%                0,00%                  29,69%
14:33:31                        8,33%                0,08%                  25,42%
14:33:32                        8,25%                0,08%                  27,71%
14:33:33                        8,25%                0,08%                  24,64%
14:33:34                        8,33%                0,00%                  30,67%
...

This is helpful when glancing at JFR files and trying to roughly understand their contents, without loading the files directly into more powerful, but also more resource-hungry, JFR viewers.

In this short blog post, I’ll show you how the views work under the hood using JFR queries and how to use the queries with my new experimental JFR query tool.

I didn’t forget the promised blog post on implementing the new CPU-time profiler in JDK 25; it’ll come soon.

Under the hood, JFR views use a built-in query language to define all views in the view.ini file. The above is, for example, defined as:

[environment.cpu-load-samples]
label = "CPU Load"
table = "SELECT startTime, jvmUser, jvmSystem, machineTotal FROM CPULoad"
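For comparison, a rough Java equivalent of this query can be written with the JFR consumer API. This is only a sketch and not how the jfr tool implements its views: it assumes that the CPULoad in the query corresponds to the jdk.CPULoad event and that its three fields hold fractions between 0 and 1. The class CpuLoadTable and its methods are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

final class CpuLoadTable {
    private static final DateTimeFormatter TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss").withZone(ZoneId.systemDefault());

    /** Formats one table row; the load values are assumed to be fractions (0.25 means 25%). */
    static String row(String time, float jvmUser, float jvmSystem, float machineTotal) {
        return String.format(Locale.ROOT, "%s %.2f%% %.2f%% %.2f%%",
                time, jvmUser * 100, jvmSystem * 100, machineTotal * 100);
    }

    /** Roughly: SELECT startTime, jvmUser, jvmSystem, machineTotal FROM CPULoad */
    static List<String> rows(Path recording) throws IOException {
        List<String> rows = new ArrayList<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(recording)) {
            if (event.getEventType().getName().equals("jdk.CPULoad")) {
                rows.add(row(TIME.format(event.getStartTime()),
                        event.getFloat("jvmUser"),
                        event.getFloat("jvmSystem"),
                        event.getFloat("machineTotal")));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        rows(Path.of(args[0])).forEach(System.out::println);
    }
}
```

Running it with a recording file as its only argument prints one line per CPU load sample.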

With my new query tool (GitHub), we can plot this as:

Continue reading

Java 25’s new CPU-Time Profiler (1)

This is the first part of my series; the other parts are

Back to the blog post:

More than three years in the making, with a concerted effort starting last year, my CPU-time profiler landed in Java with OpenJDK 25. It’s an experimental new profiler/method sampler that helps you find performance issues in your code, having distinct advantages over the current sampler. This is what this week’s and next week’s blog posts are all about. This week, I will cover why we need a new profiler and what information it provides; next week, I’ll cover the technical internals that go beyond what’s written in the JEP. I will quote the JEP 509 quite a lot, thanks to Ron Pressler; it reads like a well-written blog post in and of itself.

Before I show you its details, I want to focus on what the current default method profiler in JFR does:

Continue reading

Taming the Bias: Unbiased* Safepoint-Based Stack Walking in JFR

Two years ago, I still planned to implement a new version of AsyncGetCallTrace in Java. This plan didn’t materialize, but Erik Österlund had the idea to fully walk the stack at safepoints during the discussions. Walking stacks only at safepoints normally would incur a safepoint-bias (see The Inner Workings of Safepoints), but when you record some program state in signal handlers, you can prevent this. I wrote about this idea and its basic implementation in Taming the Bias: Unbiased Safepoint-Based Stack Walking. I’ll revisit this topic in this week’s short blog post because Markus Grönlund took Erik’s idea and started implementing it for the standard JFR method sampler:

Continue reading

A Glance into JFR Class and Method Tagging

If you’re here for eBPF content, this blog post is not for you. I recommend reading an article on a concurrency fuzzing scheduler at LWN.

Ever wonder how the JDK Flight Recorder (JFR) keeps track of the classes and methods it has collected for stack traces and more? In this short blog post, I’ll explore JFR tagging and how it works in the OpenJDK.

Tags

JFR files consist of self-contained chunks. Every chunk contains:

The maximum chunk size is usually 12MB, but you can configure it:

java -XX:FlightRecorderOptions:maxchunksize=1M

Whenever JFR collects methods or classes, it has to somehow tell the JFR writer which entities have been used so that their mapping can be written out. Each entity also has to have a tracing ID that can be used in the events that reference it.

This is where JFR tags come in. Every class, module, and package entity has a 64-bit value called _trace_id (e.g., classes), which consists of both the ID and the tag. Every method has an _orig_method_idnum, essentially its ID, and a trace flag, which is essentially the tag.

In a world without any concurrency, the tag could just be a single bit, telling us whether an entity is used. But in reality, an entity can be used in the new chunk while we’re writing out the old chunk. So, we need to have two distinct periods (0 and 1) and toggle between them whenever we write a chunk.
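The idea of two alternating periods can be modeled in a few lines of Java. This is a deliberately simplified, single-threaded sketch: the bit layout, the per-period queues, and all names are assumptions made for illustration and do not mirror HotSpot’s actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of two-period tagging; HotSpot's real bit layout and queues differ.
final class EpochTagging {
    static final class Entity {
        final String name;
        int tag; // bit 0: used in period 0, bit 1: used in period 1

        Entity(String name) {
            this.name = name;
        }
    }

    private int epoch = 0;
    // One queue of used entities per period.
    private final List<List<Entity>> queues = List.of(new ArrayList<>(), new ArrayList<>());

    /** Tags the entity for the current period; enqueues it only on its first use in that period. */
    boolean use(Entity e) {
        int bit = 1 << epoch;
        if ((e.tag & bit) != 0) {
            return false; // already tagged, so no duplicate entry
        }
        e.tag |= bit;
        queues.get(epoch).add(e);
        return true;
    }

    /** Toggles the period first, then writes out and untags everything used in the previous one. */
    List<String> rotate() {
        int previous = epoch;
        epoch ^= 1; // from now on, uses belong to the new period and chunk
        List<String> written = new ArrayList<>();
        for (Entity e : queues.get(previous)) {
            written.add(e.name);
            e.tag &= ~(1 << previous); // reset only the tag of the previous period
        }
        queues.get(previous).clear();
        return written;
    }
}
```

Using an entity twice in the same period enqueues it once, and after rotate() the same entity can be tagged again for the new period.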

Tagging

We can visualize the whole life cycle of a tag for a given entity:

In this example, the entity, a class, is brought into JFR by the method sampler (link) while walking another thread’s stack. This causes the class to be tagged and enqueued in the internal entity queue (and is therefore known to the JFR writer) if it hasn’t been tagged before (source):

inline void JfrTraceIdLoadBarrier::load_barrier(const Klass* klass) {
  SET_METHOD_AND_CLASS_USED_THIS_EPOCH(klass);
  assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant");
  enqueue(klass);
  JfrTraceIdEpoch::set_changed_tag_state();
}

inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass) {
  assert(klass != nullptr, "invariant");
  if (should_tag(klass)) {
    load_barrier(klass);
  }
  assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant");
  return TRACE_ID(klass);
}

This shows that tagging also prevents entities from being duplicated in a chunk.

Then, when a chunk is written out, a safepoint is first requested to initialize the next period (the next chunk) and to toggle the period, so that any subsequent use of an entity belongs to the new period and chunk. Then, the entity is written out, and its tag for the previous period is reset (code). This allows the aforementioned concurrency.

But how does it ensure that the tagged classes aren’t unloaded before they are emitted? By writing out the classes when any class is unloaded. This is simple yet effective and doesn’t need any change in the GC.

Conclusion

Tagging is used in JFR to properly record classes, methods, and other entities while also preventing them from accidentally being garbage collected before they are written out. This is a simple but memory-efficient solution. It works well in the context of concurrency but assumes entities are used in the event creation directly when tagging them. It is not supported to tag the entities and then push them into the queue to later create events asynchronously. This would probably require something akin to reference counting.

Thanks for coming this far in a blog post on a profiling-related topic. I chose this topic because I wanted to know more about tagging and plan to do more of these short OpenJDK-specific posts.

P.S.: I gave three talks at FOSDEM, on fuzzing schedulers, sched-ext, and profiling.

The slow Death of the onjcmd Debugger Feature

Almost to the day, a year ago, I published my blog post called Level-up your Java Debugging Skills with on-demand Debugging. In this blog post, I wrote about multiple rarely known and rarely used features of the Java debugging agent, including the onjcmd feature. To quote my own blog post:

JCmd triggered debugging

There are often cases where the code that you want to debug is executed later in your program’s run or after a specific issue appears. So don’t waste time running the debugging session from the start of your program, but use the onjcmd=y option to tell the JDWP agent to wait with the debugging session till it is triggered via jcmd.

A similar feature long existed in the SAPJVM. In 2019 Christoph Langer from SAP decided to add it to the OpenJDK, where it was implemented in JDK 12 and has been there ever since.

The alternative to using this feature is to start the debugging session at the beginning and only connect to the JDWP agent when you want to start debugging. But this was, for a time, significantly slower than using the onjcmd feature (source):

Continue reading

From C to Java Code using Panama

The Foreign Function & Memory API (also called Project Panama) has come a long way since it started. You can find the latest version implemented in JDK 21 as a preview feature (JEP 442; use --enable-preview to enable it). Its final form is specified by JEP 454:

By efficiently invoking foreign functions (i.e., code outside the JVM), and by safely accessing foreign memory (i.e., memory not managed by the JVM), the API enables Java programs to call native libraries and process native data without the brittleness and danger of JNI.

JEP 454

This is pretty helpful when trying to build wrappers around existing native libraries. Other languages, like Python with ctypes, have had this for a long time, but Java is getting a proper API for native interop, too. Of course, there is the Java Native Interface (JNI), but JNI is cumbersome and inefficient (call-sites aren’t inlined, and the overhead of converting data from Java to the native world and back is huge).

Be aware that the API is still in flux. Much of the existing non-OpenJDK documentation is not in sync.

Example

Now to my main example: Assume you’re tired of all the abstraction of the Java I/O API and just want to read a file using the traditional I/O functions of the C standard lib (like read_line.c): we’re trying to read the first line of the passed file, opening the file via fopen, reading the first line via fgets, and closing the file via fclose.

#include "stdio.h"
#include "stdlib.h"

int main(int argc, char *argv[]) {
  FILE* file = fopen(argv[1], "r");
  char* line = malloc(1024);
  fgets(line, 1024, file);
  printf("%s", line);
  fclose(file);
  free(line);
}

This would have involved writing C code in the old JNI days, but we can access the required C functions directly with Panama, wrapping the C functions and writing the C program as follows in Java:

public static void main(String[] args) {
    var file = fopen(args[0], "r");
    var line = gets(file, 1024);
    System.out.println(line);
    fclose(file);
}

But how do we implement the wrapper methods? We start with the FILE* fopen(char* file, char* mode) function which opens a file. Before we can call it, we have to get hold of its MethodHandle:

private static MethodHandle fopen = Linker.nativeLinker().downcallHandle(
        lookup("fopen"),
        FunctionDescriptor.of(/* return */ ValueLayout.ADDRESS, 
            /* char* file */ ValueLayout.ADDRESS, 
            /* char* mode */ ValueLayout.ADDRESS));

This looks up the fopen symbol in all the libraries that the current process has loaded, asking both the native linker’s default lookup and the loader lookup. This code is used in many examples, so we move it into the function lookup:

public static MemorySegment lookup(String symbol) {
    return Linker.nativeLinker().defaultLookup().find(symbol)
                 .or(() -> SymbolLookup.loaderLookup().find(symbol))
                 .orElseThrow();
}

The look-up returns the memory address at which the looked-up function is located.

We can proceed with the address of fopen and use it to create a MethodHandle that calls down from the JVM into native code. For this, we also have to specify the descriptor of the function so that the JVM knows how to call the fopen handle properly.

But how do we use this handle? Every handle has an invokeExact function (and an invoke function that allows the JVM to convert data) that we can use. The only problem is that we want to pass strings to the fopen call. We cannot pass the strings directly but instead have to allocate them onto the C heap, copying the chars into a C string:

public static MemorySegment fopen(String filename, String mode) {
    try (var arena = Arena.ofConfined()) {
        return (MemorySegment) fopen.invokeExact(
                arena.allocateUtf8String(filename),
                arena.allocateUtf8String(mode));
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

In JDK 22 allocateUtf8String changes to allocateFrom (thanks Brice Dutheil for spotting this).

We use a confined arena for allocations, which is closed, freeing its memory, when the try-with-resources block exits. The newly allocated strings are then used to invoke fopen, letting us return the FILE*.

Older tutorials might mention MemorySessions, but they were removed in JDK 21.

After opening the file, we can focus on the char* fgets(char* buffer, int size, FILE* file) function. This function is passed a buffer of a given size, storing the next line from the passed file in the buffer.

Getting a MethodHandle is similar to fopen:

private static MethodHandle fgets = Linker.nativeLinker().downcallHandle(
        PanamaUtil.lookup("fgets"),
        FunctionDescriptor.of(ValueLayout.ADDRESS, 
                              ValueLayout.ADDRESS, 
                              ValueLayout.JAVA_INT, 
                              ValueLayout.ADDRESS));

Only the wrapper method differs because we have to allocate the buffer in the arena:

public static String gets(MemorySegment file, int size) {
    try (var arena = Arena.ofConfined()) {
        var buffer = arena.allocateArray(ValueLayout.JAVA_BYTE, size);
        var ret = (MemorySegment) fgets.invokeExact(buffer, size, file);
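        // compare addresses: == on MemorySegment objects only checks identity
        if (ret.address() == 0) {
            return null; // fgets returned NULL: error or end of file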
        }
        return buffer.getUtf8String(0);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

Finally, we can implement the int fclose(FILE* file) function to close the file:

private static MethodHandle fclose = Linker.nativeLinker().downcallHandle(
        PanamaUtil.lookup("fclose"),
        FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS));

public static int fclose(MemorySegment file) {
    try {
        return (int) fclose.invokeExact(file);
    } catch (Throwable e) {
        throw new RuntimeException(e);
    }
}

You can find the source code in my panama-examples repository on GitHub (file HelloWorld.java) and run it on a Linux x86_64 machine via

> ./run.sh HelloWorld LICENSE # build and run
                                 Apache License

which prints the first line of the license file.

Errno

We didn’t care much about error handling here, but sometimes, we want to know precisely why a C function failed. Luckily, the C standard library on Linux and other Unixes has errno:

Several standard library functions indicate errors by writing positive integers to errno.

CPP Reference

On error, fopen returns a null pointer and sets errno. You can find information on all the possible error numbers on the man page for the open function.

We only need a way to obtain errno directly after a call. For this, we have to capture the call state and declare the capture-call-state option when creating the MethodHandle for fopen:

try (var arena = Arena.ofConfined()) {
    // declare the errno as state to be captured, 
    // directly after the downcall without any interference of the
    // JVM runtime
    StructLayout capturedStateLayout = Linker.Option.captureStateLayout();
    VarHandle errnoHandle = 
        capturedStateLayout.varHandle(
            MemoryLayout.PathElement.groupElement("errno"));
    Linker.Option ccs = Linker.Option.captureCallState("errno");

    MethodHandle fopen = Linker.nativeLinker().downcallHandle(
            lookup("fopen"), 
            FunctionDescriptor.of(POINTER, POINTER, POINTER), 
            ccs);

    MemorySegment capturedState = arena.allocate(capturedStateLayout);
    try {
        // reading a non-existent file, this will set the errno
        MemorySegment result = 
            (MemorySegment) fopen.invoke(capturedState,
                // for our example we pick a file that doesn't exist
                // this ensures a proper error number
                arena.allocateUtf8String("nonexistent_file"),
                arena.allocateUtf8String("r"));
        int errno = (int) errnoHandle.get(capturedState);
        System.out.println(errno);
        return result;
    } catch (Throwable e) {
        throw new RuntimeException(e);
    }
}

To convert this error number into a string, we can use the char* strerror(int errno) function:

// a returned char* requires this specific type
static AddressLayout POINTER = 
    ValueLayout.ADDRESS.withTargetLayout(
        MemoryLayout.sequenceLayout(JAVA_BYTE));
static MethodHandle strerror = Linker.nativeLinker()
        .downcallHandle(lookup("strerror"),
                FunctionDescriptor.of(POINTER, 
                    ValueLayout.JAVA_INT));

static String errnoString(int errno){
    try {
        MemorySegment str = 
            (MemorySegment) strerror.invokeExact(errno);
        return str.getUtf8String(0);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

When we then print the error string in our example after the fopen call, we get:

No such file or directory 

This is as expected, as we hard-coded a non-existent file in the fopen call.

JExtract

Creating all the MethodHandles manually can be pretty tedious and error-prone. JExtract can parse header files, generating MethodHandles and more automatically. You can download jextract on the project page.

For our example, I wrote a small wrapper around jextract that automatically downloads the latest version and calls it on the misc/headers.h file to create MethodHandles in the class Lib. The headers file includes all the necessary headers to run examples:

#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

For the fopen function, for example, jextract generates the following entry point:

public static MethodHandle fopen$MH() {
    return RuntimeHelper.requireNonNull(constants$48.const$0,"fopen");
}
/**
 * {@snippet :
 * FILE* fopen(char* __filename, char* __modes);
 * }
 */
public static MemorySegment fopen(MemorySegment __filename, MemorySegment __modes) {
    var mh$ = fopen$MH();
    try {
        return (java.lang.foreign.MemorySegment)mh$.invokeExact(__filename, __modes);
    } catch (Throwable ex$) {
        throw new AssertionError("should not reach here", ex$);
    }
}

Of course, we still have to take care of the string allocation in our wrapper, but this wrapper gets significantly smaller:

public static MemorySegment fopen(String filename, String mode) {
    try (var arena = Arena.ofConfined()) {
        // using the MethodHandle that has been generated 
        // by jextract
        return Lib.fopen( 
                arena.allocateUtf8String(filename),
                arena.allocateUtf8String(mode));
    }
} 

You can find the example code in the GitHub repository in the file HelloWorldJExtract.java. I integrated jextract via a wrapper directly into the Maven build process, so just run mvn package to invoke the tool.

More Information

There are many other resources on Project Panama, but be aware that they might be dated. Therefore, I recommend reading JEP 454, which describes the newly introduced API in great detail. Additionally, the talk “The Panama Dojo: Black Belt Programming with Java 21 and the FFM API” by Per Minborg at this year’s Devoxx Belgium is a great introduction:

As well as the talk by Maurizio Cimadamore at this year’s JVMLS:

Conclusion

Project Panama greatly simplifies interfacing with existing native libraries. I hope it will gain traction after leaving the preview state with the upcoming JDK 22, but it should already be stable enough for small experiments and side projects.

I hope my introduction gave you a glimpse into Panama; as always, I’m happy for any comments, and I’ll see you next week(ish) for the start of a new blog series.

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone. Thank you to my colleague Martin Dörr, who helped me with Panama and ported Panama to PowerPC.

AsyncGetCallTrace Reworked: Frame by Frame with an Iterative Touch!

AsyncGetCallTrace is an API to obtain the top n Java frames of a thread asynchronously in a signal handler. This API is widely used but has its problems; see JEP 435 and my various blog posts (AsyncGetStackTrace: A better Stack Trace API for the JVM, jmethodIDs in Profiling: A Tale of Nightmares, …). My original approach with my JEP proposal was to build a replacement of the API, which could be used as a drop-in for AsyncGetCallTrace: Still a single method that populates a preallocated frame list:

No doubt this solves a few of the problems: the new API would be officially supported, return more information, and could return the program counter for C/C++ frames. But it eventually felt more like a band-aid, hindered by trying to mimic AsyncGetCallTrace. In recent months, I had a few discussions with Erik Österlund and Jaroslav Bachorik in which we concluded that what we really need is a completely redesigned profiling API that isn’t just an AsyncGetCallTrace v2.

The new API should be more flexible, safer, and future-proof than the current version. It should, if possible, allow incremental stack scanning and support virtual threads. So I got to work redesigning and, more crucially, rethinking the profiling API, inspired by Erik Österlund’s ideas.

This blog post is the first of two blog posts covering the draft of a new iterator-based stack walking API, which builds the base for the follow-up blog post on safepoint-based profiling. The following blog post will come out on Wednesday as a special for the OpenJDK Committers’ Workshop.

Iterators

AsyncGetCallTrace fills a preallocated list of frames, which has to be sized for the deepest expected stack trace, and many profilers just store away this list. This limits the amount of data we can give for each frame. We don’t have this problem with an iterator-based API, where we first create an iterator for the current stack and then walk from frame to frame:

The API can offer all the valuable information the JVM has, and the profiler developer can pick the relevant information. This API is, therefore, much more flexible; it allows the profiler writer to …

  • … walk all frames without a limit
  • … obtain program counter, stack pointer, and frame pointer to use their own stack walking code for C/C++ frames between Java frames
  • … use their own compression scheme for the data
  • … not worry about allocating too much data on the stack, because the API doesn’t force them to preallocate a large number of frames

This API can be used to develop your version of AsyncGetCallTrace, allowing seamless integration into existing applications.
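To make the “own compression scheme” point a bit more concrete, here is a minimal sketch that folds a whole stack into a single 64-bit hash while walking it, without ever storing a frame array. MockFrame, MockIterator, and mockNextFrame are hypothetical stand-ins that only mimic the shape of the proposed API, so the block compiles without the modified JDK; the FNV-1a hash is just one possible choice, not part of the proposal.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the proposed API (assumption, not profile2.h).
struct MockFrame {
  std::uintptr_t method;  // stand-in for the method identifier
  int bci;
};

struct MockIterator {
  const MockFrame* frames;
  std::size_t size;
  std::size_t pos;
};

// Same contract as described for the iterator: 1 if a frame was
// written, 0 when the end of the stack is reached.
int mockNextFrame(MockIterator* iter, MockFrame* frame) {
  if (iter->pos >= iter->size) {
    return 0;
  }
  *frame = iter->frames[iter->pos++];
  return 1;
}

// Fold the stack into one FNV-1a hash, frame by frame. Only a single
// frame lives on the stack at any time, regardless of the stack depth.
std::uint64_t hashStack(MockIterator* iter) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
  auto mix = [&hash](std::uint64_t value) {
    for (int i = 0; i < 8; i++) {
      hash ^= (value >> (i * 8)) & 0xff;
      hash *= 1099511628211ULL;  // FNV prime
    }
  };
  MockFrame frame;
  while (mockNextFrame(iter, &frame) == 1) {
    mix(frame.method);
    mix(static_cast<std::uint32_t>(frame.bci));
  }
  return hash;
}
```

A profiler could use such a hash as a key to count identical stacks and only store the full trace the first time it sees a new key.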

Using the API in a signal handler and writing it using C declarations imposes some constraints, which result in a slightly more complex API that I cover in the following section.

Proposed API

When running in a signal handler, a significant constraint is that we have to allocate everything on the stack. This includes the iterator. The problem is that we don’t want to specify the size of the iterator in the API because this iterator is based on an internal stack walker and is subject to change. Therefore, we have to allocate the iterator on the stack inside an API method, but this iterator is only valid in the method’s scope. This is the reason for ASGST_RunWithIterator, which creates an iterator and passes it to a handler:

// Create an iterator and pass it to fun alongside 
// the passed argument.
// @param options ASGST_INCLUDE_NON_JAVA_FRAMES, ...
// @return error or kind
int ASGST_RunWithIterator(void* ucontext, 
    int32_t options, 
    ASGST_IteratorHandler fun, 
    void* argument);

The iterator handler is a pointer to a function that ASGST_RunWithIterator calls with an iterator and the argument. Yes, this could be nicer in C++, with lambdas and more, but we are constrained to a C API. It’s easy to develop a helper library in C++ that offers zero-cost abstractions, but this is out of scope for the initial proposal.
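Such a C++ helper could look roughly like the following sketch. It is not part of the proposal: MockIterator, MockIteratorHandler, and mockRunWithIterator are hypothetical stand-ins with the same shape as the C declarations above, and runWithIterator is a name I made up for the wrapper.

```cpp
#include <cstdint>

// Hypothetical stand-ins, so the sketch compiles without the modified JDK.
struct MockIterator {
  int remaining_frames;
};
using MockIteratorHandler = void (*)(MockIterator* iterator, void* argument);

// Same shape as the C function: the iterator lives on this function's
// stack and is only valid while the handler runs.
int mockRunWithIterator(void* /*ucontext*/, std::int32_t /*options*/,
                        MockIteratorHandler fun, void* argument) {
  MockIterator iterator{3};
  fun(&iterator, argument);
  return 1;
}

// Accepts any callable (for example a capturing lambda) and forwards it
// through the "function pointer + void*" interface of the C API.
template <typename Handler>
int runWithIterator(void* ucontext, std::int32_t options, Handler handler) {
  // A captureless lambda converts to a plain function pointer; the
  // callable itself travels through the void* argument.
  MockIteratorHandler trampoline = [](MockIterator* iterator, void* argument) {
    (*static_cast<Handler*>(argument))(iterator);
  };
  return mockRunWithIterator(ucontext, options, trampoline, &handler);
}
```

Compilers can usually inline the trampoline, and the wrapper does not allocate, so it should be usable in the same places as the plain C call. A call site then shrinks to something like runWithIterator(ucontext, 0, [&](MockIterator* it) { ... }).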

Now to the iterator itself. The main method is ASGST_NextFrame:

// Obtains the next frame from the iterator
// @returns 1 if successful, else error code (< 0) / end (0)
// @see ASGST_State
//
// Typically used in a loop like:
//
// ASGST_Frame frame;
// while (ASGST_NextFrame(iterator, &frame) == 1) {
//   // do something with the frame
// }
int ASGST_NextFrame(ASGST_Iterator* iterator, ASGST_Frame* frame);

The frame data structure, as explained in the previous section, contains all required information and is far simpler than the previous proposal (without any union):

enum ASGST_FrameTypeId {
  ASGST_FRAME_JAVA         = 1, // JIT compiled and interpreted
  ASGST_FRAME_JAVA_INLINED = 2, // inlined JIT compiled
  ASGST_FRAME_JAVA_NATIVE  = 3, // native wrapper to call 
                                // C/C++ methods from Java
  ASGST_FRAME_NON_JAVA     = 4  // C/C++/... frames
};

typedef struct {
  uint8_t type;         // frame type
  int comp_level;       // compilation level, 0 is interpreted, 
                        // -1 is undefined, > 1 is JIT compiled
  int bci;              // -1 if the bci is not available 
                        // (like in native frames)
  ASGST_Method method;  // method or nullptr if not available
  void *pc;             // current program counter 
                        // inside this frame
  void *sp;             // current stack pointer 
                        // inside this frame, might be null
  void *fp;             // current frame pointer 
                        // inside this frame, might be null
} ASGST_Frame;

This uses ASGST_Method instead of jmethodID, see jmethodIDs in Profiling: A Tale of Nightmares for more information.

The error codes used both by ASGST_RunWithIterator and ASGST_NextFrame are defined as:

enum ASGST_Error {
  ASGST_NO_FRAME            =  0, // came to an end
  ASGST_NO_THREAD           = -1, // thread is not here
  ASGST_THREAD_EXIT         = -2, // dying thread
  ASGST_UNSAFE_STATE        = -3, // thread is in unsafe state
  ASGST_NO_TOP_JAVA_FRAME   = -4, // no top java frame
  ASGST_ENQUEUE_NO_QUEUE    = -5, // no queue registered
  ASGST_ENQUEUE_FULL_QUEUE  = -6, // safepoint queue is full
  ASGST_ENQUEUE_OTHER_ERROR = -7, // other error, 
                                  // like currently at safepoint
  // everything lower than -16 is implementation specific
};

ASGST_ENQUEUE_NO_QUEUE and ASGST_ENQUEUE_FULL_QUEUE are not relevant yet, but their importance will be evident in my next blog post.

This API wouldn’t be complete without a few helper methods. We might want to start from an arbitrary frame; for example, we use a custom stack walker for the top C/C++ frames:

// Similar to RunWithIterator, but starting from 
// a frame (sp, fp, pc) instead of a ucontext.
int ASGST_RunWithIteratorFromFrame(void* sp, void* fp, void* pc, 
  int options, ASGST_IteratorHandler fun, void* argument);

The ability to rewind an iterator is helpful too:

// Rewind an iterator to the topmost frame
void ASGST_RewindIterator(ASGST_Iterator* iterator);
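One thing rewinding makes possible is walking the stack twice: once to learn its depth and once to pick the frames you actually want. The sketch below keeps the frames closest to the thread’s entry point instead of the topmost ones. MockFrame, MockIterator, mockNextFrame, and mockRewindIterator are hypothetical stand-ins for the real types and functions, and storeBottomFrames is just an illustrative name.

```cpp
// Hypothetical stand-ins, so the sketch compiles without the modified JDK.
struct MockFrame {
  int bci;
};

struct MockIterator {
  const MockFrame* frames;
  int size;
  int pos;
};

// 1 if a frame was written, 0 at the end of the stack.
int mockNextFrame(MockIterator* iter, MockFrame* frame) {
  if (iter->pos >= iter->size) {
    return 0;
  }
  *frame = iter->frames[iter->pos++];
  return 1;
}

// Back to the topmost frame.
void mockRewindIterator(MockIterator* iter) {
  iter->pos = 0;
}

constexpr int MAX_STORED = 4;

// Stores up to MAX_STORED frames from the bottom of the stack in `out`
// and reports the full depth via `total_depth`.
// Returns the number of frames stored.
int storeBottomFrames(MockIterator* iter, MockFrame* out, int* total_depth) {
  MockFrame frame;
  int depth = 0;
  while (mockNextFrame(iter, &frame) == 1) {  // first pass: count only
    depth++;
  }
  *total_depth = depth;
  mockRewindIterator(iter);
  int skip = depth > MAX_STORED ? depth - MAX_STORED : 0;
  int stored = 0;
  for (int i = 0; mockNextFrame(iter, &frame) == 1; i++) {  // second pass
    if (i >= skip && stored < MAX_STORED) {
      out[stored++] = frame;
    }
  }
  return stored;
}
```

The second pass costs another walk, so this only pays off when the buffer is scarce or the bottom frames matter more than the top ones.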

And just in case you want to get the state of the current iterator or thread, there are two methods for you:

// State of the iterator, corresponding 
// to the next frame return code
// @returns error code or 1 if no error
// if iterator is null or at end, return ASGST_NO_FRAME,
// returns a value < -16 if the implementation encountered 
// a specific error
int ASGST_State(ASGST_Iterator* iterator);

// Returns state of the current thread, which is a subset
// of the JVMTI thread state.
// no JVMTI_THREAD_STATE_INTERRUPTED, 
// limited JVMTI_THREAD_STATE_SUSPENDED.
int ASGST_ThreadState();

But how can we use this API? I developed a small profiler in my Writing a Profiler from Scratch series, which we can now use to demonstrate the methods defined above. Based on my Writing a Profiler in 240 Lines of Pure Java blog post, I added a flame graph implementation. In the meantime, you can also find the base implementation on GitHub.

Implementing a Small Profiler

First of all, you have to build and use my modified OpenJDK. This JDK has been tested on x86 and aarch64. The profiler API implementation is still a prototype and contains known errors, but it works well enough to build a small profiler. Feel free to review the code; I’m open to help, suggestions, or sample programs and tests.

To use this new API, you have to include the profile2.h header file. There might be some linker issues on macOS, so add -L$JAVA_HOME/lib/server -ljvm to your compiler options.

One of the essential parts of this new API is that, as it doesn’t use jmethodID, we don’t have to pre-touch every method (learn more on this in jmethodIDs in Profiling: A Tale of Nightmares). Therefore we don’t need to listen to ClassLoad JVMTI events or iterate over all existing classes at the beginning. So the reasonably complex code

static void JNICALL OnVMInit(jvmtiEnv *jvmti, 
 JNIEnv *jni_env, jthread thread) {
  jint class_count = 0;
  env = jni_env;
  sigemptyset(&prof_signal_mask);
  sigaddset(&prof_signal_mask, SIGPROF);
  OnThreadStart(jvmti, jni_env, thread);
  // Get any previously loaded classes 
  // that won't have gone through the
  // OnClassPrepare callback to prime 
  // the jmethods for AsyncGetCallTrace.
  JvmtiDeallocator<jclass> classes;
  ensureSuccess(jvmti->GetLoadedClasses(&class_count,
      classes.addr()), 
    "Loading classes failed");

  // Prime any class already loaded and 
  // try to get the jmethodIDs set up.
  jclass *classList = classes.get();
  for (int i = 0; i < class_count; ++i) {
    GetJMethodIDs(classList[i]);
  }

  startSamplerThread();
}

is reduced to just

static void JNICALL OnVMInit(jvmtiEnv *jvmti, JNIEnv *jni_env, 
 jthread thread) {
  sigemptyset(&prof_signal_mask);
  sigaddset(&prof_signal_mask, SIGPROF);
  OnThreadStart(jvmti, jni_env, thread);
  startSamplerThread();
}

improving the start-up/attach performance of the profiler along the way. To get from the new ASGST_Method identifiers to the method name we need for the flame graph, we don’t use the JVMTI methods but ASGST methods:

static std::string methodToString(ASGST_Method method) {
  // assuming we only care about the first 99 chars
  // of method names, signatures and class names
  // allocate all character arrays on the stack
  char method_name[100];
  char signature[100];
  char class_name[100];
  // setup the method info
  ASGST_MethodInfo info;
  info.method_name = (char*)method_name;
  info.method_name_length = 100;
  info.signature = (char*)signature;
  info.signature_length = 100;
  // we ignore the generic signature
  info.generic_signature = nullptr;
  // obtain the information
  ASGST_GetMethodInfo(method, &info);
  // setup the class info
  ASGST_ClassInfo class_info;
  class_info.class_name = (char*)class_name;
  class_info.class_name_length = 100;
  // we ignore the generic class name
  class_info.generic_class_name = nullptr;
  // obtain the information
  ASGST_GetClassInfo(info.klass, &class_info);
  // combine all
  return std::string(class_info.class_name) + "." + 
    std::string(info.method_name) + std::string(info.signature);
}

This method is then used in the profiling loop after obtaining the traces for all threads. But of course, by then, the methods may already be unloaded. This is rare but something to consider, as it may cause segmentation faults. Due to this, and for performance reasons, we could register class unload handlers and obtain the names of the methods of unloaded classes therein, as well as obtain the names of all still-loaded, used ASGST_Methods when the agent is detached (or the JVM exits). This will be a topic for another blog post.
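Until then, here is a rough sketch of the caching idea: resolve a method’s name once, while the method is known to be alive, and only consult the cache afterwards. MockMethod is a hypothetical stand-in for ASGST_Method, the resolver plays the role of methodToString, and MethodNameCache is a name I made up. It ignores the question of identifiers being reused after unloading, and it must not be called from a signal handler because it allocates.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the method identifier (assumption).
using MockMethod = const void*;

class MethodNameCache {
 public:
  // Resolve the name at most once per method, for example from a
  // class unload handler while the method is still valid.
  template <typename Resolver>
  const std::string& remember(MockMethod method, Resolver&& resolve) {
    auto it = names_.find(method);
    if (it == names_.end()) {
      it = names_.emplace(method, resolve(method)).first;
    }
    return it->second;
  }

  // Safe after unloading: never touches the method itself.
  std::string lookup(MockMethod method) const {
    auto it = names_.find(method);
    return it == names_.end() ? std::string("<unknown>") : it->second;
  }

 private:
  std::unordered_map<MockMethod, std::string> names_;
};
```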

Another significant difference between the new API and the old API is that it lacks a pre-defined trace data structure. So the profiler requires its own:

struct CallTrace {
  std::array<ASGST_Frame, MAX_DEPTH> frames;
  int num_frames;

  std::vector<std::string> to_strings() const {
    std::vector<std::string> strings;
    for (int i = 0; i < num_frames; i++) {
      strings.push_back(methodToString(frames[i].method));
    }
    return strings;
  }
};

We still use the pre-defined frame data structure in this example for brevity, but the profiler could customize this too. This allows the profiler to store only the relevant information.
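A customized frame could, for instance, drop the program counter, stack pointer, and frame pointer if the profiler only needs method, bci, and type for its flame graph. The following sketch shows the idea; MockFrame mirrors the fields of the frame structure above with a plain pointer standing in for ASGST_Method (an assumption on my side), and CompactFrame and compact are hypothetical names.

```cpp
#include <cstdint>

// Stand-in with the same fields as the frame structure shown earlier.
// Using a plain pointer for the method is an assumption.
struct MockFrame {
  std::uint8_t type;
  int comp_level;
  int bci;
  const void* method;
  void* pc;
  void* sp;
  void* fp;
};

// Only what a simple flame graph needs: which method, where in it,
// and what kind of frame.
struct CompactFrame {
  const void* method;
  std::int32_t bci;
  std::uint8_t type;
};

inline CompactFrame compact(const MockFrame& frame) {
  return CompactFrame{frame.method, static_cast<std::int32_t>(frame.bci),
                      frame.type};
}

// The compact variant has to be smaller, or the exercise is pointless.
static_assert(sizeof(CompactFrame) < sizeof(MockFrame),
              "CompactFrame should save space");
```

On a typical 64-bit platform, this shrinks each stored frame to a fraction of the full structure, which adds up quickly for deep stacks and many samples.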

We fill the related global_traces entries in the signal handler. Previously we just called:

static void signalHandler(int signo, siginfo_t* siginfo, 
 void* ucontext) {
  asgct(&global_traces[available_trace++], 
    MAX_DEPTH, ucontext);
  stored_traces++;
}

But now we have to use the ASGST_RunWithIterator with a callback. So we define the callback first:

void storeTrace(ASGST_Iterator* iterator, void* arg) {
  CallTrace *trace = (CallTrace*)arg;
  ASGST_Frame frame;
  int count;
  for (count = 0; ASGST_NextFrame(iterator, &frame) == 1 && 
         count < MAX_DEPTH; count++) {
    trace->frames[count] = frame;  
  }
  trace->num_frames = count;
}

We use the argument pass-through from ASGST_RunWithIterator to the callback to pass the CallTrace instance where we want to store the traces. We then walk the trace using the ASGST_NextFrame method and iterate till the maximum count is reached, or the trace is finished.

ASGST_RunWithIterator itself is called in the signal handler:

static void signalHandler(int signo, siginfo_t* siginfo, 
 void* ucontext) {
  CallTrace &trace = global_traces[available_trace++];
  int ret = ASGST_RunWithIterator(ucontext, 0, 
              &storeTrace, &trace);
  if (ret >= 2) { // non Java trace
    ret = 0;
  }
  if (ret <= 0) { // error
    trace.num_frames = ret;
  }
  stored_traces++;
}
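Because the handler stores the non-positive return code in num_frames when no trace could be obtained, the profiling loop can later tell valid traces from failed ones. A minimal sketch of such a pass, running outside of the signal handler, might look like this; MockTrace is a hypothetical stand-in for CallTrace that only keeps num_frames, and TraceSummary and summarize are illustrative names.

```cpp
#include <map>
#include <vector>

// Stand-in for CallTrace, reduced to the field we need here.
struct MockTrace {
  int num_frames;  // > 0: frame count, 0: no Java frames, < 0: error code
};

struct TraceSummary {
  int valid = 0;              // traces with at least one frame
  int empty = 0;              // traces without frames
  long total_frames = 0;      // frames over all valid traces
  std::map<int, int> errors;  // error code -> number of occurrences
};

TraceSummary summarize(const std::vector<MockTrace>& traces) {
  TraceSummary summary;
  for (const MockTrace& trace : traces) {
    if (trace.num_frames > 0) {
      summary.valid++;
      summary.total_frames += trace.num_frames;
    } else if (trace.num_frames == 0) {
      summary.empty++;
    } else {
      summary.errors[trace.num_frames]++;
    }
  }
  return summary;
}
```

A high share of one specific error code is usually a good hint where to look first when the flame graph seems to be missing samples.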

You can find the complete code on GitHub; feel free to ask any yet unanswered questions. To use the profiler, just run it from the command line:

java -agentpath:libSmallProfiler.so=output=flames.html \
  -cp samples math.MathParser

This assumes that you use the modified OpenJDK. MathParser is a demo program that generates and evaluates simple mathematical expressions. I wrote this for a compiler lab while I was still a student. The resulting flame graph should look something like this:

Conclusion

Using an iterator-based profiling API in combination with better method ids offers flexibility, performance, and safety for profiler writers. The new API is better than the old one, but it will become even better. Get ready for the next blog post, in which I tell you about safepoints and why it matters that there is a safepoint check before unwinding any physical frame, which is the reason why I found a bug in The Inner Workings of Safepoints. So it will all come together.

Thank you for coming this far; I hope you enjoyed this blog post, and I’m open to any suggestions on my profiling API proposal.

This project is part of my work in the SapMachine team at SAP, making profiling easier for everyone.