Mastering the Art of Controlling the JIT: Unlocking Reproducible Profiler Tests

In my last blog post, Using Async-Profiler and Jattach Programmatically with AP-Loader, I hinted that I’m currently working on a test library for writing better profiling API tests. The library is still a work in progress, but it already allows you to write profiling API tests in plain Java:

private int innerASGCT2() {
    new Tracer().runASGCT().assertTrue(
        Frame.hasMethod(0, "innerASGCT2", "()I"), 
        Frame.hasMethod(1, "testRunASGCT2"));
    return 0;
}

@Test
public void testRunASGCT2() {
    innerASGCT2();
}

This test case checks that calling AsyncGetCallTrace gives the correct result in this specific example. The test library allows you to write tests comparing the results of multiple GetStackTrace, AsyncGetCallTrace, and AsyncGetStackTrace invocations in different modes and settings. The library can be found as trace-tester on GitHub; I aim to bring it into the OpenJDK later with my JEP.

Writing small test cases this way is great, but it would be even better if we could force specific methods to be compiled, interpreted, or inlined so that we can test different scenarios. The proposed AsyncGetStackTrace will return the compilation level directly for every frame, so it is necessary to check the correctness of the level too.

Consider reading my Validating Java Profiling APIs post to get a different angle on profiling API testing.

Introduction

Before I start discussing the ways you can force methods to be compiled, interpreted, or inlined, I have to clarify a few things:

  1. The following only works with the HotSpot tiered JIT compiler and not with other JVMs like OpenJ9 (see issue #11272).
  2. It should only be used for testing. I would refrain from using it anywhere near production, even if you know that specific methods should be compiled. Use a tool like JITWatch by Chris Newland to check whether the JVM makes the correct decisions automatically; if it does not, ask your fellow JVM expert how to deal with this.
  3. I’m not an expert in the APIs I’m showing you, nor in tiered compilation, so be aware that I might be missing something; I’m happy about any suggestions and corrections.
  4. There are four different compilation levels, but I’m subsuming all C1 variants under the C1 label because some of the techniques I use only work on the C1/C2/inlined level. You can read more on tiered compilation in articles like Tiered Compilation in JVM on Baeldung.

Now that I’ve finished the obligatory disclaimer: What are the stages in the life of a method with a tiered JIT?

The first time the JVM executes a method, the method’s bytecode is interpreted without compilation. This allows the JVM to gather information on the method, as C1 and C2 are profile-guided.

The method is then compiled when the JVM deems this to be beneficial, usually after the method has been executed a few times. The next call of the method will then use the compiled version. The method is initially compiled with different levels of the C1 compiler before finally being C2 compiled, which takes the longest but produces the best native instructions.

The JVM might decide at any point to use the interpreted version of a method by deoptimizing it. The compiled versions are kept, depending on the compiler and the reasons for the deoptimization.

Every compiler can decide to inline the methods called by the method it is currently compiling. A compiler uses the initial bytecode for this purpose.

What we want and what we get

Ideally, we would tell the JVM to just use a method in its compiled version. But this is not possible, as the JVM does not have any of the information it needs for compilation before the first execution of a method. We therefore have to execute the method (or the benchmark) first and then set the compilation level.
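
In code, the pattern looks roughly like the following sketch; it uses the WhiteBox API that I introduce below, and the class and method names are placeholders:

import java.lang.reflect.Method;
import jdk.test.whitebox.WhiteBox;

public class ForceCompilationSketch {
    static final WhiteBox wb = WhiteBox.getWhiteBox();

    static int work() { return 42; } // placeholder for the method to compile

    public static void main(String[] args) throws Exception {
        // execute the method first so the JVM gathers profiling information
        for (int i = 0; i < 10_000; i++) {
            work();
        }
        Method m = ForceCompilationSketch.class.getDeclaredMethod("work");
        // then force the desired compilation level (4 = C2, see below)
        wb.enqueueMethodForCompilation(m, 4);
    }
}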

How do we get it?

We can split the task of forcing a method to be compiled (or inlined, for that matter) into two parts:

  1. Force all methods into their respective state (→ WhiteBox API) after the initial execution.
  2. Force the JIT to never compile a method with a different compiler (→ Compiler Control)

The following is the modified state diagram when forcing a method to be C1 compiled:

In the following, I’ll discuss how to use both the WhiteBox API and Compiler Control to facilitate the wanted behavior.

WhiteBox API

Many JVM tests are written in the JTreg framework, allowing developers to write these tests in Java. But these tests often require specific functionality not regularly available to Java developers. This functionality is exported in the WhiteBox API:

One of the not so well-known tools of the HotSpot VM is its WhiteBox testing API. Introduced in Java 7 it has been significantly improved and extended in Java 8 and 9. It can be used to query or change HotSpot internals which are not otherwise exposed to Java-land. While its features make it an indispensable tool for writing good HotSpot regression tests, it can also be used for experiments or for the mere fun of peeking into the VM. This entry will focus on the usage of the WhiteBox API in Java 8 and 9.

The WhiteBox API is implemented as a Java class (called sun.hotspot.WhiteBox) which defines various entry points into the HotSpot VM. Most of the functionality is implemented natively, directly in the HotSpot VM. The API is implemented as a singleton which can be easily retrieved by calling the static method WhiteBox.getWhiteBox().

Unfortunately, currently even a simple JavaDoc documentation of the API doesn’t exist, so in order to make full use of its functionality, you’ll have to peek right into WhiteBox.java.

The WhiteBox testing API

This API can be used outside of JTreg tests after enabling it by passing -Xbootclasspath/a:wb.jar -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI as JVM arguments. To use it, you have to build the WhiteBox JAR from scratch for your specific JVM by calling make build-test-lib (after you set up the build via the configure script).

But please be aware that using this API outside of JVM tests is relatively rare, and the documentation is still non-existent, so using it entails reading a lot of JDK sources and experimentation.

The build target did not work in JDK 21, and when I fixed it, the first question in the PR was by Daniel Jelinski, who asked:

That’s interesting. How did you find this? Is the result of this target used anywhere?
As far as I could tell, the build-test-lib target itself is not used anywhere. The classes that fail to compile here are used by tests without any problems – each test specifies the necessary imports individually. Should we remove this make target instead?

8307732: build-test-lib is broken #13885

So you certainly should not depend on it.

The WhiteBox API consists of the singleton class jdk.test.whitebox.WhiteBox, which offers many methods: from GC-related methods like boolean isObjectInOldGen(Object o) and void fullGC(), to NMT-related methods like long NMTMalloc(long size), and JIT-related methods like void deoptimizeAll().

You can even use it to force the compilation of a method and to set JVM flags, as shown in this example by Jean-Philippe Bempel:

public class WhiteBoxTest {
    static WhiteBox wb = WhiteBox.getWhiteBox();

    private void m() {
        System.out.println("foo");
    }

    public static void main(String[] args) throws Exception {
        wb.setBooleanVMFlag("PrintCompilation", true);
        wb.setBooleanVMFlag("BackgroundCompilation", false);
        wb.enqueueMethodForCompilation(
          WhiteBoxTest.class.getDeclaredMethod("m"), 4);
    }
}

This is from his blog post WhiteBox API, the only blog post I could find on this topic.

Back to our goal of forcing the compilation of a method. It is a good idea to reset the state of a method and deoptimize it to start from a blank slate:

// obtain a method reference
Executable m = X.class.getDeclaredMethod("m");
// obtain a WhiteBox instance
WhiteBox wb = WhiteBox.getWhiteBox();
// deoptimize the method
wb.deoptimizeMethod(m);
// clear its state, found by experimentation to be necessary
wb.clearMethodState(m);

We can then either leave the method uncompiled (for compilation level 0) or enqueue it for compilation:

// level 1 - 3: C1, level 4: C2
wb.enqueueMethodForCompilation(m, level);

But be aware that it takes some time to actually compile the method, so it’s best to wait till it is compiled:

while (wb.getMethodCompilationLevel(m) != level) {
    Thread.sleep(1);
}

We can then also force a method never to be inlined:

wb.testSetDontInlineMethod(m, true);
wb.testSetForceInlineMethod(m, false);

Or inversely to be always inlined:

wb.testSetDontInlineMethod(m, false);
wb.testSetForceInlineMethod(m, true);
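
Putting these steps together, a small helper might look like the following sketch; forceCompilationLevel is my own hypothetical name, not part of the WhiteBox API:

import java.lang.reflect.Executable;
import jdk.test.whitebox.WhiteBox;

public class WhiteBoxForcing {
    static final WhiteBox wb = WhiteBox.getWhiteBox();

    /** force a method to a given level: 0 = interpreted, 1-3 = C1, 4 = C2 */
    static void forceCompilationLevel(Executable m, int level)
            throws InterruptedException {
        // start from a blank slate
        wb.deoptimizeMethod(m);
        wb.clearMethodState(m);
        if (level > 0) {
            wb.enqueueMethodForCompilation(m, level);
            // wait till the background compiler has finished
            while (wb.getMethodCompilationLevel(m) != level) {
                Thread.sleep(1);
            }
        }
    }
}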

I implemented this in the WhiteBoxUtil class in my trace-tester library. This allows us to force all methods into their respective states. But the JVM can still decide to optimize further or inline a method, even when we specify the contrary. So we additionally have to constrain the JVM using the second mechanism, Compiler Control.

Compiler Control

This control mechanism was introduced in Java 9 with JEP 165 by Nils Eliasson:

Summary

This JEP proposes an improved way to control the JVM compilers. It enables runtime manageable, method dependent compiler flags. (Immutable for the duration of a compilation.)

Goals

  • Fine-grained and method-context dependent control of the JVM compilers (C1 and C2)
  • The ability to change the JVM compiler control options in run time
  • No performance degradation

Motivation

Method-context dependent control of the compilation process is a powerful tool for writing small contained JVM compiler tests that can be run without restarting the entire JVM. It is also very useful for creating workarounds for bugs in the JVM compilers. A good encapsulation of the compiler options is also good hygiene.

JEP 165

This mechanism is properly standardized for the OpenJDK, unlike the WhiteBox API. Compiler Control allows you to specify compilation settings by defining them in a JSON file and applying them:

  • Using jcmd (see JEP): jcmd <pid> Compiler.add_directives <file>
  • Passing it via JVM arguments: -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile=<file>
  • Using the WhiteBox API: int addCompilerDirective(String compDirect)

As an example, the following directives specify that the method m should not be C2-compiled and not be inlined:

[
  {
    // can also contain patterns
    "match": ["X::m()"],
    // "-" prefixes not inlined, "+" inlined methods
    "inline": ["-X::m()"],
    "C1": {},
    "C2": {
      "Exclude": true
    }
  }
  // multiple directives supported
  // first directives have priority
]

This, in theory, allows the method to be deoptimized, but this did not happen during my testing. With forced compilation, one can assume that this method will almost always be used in its compiled form.
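
To apply such a file at runtime without restarting the JVM, we can use the WhiteBox method mentioned above; a minimal sketch, assuming the directives are stored in a file called directives.json:

import java.nio.file.Files;
import java.nio.file.Path;
import jdk.test.whitebox.WhiteBox;

public class ApplyDirectives {
    public static void main(String[] args) throws Exception {
        String directives = Files.readString(Path.of("directives.json"));
        // the return value is the number of added directives
        int added = WhiteBox.getWhiteBox().addCompilerDirective(directives);
        System.out.println("added " + added + " directives");
    }
}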

I recommend this Compiler Control guide for a more in-depth guide with all options. An implementation of the control file generation with a fluent API can be found in the trace-tester project in the CompilerDirectives class. Feel free to adapt this for your own projects.

Conclusion

I’ve shown you in this article how to control the JIT to specify the inlining and compilation of methods using two lesser-known JVM APIs. This allows us to write reproducible profiler tests and makes it easier to check how a profiling API reacts to different scenarios.

If you have any suggestions, feel free to reach out. I look forward to preparing slides for my upcoming talks in Milan, Munich, Arnhem, and Karlsruhe. Feel free to come to my talks; more information soon on Twitter.

This project is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Using Async-Profiler and Jattach Programmatically with AP-Loader

Using async-profiler and jattach can be quite a hassle. First, you have to download the proper archive from GitHub for your OS and architecture; then, you have to unpack it and place it somewhere. It gets worse if you want to embed it into your library, agent, or application: Library developers cannot just use a Maven dependency but have to create wrapper code and build scripts that deal with packaging the binaries themselves, or worse, depend on a preinstalled version which they do not control.

In November 2022, I started the ap-loader project to remedy this situation: I wrapped async-profiler and jattach in a platform-independent JAR which can be pulled from maven central. I already wrote a blog post on its essential features: AP-Loader: A new way to use and embed async-profiler.

Flamegraph for a recording of profiling data for the dacapo benchmark suite

In this blog post, I’m focusing on its programmatic usage: Async-profiler can be used in a library to gather profiling data of the current or a different process, but the profiler distribution contains more: It contains converters to convert from JFR to flamegraphs, and jattach to attach a native agent dynamically to (potentially the current) JVM and send commands to it.

This blog post does assume that you’re familiar with the basic usage of async-profiler. If you are not, consider reading the async-profiler README or the Async-profiler – manual by use cases by Krzysztof Ślusarski.

The ap-loader library allows you to depend on a specific version of async-profiler using gradle or maven:

<dependency>
    <groupId>me.bechberger</groupId>
    <artifactId>ap-loader-all</artifactId>
    <version>2.9-5</version>
</dependency>

There are multiple Maven artifacts: ap-loader-all, which contains the native libraries for all platforms for which async-profiler has pre-built libraries, and artifacts that only support a single platform, like ap-loader-macos. I recommend using ap-loader-all if you don’t know what you’re doing; the current release is still tiny, at 825KB.

The version number consists of the async-profiler version (here 2.9) and the version of the ap-loader support libraries (here 5). I’m typically only publishing the newest ap-loader version for the latest async-profiler. The changes in ap-loader are relatively minimal, and I keep the API stable between versions.

The ap-loader library consists of multiple parts:

  • AsyncProfilerLoader class: Wraps async-profiler and jattach, adding a few helper methods
  • converter package: Contains all classes from the async-profiler converter JAR and helps to convert between multiple formats
  • AsyncProfiler class: API for async-profiler itself, wrapping the native library.

All but the AsyncProfilerLoader class are just copied from the underlying async-profiler release. ap-loader contains all Java classes from async-profiler, but I omit the helper classes here for brevity.

AsyncProfilerLoader

This is the main entry point to ap-loader; it lives in the one.profiler package like the AsyncProfiler class. Probably the most essential method is load:

Load

The load method loads the included async-profiler library for the current platform:

AsyncProfiler profiler = AsyncProfilerLoader.load();

It returns the instantiated API wrapper class. The method throws an IllegalStateException if the present ap-loader dependencies do not support the platform and an IOException if loading the library resulted in other problems.

Newer versions of the AsyncProfiler API contain the AsyncProfiler#getInstance() method, which can also load an included library. The main difference is that you then have to include the native library for all the different platforms yourself, replicating all the work of the ap-loader build system every time you update async-profiler.

Dealing with multiple platforms is hard, and throwing an exception when a platform is not supported might be inconvenient for your use case. AsyncProfilerLoader has the loadOrNull method, which returns null instead, and also the isSupported method to check whether the current combination of OS and CPU is supported. A typical use case could be:

if (AsyncProfilerLoader.isSupported()) {
  AsyncProfilerLoader.load().start(...);
} else {
  // use JFR or other fall-backs
}

This might still throw IOExceptions, but they should never happen in normal circumstances and are probably caused by problems that should be investigated, being either an error in ap-loader or in your application.

If you merely want to get the path to the extracted libAsyncProfiler, then use the getAsyncProfilerPath method, which throws the same exceptions as the load method. A similar method exists for jattach (getJattachPath).
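
Putting the loading API together, a typical profiling session could look like the following sketch; the event, interval, and file name are just examples, and the command strings follow the usual async-profiler option syntax:

import one.profiler.AsyncProfiler;
import one.profiler.AsyncProfilerLoader;

public class ProfilingSession {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfilerLoader.load();
        // start wall-clock profiling with a 1ms interval
        profiler.execute("start,event=wall,interval=1ms");
        runWorkload(); // placeholder for the code to profile
        // stop and write a flame graph (format inferred from the extension)
        profiler.execute("stop,file=flame.html");
    }

    static void runWorkload() { /* ... */ }
}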

Execute Profiler

The async-profiler project contains the profiler.sh script (which will be replaced by asprof starting with async-profiler 2.10):

To run the agent and pass commands to it, the helper script profiler.sh is provided. A typical workflow would be to launch your Java application, attach the agent and start profiling, exercise your performance scenario, and then stop profiling. The agent’s output, including the profiling results, will be displayed in the Java application’s standard output.

Async-Profiler documentation

This helper script is also included in ap-loader and allows you to use the script on the command line via java -jar ap-loader profiler ...; the API exposes this functionality via ExecutionResult executeProfiler(String... args).

AsyncProfilerLoader.executeProfiler("-e", "wall", "8983")
// is equivalent to
./profiler.sh -e wall 8983

The executeProfiler method throws an IllegalStateException if the current platform is not supported. The returned instance of ExecutionResult contains the standard and error output:

public static class ExecutionResult {
  private final String stdout;
  private final String stderr;
  // getter and constructor
  ...
}

executeProfiler throws an IOException if the profiler execution failed.

Execute Converter

You can not only use the converter via the classes from the one.profiler.converter package, but also execute it by calling ExecutionResult executeConverter(String... args), e.g., the following:

AsyncProfilerLoader.executeConverter(
  "jfr2flame", "<input.jfr>", "<output.html>")
// is equivalent to
java -cp converter.jar \
  jfr2flame <input.jfr> <output.html>

The executeConverter returns the output of the conversion tool on success and throws an IOException on error, as before.

JAttach

There are multiple ways to use the embedded jattach besides using the binary returned by getJattachPath: ExecutionResult executeJattach(String... args) and boolean jattach(Path agentPath[, String arguments]).

executeJattach works similarly to executeProfiler, e.g.:

AsyncProfilerLoader.executeJattach(
  "<pid>", "load", "instrument", "false", "javaagent.jar=arguments")
// is equivalent to
jattach <pid> load instrument false "javaagent.jar=arguments"

This runs the same as jattach, with the only exception that every string that ends with libasyncProfiler.so is mapped to the extracted async-profiler library for the load command. One can, therefore, for example, start async-profiler on a different JVM via the following:

AsyncProfilerLoader.executeJattach(
  PID, "load", "libasyncProfiler.so", "true", "start")

But this use case can, of course, be accomplished by using the executeProfiler method, which internally uses jattach.

A great use case for jattach is to attach a custom native agent to the currently running JVM. Starting with JDK 9, doing this via VirtualMachine#attach throws an IOException if you try it without setting -Djdk.attach.allowAttachSelf=true. The boolean jattach(Path agentPath[, String arguments]) methods simplify this, constructing the command line arguments for you and returning true if jattach succeeded, e.g.:

AsyncProfilerLoader.jattach("libjni.so")

This attaches the libjni.so agent to the current JVM. The process id of this JVM can be obtained by using the getProcessId method.

Extracting a Native Library

I happen to write many small projects for testing profilers that often require loading a native library from the resources folder; an example can be found in the trace_validation (blog post) project:

private static Path nativeLibPath; // cached path, set on first extraction

/**
 * extract the native library and return its temporary path
 */
public static synchronized Path getNativeLibPath(
 ClassLoader loader) {
  if (nativeLibPath == null) {
    try {
      String filename = System.mapLibraryName(NATIVE_LIB);
      InputStream in = loader.getResourceAsStream(filename);
      // ...
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
  return nativeLibPath;
}

I, therefore, added the extractCustomLibraryFromResources method:

/**
 * Extracts a custom native library from the resources and
 * returns the alternative source if the file is not
 * in the resources.
 *
 * If the file is extracted, then it is copied to
 * a new temporary folder which is deleted upon JVM exit.
 *
 * This method is mainly seen as a helper method
 * to obtain custom native agents for #jattach(Path) and
 * #jattach(Path, String). It is included in ap-loader
 * to make it easier to write applications that need
 * custom native libraries.
 *
 * This method works on all architectures.
 *
 * @param classLoader the class loader to load
 *                    the resources from
 * @param fileName the name of the file to copy,
 *                 maps the library name if the fileName
 *                 does not start with "lib", e.g. "jni"
 *                 will be treated as "libjni.so" on Linux
 *                 and as "libjni.dylib" on macOS
 * @param alternativeSource the optional resource directory
 *                 to use if the resource is not found in
 *                 the resources, this is typically the case
 *                 when running the application from an IDE,
 *                 an example would be "src/main/resources"
 *                 or "target/classes" for maven projects
 * @return the path of the library
 * @throws IOException if the extraction fails and
 *                     the alternative source is not present
 *                     for the current architecture
 */
public static Path extractCustomLibraryFromResources(
    ClassLoader classLoader, String fileName,
    Path alternativeSource) throws IOException

This can be used effectively together with jattach to attach a native agent from the resources to the current JVM:

// extract the agent first from the resources
Path p = one.profiler.AsyncProfilerLoader.
  extractCustomLibraryFromResources(
    ....getClassLoader(), "library name");
// attach the agent to the current JVM
one.profiler.AsyncProfilerLoader.jattach(p, "optional arguments");
// -> returns true if jattach succeeded

This use-case comes from a profiler test helper library on which I hope to write a blog post in the near future.

Conclusion

ap-loader makes it easy to use async-profiler and its included tools programmatically without creating complex build systems. The project is regularly updated to keep pace with the newest stable async-profiler version; updating a version just requires changing a single dependency in your dependencies list.

The ap-loader is mature, so try it and tell me about it. I’m happy to help with any issues you have with this library, so feel free to write to me or create an issue on GitHub.

This project is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Couldn’t we just Use AsyncGetCallTrace in a Separate Thread?

I’m keenly interested in everything related to profiling on the JVM, especially if it is related to AsyncGetCallTrace, this tiny unofficial API that powers most profilers out there, heck I’m even in the process of adding an improved version to the OpenJDK, AsyncGetStackTrace.

During the discussions on the related JDK enhancement proposal and PRs fixing AsyncGetCallTrace bugs, one thing often arises: Why is AsyncGetCallTrace always called in the signal handler on top of the stack that we want to walk (like in my Writing a Profiler from Scratch series)?

Interaction between the wall-clock sampler thread and the different signal handlers, as currently implemented in async-profiler.

JDK Flight Recorder (JFR) does not do this; it instead walks the stack in the sampler thread while pausing the sampled thread (implemented with a SuspendedThreadTask).

Interaction between the sampler thread and the signal handlers, as currently implemented in JFR.

Update after talks on the JEP: The recommended way to use AsyncGetStackTrace will be to call it in a separate thread.

Advantages

Walking a thread’s stack in a separate sampler thread has multiple advantages: Only a few instructions run in the signal handler: the handler is either just busy waiting for the stack walking to finish, or the thread is stopped entirely. Most of the code runs in the sampler thread, walking one thread after another. This makes the code easier to debug and reason about, and the stack-walking code is less likely to mess up the stack of the sampled thread when something goes terribly wrong. These are part of the reasons why the JFR code silently ignores segmentation faults during stack walking:

One important difference to consider is that in JFR, in contrast to AGCT, there is only a single thread, the ThreadSampler thread, that is wrapped in the CrashProtection. Stack walking is different in JFR compared to AGCT, in that it is done by a different thread, during a point where the target is suspended. Originally, this thread sampler thread was not even part of the VM, although now it is a NonJavaThread. It has been trimmed to not involve malloc(), raii, and other hard-to-recover-from constructs, from the moment it has another thread suspended. Over the years, some transitive malloc() calls has snuck in, but it was eventually found due to rare deadlocking. Thomas brings a good point about crashes needing to be recoverable.

Markus Grönlund in a comment on OpenJDK PR 8225

I digress here from the main topic of this article, but I think that the next comment of Markus Grönlund on the PR is interesting because it shows how pressures from the outside can lead to band-aid fixes that are never removed:

For additional context, I should add that the CrashProtection mechanism was mainly put in place as a result of having to deliver JFR from JRockit into Hotspot under a deadline, upholding feature-parity. The stack walking code was in really bad shape back then. Over the years, it has been hardened and improved much, and I have not seen any reported issues about JFR crashes in many years (we log when crashing in production).

An important difference is that AGCT allows more thread states compared to JFR, so there can be issues in that area that are not seen in JFR.

Markus Grönlund in a comment on OpenJDK PR 8225

Back to the main topic: It is important to note that even when we walk a thread in a separate thread, we still have to make sure that we only use signal-safe methods while the sampled thread is waiting (thanks to Lukas Werling for pointing this out). The sampled thread might, for example, hold locks for malloc, so our sampler thread cannot use malloc without risking a deadlock.

Disadvantages

There are, of course, disadvantages: Sampling in a signal handler is more straightforward, as we’re running in the context of the sampled thread and get passed the ucontext (with stack pointer, …) directly. It is more accurate, as we can trigger the sampling of the threads precisely at the time that we want (disregarding thread scheduling), and faster, as we do not busy wait in any thread.

We’re running on the same CPU core, which benefits caching, especially on NUMA CPUs (thanks to Francesco Nigro for pointing this out). Although the performance is rarely an issue with the stack-walking as its runtime is in the tens of microseconds, even if we include the whole signal processing.

Another major disadvantage is related to CPU time and perf-event-related profiling: The commonly used itimer (it has major problems, according to Felix Geisendörfer) and perf APIs send signals to threads in certain intervals. When we walk the stack in a separate thread, the triggered signal handlers must trigger the sampler thread to sample the specific thread.

This can be implemented by pushing the current thread id into a queue from which the sampler thread takes it, stopping the sampled thread when it’s ready and walking the stack as before; or by waiting in the signal handler until the sampler thread has finished walking the stack. The former is less performant because it sends an additional signal, and the latter’s overhead is only insignificant if the walk requests of all threads are evenly distributed.

This problem can be lessened when we choose a different way of accessing the perf data: We can read the perf events in a loop and then just use the technique from wall-clock profiling. This is a significant modification of the inner workings of the profiler, and it is not possible with itimer-based profiling.

What is the real reason?

Walking in a separate thread has more advantages than disadvantages, especially when doing wall-clock profiling or when valuing stability over slight performance gains. So why don’t tools like async-profiler implement their sampling this way? It’s because AsyncGetCallTrace currently doesn’t support it. This is the starting point of my small experiment: Could I modify the OpenJDK with just a few changes to add support for out-of-thread walking with AsyncGetCallTrace (subsequently proposing this for AsyncGetStackTrace too)?

Modifying AsyncGetCallTrace

Let us first take a look at the API to refresh our knowledge:

void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
                       void* ucontext)
// Arguments:
//
//   trace    - trace data structure to be filled by the VM.
//   depth    - depth of the call stack trace.
//   ucontext - ucontext_t of the LWP
//
// ASGCT_CallTrace:
//   typedef struct {
//       JNIEnv *env_id;
//       jint num_frames;
//       ASGCT_CallFrame *frames;
//   } ASGCT_CallTrace;
//
// Fields:
//   env_id     - ID of thread which executed this trace.
//   num_frames - number of frames in the trace.
//                (< 0 indicates the frame is not walkable).
//   frames     - the ASGCT_CallFrames that make up this trace. 
//                Callee followed by callers.
//
//  ASGCT_CallFrame:
//    typedef struct {
//        jint lineno;
//        jmethodID method_id;
//    } ASGCT_CallFrame;

If you’re new to AsyncGetCallTrace (and my blog), consider reading my Writing a Profiler from Scratch: Introduction article.

So we already pass an identifier of the current thread (env_id) to the API, which should point to the walked thread:

// This is safe now as the thread has not terminated
// and so no VM exit check occurs.
assert(thread ==
         JavaThread::thread_from_jni_environment(trace->env_id),
       "AsyncGetCallTrace must be called by "
       "the current interrupted thread");

This is the only usage of the passed thread identifier, and why I considered removing it in AsyncGetStackTrace altogether. AsyncGetCallTrace uses the current thread instead:

Thread* raw_thread = Thread::current_or_null_safe();

The assertion above is only enabled in debug builds of the OpenJDK, which are rarely profiled. Therefore, the thread identifier is often ignored and is probably a historic relic. We can use this identifier to obtain the thread that the API user wants to profile and only use the current thread when the thread identifier is null (source):

Thread* raw_thread;
if (trace->env_id == nullptr) {
  raw_thread = Thread::current_or_null_safe();
} else {
  raw_thread = 
    JavaThread::thread_from_jni_environment_raw(trace->env_id);
}

We can thereby support the new feature without modifying the API itself, only changing the behavior if the thread identifier does not reference the current thread.

The implementation can be found in my OpenJDK fork. This is still a prototype, but it works well enough for testing and benchmarking.

Modifying async-profiler

At the beginning of the article, I already told you how JFR walks the stack in a different thread. We now implement similar code in async-profiler, restricting ourselves to wall-clock profiling, as its implementation requires fewer modifications.

Before our changes, async-profiler would signal selected threads in a loop via

OS::sendSignalToThread(thread_id, SIGVTALRM)

(source) and record the sample directly in the signal handler (source):

void WallClock::signalHandler(
  int signo, 
  siginfo_t* siginfo, 
  void* ucontext) {
    ExecutionEvent event;
    event._thread_state = _sample_idle_threads ? 
      getThreadState(ucontext) : THREAD_UNKNOWN;
    Profiler::instance()->recordSample(ucontext, _interval, 
                                       EXECUTION_SAMPLE, &event);
}

The Profiler::recordSample method does more than just call AsyncGetCallTrace; it also obtains C/C++ frames. However, this is insignificant for our modifications, as the additional stack walking is only related to the ucontext, not the thread.

We now modify this code so that we still send a signal to the sampled thread, but only set a global ucontext and thread identifier (struct Data) in the signal handler, blocking the sampled thread until the stack walking is finished; the stack walking itself now happens in the sampler thread (source):

struct Data {
    void* ucontext;
    JNIEnv* jni;
};

std::atomic<int> _thread_id;
std::atomic<Data*> _thread_data;

bool WallClock::walkStack(int thread_id) {
    // set the current thread
    _thread_id = thread_id;
    _thread_data = nullptr;

    // send the signal to the sampled thread
    if (!OS::sendSignalToThread(thread_id, SIGVTALRM)) {
        _thread_id = -1;
        return false;
    }
    // wait till the signal handler has set the ucontext and jni
    if (!waitWhile([&](){ return _thread_data == nullptr;}, 
                   10 * 1000 * 1000)) {
        _thread_id = -1;
        return false;
    }
    Data *data = _thread_data.load();
    // walk the stack
    ExecutionEvent event;
    event._thread_state = _sample_idle_threads ?
      getThreadState(data->ucontext) : THREAD_UNKNOWN;
    u64 ret = Profiler::instance()->recordSample(data->ucontext,
      _interval, EXECUTION_SAMPLE, &event, data->jni);

    // reset the thread_data, triggering the signal handler
    _thread_data = nullptr;
    return ret != 0;
}

void WallClock::signalHandler(
  int signo,
  siginfo_t* siginfo,
  void* ucontext) {
    // check that we are in the thread we are supposed to be
    if (OS::threadId() != _thread_id) {
        return;
    }
    
    Data data{
       ucontext,
       // Get a JNIEnv if it is deemed to be safe
       VMThread::current() == nullptr ? nullptr : VM::jni()
    };

    Data* expected = nullptr;
    if (!_thread_data.compare_exchange_strong(expected, &data)) {
        // another signal handler invocation 
        // is already in progress
        return;
    }
    // wait for the stack to be walked, and block the thread 
    // from executing
    // we do not timeout here, as this leads to difficult bugs
    waitWhile([&](){ return _thread_data != nullptr;});
}

The signal handler only stores the ucontext and thread identifier if it is run in the thread currently walked and uses compare_exchange_strong to ensure that the _thread_data is only set once. This prevents stalled signal handlers from concurrently modifying the global variables.

_thread_data.compare_exchange_strong(expected, &data) is equivalent to atomically executing:

if (_thread_data == expected) {
    _thread_data = &data;
    return true;
} else {
    expected = _thread_data;
    return false;
}

This ensures that the _thread_data is only set if it is null. Such operations are the base of many lock-free data structures; you can find more on this topic in the Wikipedia article on Compare-and-Swap (a synonym for compare-and-exchange).

Coming back to the signal handler implementation: The waitWhile method is a helper that busy-waits until the passed predicate returns false or the optional timeout is exhausted, ensuring that the profiler does not hang if something goes wrong.

The implementation uses the _thread_data variable to implement its synchronization protocol:

Interaction between the sampler thread and the signal handler.

You can find the implementation in my async-profiler fork, but as with my OpenJDK fork: It’s only a rough implementation.

The implemented approach works fine with async-profiler, but it has a minor flaw: We depend on an implementation detail of the current iteration of OpenJDK. It is only safe to get the JNIEnv in a signal handler if the JVM has allocated a thread-local Thread object for the signaled thread:

JDK-8132510: it’s not safe to call GetEnv() inside a signal handler since JDK 9, so we do it only for threads already registered in ThreadLocalStorage

async-profiler source code

This issue was first discovered when Krzysztof Ślusarski (of “Async-Profiler – manual by use cases” fame) reported a related issue in the async-profiler bug tracker.

For a deeper dive, consider reading the comment of David Holmes on the referenced JDK issue:

The code underpinning __thread use is not async-signal-safe, which is not really a surprise as pthread_get/setspecific are not designated async-signal-safe either.

The problem, in glibc, is that first access of a TLS variable can trigger allocation [1]. This contrasts with using pthread_getspecific which is benign and so effectively async-signal-safe.

So if a thread is executing in malloc and it takes a signal, and the signal handler tries to use TLS (it shouldn’t but it does and has gotten away with it with pthread_getspecific), then we can crash or get a deadlock.

Excerpt from David Holmes’ comment on issue JDK-8132510

We check this condition in our signal handler implementation with the line

VMThread::current() == nullptr ? nullptr : VM::jni()

with VMThread::current() being implemented as:

VMThread* VMThread::current() {
    return (VMThread*)pthread_getspecific(
      (pthread_key_t)_tls_index /* -1 */);
}

This implementation detail is not an issue for async-profiler, as it may make such assumptions about JVM internals. Still, it is undoubtedly a problem for the general approach I want to propose for my new AsyncGetStackTrace API.

Modifying AsyncGetCallTrace (2nd approach)

We want to identify the thread using something different from JNIEnv. The OS thread id seems to be a good fit. It has three significant advantages:

  • It can be obtained independently from the JVM, depending on the OS rather than the JVM.
  • Our walkStack method already gets passed the thread id, so we don’t have to pass it from the signal handler to the sampler thread.
  • The mapping from thread id to Thread happens outside the signal handler in the AsyncGetCallTrace call, and the API sets the env_id field to the appropriate JNIEnv.

We have to add a new parameter os_thread_id to the API to facilitate this change (source):

// ...
//   os_thread_id - OS thread id of the thread which executed 
//                  this trace, or -1 if the current thread 
//                  should be used.
// ...
// Fields:
//   env_id     - ID of thread which executed this trace, 
//                the API sets this field if it is NULL.
// ...
void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
  void* ucontext, jlong os_thread_id)

The implementation can be found in my OpenJDK fork, but be aware that it is not yet optimized for performance as it iterates over the whole thread list for every call to find the Thread which matches the passed OS thread id.

Modifying async-profiler (2nd approach)

The modification to async-profiler is quite similar to the first approach. The only difference is that we’re not dealing with JNIEnv anymore. This makes the signal handler implementation slightly simpler (source):

void WallClock::signalHandler(
  int signo, 
  siginfo_t* siginfo, 
  void* ucontext) {
    // check that we are in the thread we are supposed to be
    if (OS::threadId() != _thread_id) {
        return;
    }
    void* expected = nullptr;
    if (!_ucontext.compare_exchange_strong(expected, ucontext)) {
        // another signal handler invocation 
        // is already in progress
        return;
    }
    // wait for the stack to be walked, and block the thread 
    // from executing
    // we do not timeout here, as this leads to difficult bugs
    waitWhile([&](){ return _ucontext != nullptr;});
}

You can find the full implementation in my async-profiler fork.

Now to the fun part (the experiment): Two drawbacks of the two previously discussed approaches are that one thread waits busily, and the other cannot execute all non-signal-safe code during that period. So the obvious next question is:

Could we walk a thread without stopping it?

In other words: Could we omit the busy waiting? An unnamed person suggested this.

The short answer is: It’s a terrible idea. The sampled thread modifies the stack while we’re walking its stack. It might even terminate while we’re in the middle of its previously valid stack. So this is a terrible idea when you don’t take many precautions.

The only advantage is that we can use non-signal-safe methods during stack walking. The performance of the profiling will not be significantly improved, as the signal sending and handling overhead is a magnitude larger than the stack walking itself for small traces. Performance-wise, it could only make sense for huge (1000 and more frames) traces.

Our central assumption is: The profiler takes some time to transition out of the signal handler of the sampled thread, possibly longer than AsyncGetCallTrace takes to walk the topmost frames, which are the most likely to change during execution.

But: Timing with signals is hard to predict (see this answer on StackExchange), and if the assumption fails, the resulting trace is either bogus or the stack walking leads to “interesting” segmentation faults. I accidentally tested this when I initially implemented the signal handler in my async-profiler and made an error. I saw error messages in places that I had not seen before.

So the results could be imprecise / sometimes incorrect. But we’re already sampling, so approximations are good enough.

The JVM might crash during the stack walking because the ucontext might be invalid and the thread stack changes (so that the stack pointer in the ucontext points to an invalid value and more), but we should be able to reduce the crashes by using enough precautions in AsyncGetCallTrace and testing it properly (I already implemented tests with random ucontexts in the draft for AsyncGetStackTrace).

The other option is to catch any segmentation faults that occur inside AsyncGetCallTrace. We can do this because we walk the stack in a separate thread (and JFR does it as well, as I’ve written at the beginning of this post). We can implement this by leveraging the ThreadCrashProtection class, which has, quite rightfully, some disclaimers:

/*
 * Crash protection for the JfrSampler thread. Wrap the callback
 * with a sigsetjmp and in case of a SIGSEGV/SIGBUS we siglongjmp
 * back.
 * To be able to use this - don't take locks, don't rely on 
 * destructors, don't make OS library calls, don't allocate 
 * memory, don't print, don't call code that could leave
 * the heap / memory in an inconsistent state, or anything 
 * else where we are not in control if we suddenly jump out.
 */
class ThreadCrashProtection : public StackObj {
public:
  // ...
  bool call(CrashProtectionCallback& cb);
  // ...
};

We wrap the call to the actual AsyncGetCallTrace implementation of our second approach in this handler (source):

void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
 void* ucontext, jlong os_thread_id) {
  trace->num_frames = ticks_unknown_state;
  AsyncGetCallTraceCallBack cb(trace, depth, ucontext, 
                               os_thread_id);
  ThreadCrashProtection crash_protection;
  if (!crash_protection.call(cb)) {
    fprintf(stderr, "AsyncGetCallTrace: caught crash\n");
    if (trace->num_frames >= 0) {
      trace->num_frames = ticks_unknown_state;
    }
  }
}

This prevents all crashes related to walking the stack from crashing the JVM, which is also helpful for the AsyncGetCallTrace usage of the previous part of this article. The only difference is that crashes in the stack walking are considered a bug in a normal use case but are expected in this use case where we don’t stop the sampled thread.

Back to this peculiar case: The implementation in async-profiler is slightly more complex than just removing the busy waiting at the end. First, we must copy the ucontext in the signal handler because the ucontext pointer only points to a valid ucontext while the thread is stopped. Furthermore, we have to disable the native stack walking in the async-profiler, as it isn’t wrapped in code that catches crashes. We also have, for unknown reasons, to set the safemode option of async-profiler to 0.

The implementation of the signal handler is simple (just remove the wait from the previous version). It results in the following sequence diagram:

Interaction between the sampler thread and the signal handlers when not blocking the sampled thread during the stack walking.

You can find the implementation on GitHub, albeit with known concurrency problems, but these are out-of-scope for this blog post and related to copying the ucontext atomically.

And now to the important question: How often did AsyncGetCallTrace crash? In the renaissance finagle-http benchmark (with a sampling interval of 10ms), it crashed in 592 of around 808000 calls, a crash rate of 0.07% and far better than expected.

The main problem can be seen when we look at the flame graphs (set the environment variable SKIP_WAIT to enable the modification):

It looks not too dissimilar to the flame graph with busy waiting:

Many traces (the left part of the graph) are broken and do not appear in the second flame graph. Many of these traces seem to be aborted:

But this was an interesting experiment, and the implementation seems to be possible, albeit creating a safe and accurate profiler would be hard and probably not worthwhile: catching the segmentation faults seems to be quite expensive. The runtime for the renaissance finagle-http benchmark is 83 seconds for the version with busy waiting and 84 seconds without, despite the latter producing worse results.

Evaluation

We can now compare the performance of the original implementation with the two prototypes and the experimental implementation in a preliminary evaluation. I like using the benchmarks of the renaissance suite (version 0.14.2). For this example, I used the primarily single-core dotty benchmark with an interval of 1ms and 10ms:

java -agentpath:./build/lib/libasyncProfiler.so=start,\
interval=INTERVAL,event=wall,flamegraph,file=flame.html \
     -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints \
     -jar renaissance.jar BENCHMARK

The shorter interval makes the performance impact of changes to the profiling more pronounced. I’m profiling with my Threadripper 3995WX on Ubuntu using hyperfine (one warm-up run and ten measured runs each). The standard deviation is less than 0.4% in the following diagram, which shows the wall-clock time:

The number of obtained samples is roughly the same over all profiler runs, except for the experimental implementation, which produces around 12% fewer samples. All approaches seem to have a comparable overhead when considering wall-clock time. It’s different when considering the user time:

This shows that there is a significant user-time performance penalty when not using the original approach. This is expected, as we’re engaging two threads instead of one during the sampling of a specific thread.

The wall-clock timings might therefore be affected by my CPU having enough cores so that the sampler and all other threads run fully concurrently.

I tried to evaluate all approaches with a benchmark that utilizes all CPU cores (finagle-http), but my two new approaches apparently have severe shortcomings, as they produced only around a quarter of the samples compared to the original async-profiler and OpenJDK combination. This is worth fixing, but out of scope for this blog post, which already took more than a week to write.

Conclusion

This was the serious part of the experiment: Using AsyncGetCallTrace in a separate thread is possible with minor modifications and offers many advantages (as discussed before). It especially provides a more secure approach to profiling while not affecting performance if your system is not yet saturated: a typical trade-off between security and performance. I think it should be up to the experienced performance engineer to decide, and profilers should offer both options when my JEP eventually makes out-of-thread walking available on stock OpenJDKs.

The implementations in both the OpenJDK and async-profiler also show how to quickly implement, test and evaluate different approaches with widely used benchmarks.

Conclusion

The initial question, “Couldn’t we just use AsyncGetCallTrace in a separate thread?” can be answered with a resounding “Yes!”. Sampling in separate threads has advantages, but we have to block the sampled thread during stack walking; omitting this leads to broken traces.

If you have anything to add or found a flaw in my approaches or my implementations, or any other suggestions, please let me know 🙂

I hope this article gave you a glimpse into my current work and the future of low-level Java profiling APIs.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Do you trust profilers? I once did too

Profilers are great tools in your toolbox, like debuggers, when solving problems with your Java application (I’ve been on a podcast on this topic recently). I’ll tell you some of their problems and a technique to cope with them in this blog post.

There are many open-source profilers, most notably JFR/JMC and async-profiler, that help you to find and fix performance problems. But they are just software themselves, interwoven with a reasonably large project, the OpenJDK (or OpenJ9, for that matter), and thus suffer from the same typical problems as the applications they are used to profile:

  • Tests could be better
  • Performance and accuracy could be better
  • Tests could be more plentiful, especially for the underlying APIs, which are hardly tested at all
  • Changes in seemingly unrelated parts of the enclosing project can adversely affect them

Therefore, you should take the profiles generated by profilers with a grain of salt. There are several blog posts and talks covering the accuracy problems of profilers. If you want to know more about how the foundational AsyncGetCallTrace is used in profilers, I would highly recommend reading my Writing a Profiler from Scratch series.

A sample AsyncGetCallTrace bug

A problem that has been less discussed is the lacking test coverage of the underlying APIs. The AsyncGetCallTrace API, used by async-profiler and others, has just one test case in the OpenJDK (as I discussed before). This test case can be boiled down to the following:

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class Main {

    static { /** load native library */ }

    public static void main(String[] args) throws Exception {
        Class<?> klass = Main.class;
        Method mainMethod = klass.getMethod("test");
        mainMethod.invoke(null);
    }

    public static void test() {
        if (!checkAsyncGetCallTraceCall()) {
            throw new RuntimeException("AsyncGetCallTrace returned a wrong trace");
        }
    }

    public static native boolean checkAsyncGetCallTraceCall();
}

This is the simplest test case that can be written in the OpenJDK JTREG test framework for OpenJDK. The problem with this test case? The implementation of checkAsyncGetCallTraceCall only checks for the topmost frame. To test AsyncGetCallTrace correctly here, one should compare the trace returned by this call with the trace of an oracle. We can use GetStackTrace (the safepoint-biased predecessor of ASGCT) here as it seems to return the correct trace.

GetStackTrace returns something like the following:

Frame 0: Main.checkAsyncGetStackTraceCall
Frame 1: Main.test
Frame 2: java.lang.invoke.LambdaForm$DMH.[...].invokeStatic
Frame 3: java.lang.invoke.LambdaForm$MH.[...].invoke
Frame 4: java.lang.invoke.Invokers$Holder.invokeExact_MT
Frame 5: jdk.internal.reflect.DirectMethodHandleAccessor
                             .invokeImpl
Frame 6: jdk.internal.reflect.DirectMethodHandleAccessor.invoke
Frame 7: java.lang.reflect.Method.invoke
Frame 8: Main.main

AsyncGetCallTrace, on the other hand, had problems walking over some of the reflection internals and returned:

Frame 0: Main.checkAsyncGetStackTraceCall
Frame 1: Main.test
Frame 2: java.lang.invoke.LambdaForm$DMH.[...].invokeStatic

This problem can be observed with a modified test case with JFR and async-profiler too:

import java.lang.reflect.Method;

public class Main {

    public static void main(String[] args) throws Exception {
        Class<?> klass = Main.class;
        Method mainMethod = klass.getMethod("test");
        mainMethod.invoke(null);
    }

    public static void test() {
        javaLoop();
    }

    public static void javaLoop() {
        long start = System.currentTimeMillis();
        while (start + 3000 > System.currentTimeMillis());
    }
}

The expected flame graph is on the left (obtained after fixing the bug), and the actual flame graph is on the right.

So the only test case on AsyncGetCallTrace in the OpenJDK did not properly test the whole trace. This was not a problem when the test case was written. One can expect that its author checked the entire stack trace manually once and then created a small test case that only checks the first frame, which is not implementation-specific. But this is a problem for regression testing:

The implementation of JEP 416: Reimplement Core Reflection with Method Handles in JDK 18+23 in mid-2021 modified the inner workings of reflection and triggered this bug. The lack of proper regression tests meant the bug was only discovered a week ago. The actual cause of the bug is more complicated and related to a broken invariant regarding stack pointers in the stack walking. You can read more on this in the comments by Jorn Vernee and Richard Reingruber on my PR.

My PR improves the test by checking the result of AsyncGetCallTrace against GetStackTrace, as explained before, and fixing the bug by slightly loosening the invariant.
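
Conceptually, the improved check boils down to comparing the two traces frame by frame. A minimal sketch in Java, assuming both traces have already been converted into lists of method names (the actual comparison lives in the native test code):

import java.util.List;

public class TraceOracle {
    // fail if the AsyncGetCallTrace trace deviates from
    // the GetStackTrace oracle
    static void assertTracesMatch(List<String> asgctTrace,
                                  List<String> oracleTrace) {
        if (!asgctTrace.equals(oracleTrace)) {
            throw new AssertionError("AsyncGetCallTrace returned "
                + asgctTrace + " but GetStackTrace returned " + oracleTrace);
        }
    }
}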

My main issue with this bug is that it shows how the lack of test coverage for the underlying profiling APIs might cause problems even when profiling simple Java code. I only found the bug because I’m writing many tests for my new AsyncGetStackTrace API. It’s hard work, but I’m convinced this is the only way to create a reliable foundation for profilers.

Profilers in a loop

Profilers have many problems but are still helpful if you know what they can and cannot do. They should be used with care, without trusting everything they tell you. Profilers are only as good as the person interpreting the profiler results and the person’s technique.

I have a background in computer science, and every semester I give the students of a paper-writing lab an hour-long lecture on doing experiments. I started this a few years back and continue to do it pro bono because it is an important skill to teach. One of the most important things I teach the students is that doing experiments is essentially a loop:

You start with an abstract model of the experiment and its environment (like the tool or algorithm you’re testing). Then you formulate a hypothesis in this model (e.g., “Algorithm X is faster than Y because of Z”). You might find problems in your model during this step and go back to modeling, or you move on and start evaluating, checking whether the hypothesis holds. During this evaluation, you might find problems with your hypothesis (e.g., it isn’t valid) or even with your model, and go back to the respective step. Besides problems, you usually find new information that lets you refine your model and hypothesis. Evaluating without a mental model or a hypothesis makes it impossible to interpret the evaluation results correctly. But remember that a mismatch between hypothesis and evaluation might also be due to a broken evaluation.

The same loop can be applied to profiling: Before investigating any issue with a program, you should acquire at least a rough mental model of the code. This means understanding the basic architecture, performance-critical components, and the issues of the underlying libraries. Then you formulate a hypothesis based on the problem you’re investigating embedded in your mental model (e.g., “Task X is slow because Y is probably slow …”). You can then evaluate the hypothesis using actual tests and a profiler. But as before, remember that your evaluation might also contain bugs. You can only discover these with a mental model and a reasonably refined hypothesis.

This technique lets you use profilers without fearing that spurious errors will lead you to wrong conclusions.

I hope you found this article helpful and educational. It is an ongoing effort to add proper tests and educate users of profilers. See you in the next post when I cover the next step in writing a profiler from scratch.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

JFR Timestamps and System.nanoTime

Did you ever wonder whether JFR timestamps use the same time source as System.nanoTime? This is important when you have miscellaneous logging besides JFR events; otherwise, you could not match JFR events and your logging properly. We assume here that you use System.nanoTime and not less-suited timing information from System.currentTimeMillis.

The journey into this started with a question on the JDK Mission Control slack channel, which led me into a rabbit hole:

Could I have a question regarding JFR timestamps? (working with Linux) Is there any difference between JFR timestamp implementation and System#nanoTime (any optimization)?

Petr Bouda

This question essentially boils down to comparing both methods’ OS time sources. We’re only considering Unix systems in the following.

Source of JFR timestamps

The JFR event time stamps are set in the JFR event constructor, which is defined in jfrEvent.hpp (and not in the Java code, as one might expect):

  JfrEvent(EventStartTime timing=TIMED) : 
    _start_time(0), _end_time(0),
    _untimed(timing == UNTIMED),
    _should_commit(false), _evaluated(false)
#ifdef ASSERT
  , _verifier()
#endif
  {
    if (!T::isInstant && !_untimed && is_enabled()) {
      set_starttime(JfrTicks::now());
    }
  }

Looking further reveals that JfrTicks calls FastUnorderedElapsedCounterSource, which uses two different time sources:

FastUnorderedElapsedCounterSource::Type 
 FastUnorderedElapsedCounterSource::now() {
#if defined(X86) && !defined(ZERO)
  static bool valid_rdtsc = Rdtsc::initialize();
  if (valid_rdtsc) {
    return Rdtsc::elapsed_counter();
  }
#endif
  return os::elapsed_counter();
}

The RDTSC instruction reads the time stamp counter on x86 processors:

The time stamp counter (TSC) is a hardware counter found in all contemporary x86 processors. The counter is implemented as a 64-bit model-specific register (MSR) that is incremented at every clock cycle. The RDTSC (“read time stamp counter”) register has been present since the original Pentium.

Already because of the access method, TSC provides a low-overhead and high-resolution way to obtain CPU timing information. This traditional premise was violated when such factors as system sleep states, CPU “hotplugging”, “hibernation”, and CPU frequency scaling were introduced to the x86 lineage. This was however mainly a short abruption: in many new x86 CPUs the time stamp counter is again invariant with respect to the stability of the clock frequency. Care should be however taken in implementations that rely on this assumption.

NetBSD manual

This instruction allows the OS to implement a monotonic real-time clock.

On non-x86 systems, os::elapsed_counter is used, which, surprise, calls os::javaTimeNanos:

jlong os::elapsed_counter() {
  return os::javaTimeNanos() - initial_time_count;
}

Source of System.nanoTime

Now the remaining question is: Does System.nanoTime also call os::javaTimeNanos? The method is defined in jvm.cpp:

JVM_LEAF(jlong, JVM_NanoTime(JNIEnv *env, jclass ignored))
  return os::javaTimeNanos();
JVM_END

So System.nanoTime is just a tiny wrapper around os::javaTimeNanos. This settles the original question for non-x86 CPUs. But what about x86 CPUs?

First for Mac OS: It boils down to calling mach_absolute_time:

Returns current value of a clock that increments monotonically in tick units (starting at an arbitrary point), this clock does not increment while the system is asleep.

Apple developer documentation

Information on the implementation of this method is scarce, but source code from 2007 suggests that mach_absolute_time is RDTSC based. So there is (probably) no difference between JFR timestamps and System.nanoTime on Mac OS, regardless of the CPU architecture.

Now on Linux: here, os::javaTimeNanos is implemented using clock_gettime(CLOCK_MONOTONIC, ...):

CLOCK_MONOTONIC Clock that cannot be set and represents monotonic time since some unspecified starting point.

Linux man page

I tried to find something in the Linux kernel sources, but they are slightly too complicated to find the solution quickly, so I had to look elsewhere. Someone asked a question on clock_gettime on StackOverflow, and the answers resolve our question too: clock_gettime(CLOCK_MONOTONIC, ...) seems to use RDTSC.
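To make this tangible, here is a hedged sketch of what os::javaTimeNanos essentially boils down to on Linux (heavily simplified; the actual OpenJDK code handles more cases):

#include <cstdint>
#include <ctime>

int64_t javaTimeNanos() {
  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts); // monotonic, possibly RDTSC-backed
  return int64_t(ts.tv_sec) * 1000000000L + ts.tv_nsec;
}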

Conclusion

JFR timestamps and System.nanoTime seem to use the same time source on all Unix systems, regardless of the CPU architecture, as far as I understand it.

You can stop the JVM from using RDTSC with the -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps JVM flags (thanks to Richard Startin for pointing this out). You can read Markus Grönlund’s mail on Timing Differences Between JFR and GC Logs for another take on JFR timestamps (or skip ahead):

JFR performance as it relates to event generation, which is also functional for JFR, reduce to a large extent to how quickly a timestamp can be acquired. Since everything in JFR is an event, and an event will have at least one timestamp, and two timestamps for events that represent durations, the event generation leans heavily on the clock source. Clock access latencies is therefore of central importance for JFR, maybe even more so than correctness. And historically, operating systems have varied quite a bit when it comes to access latency and resolution for the performance timers they expose.

What you see in your example is that os::elapsed_counter() (which on Windows maps to QueryPerformanceCounter() with a JVM relative epoch offset) and the rdtsc() counter are disjoint epochs, and they are treated as such in Hotspot. Therefore, attempting to compare the raw counter values is not semantically valid.

Relying on and using rdtsc() come with disclaimers and problems and is generally not recommended. Apart from the historical and performance related aspects already detailed, here is a short description of how it is treated in JFR:

JFR will only attempt to use this source if it has the InvariantTSC property, with timestamp values only treated relative to some other, more stable, clock source. Each “chunk” (file) in JFR reifies a relative epoch, with the chunk start time anchored to a stable timestamp (on Windows this is UTC nanoseconds). rdtsc() timestamps for events generated during that epoch are only treated relative to this start time during post-processing, which gives very high resolution to JFR events. As JFR runs, new “chunks”, and therefore new time epochs, are constructed, continuously, each anchored anew to a stable timestamp.

The nature of rdtsc() querying different cores / sockets with no guarantee of them having been synchronized is of course a problem using this mechanism. However, over the years, possible skews have proven not as problematic as one might think in JFR. In general, the relative relations between the recorded JFR events give enough information to understand a situation and to solve a problem. Of course, there are exceptions, for example, when analyzing low-level aspects expecting high accuracy, usually involving some correlation to some other non-JFR related component. For these situations, an alternative is to turn off rdtsc() usages in JFR using the flags: -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps. JFR will now use os::elapsed_counter() as the time source. This comes with higher overhead, but if this overhead is not deemed problematic in an environment, then this is of course a better solution.

As other have already pointed out, there have been evolution in recent years in how operating systems provide performance counter information to user mode. It might very well be that now the access latencies are within acceptable overhead, combined with high timer resolution. If that is the case, the rdtsc() usages should be phased out due to its inherent problems. This requires a systematic investigation and some policy on how to handle older HW/SW combinations – if there needs to be a fallback to continue to use rdtsc(), it follows it is not feasible to phase it out completely.

Markus Grönlund

Difference between System.currentTimeMillis and System.nanoTime

This is not directly related to the original question but is nonetheless interesting: System.currentTimeMillis is implemented using clock_gettime(CLOCK_REALTIME, ...) on all CPU architectures:

CLOCK_REALTIME System-wide realtime clock. Setting this clock requires appropriate privileges.

Linux man page

This clock is not guaranteed to be monotonic:

CLOCK_REALTIME represents the machine’s best-guess as to the current wall-clock, time-of-day time. […] this means that CLOCK_REALTIME can jump forwards and backwards as the system time-of-day clock is changed, including by NTP.

CLOCK_MONOTONIC represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn’t affected by changes in the system time-of-day clock.

Ciro Santilli on StackOverflow

So does it make a difference? Probably only slightly, especially for shorter profiling runs. For longer runs, consider using System.nanoTime.

I hope you enjoyed coming down this rabbit hole with me and learned something about JFR internals along the way.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Using Firefox Profiler beyond the web

This blog post is the base for the second half of my upcoming talk at FOSDEM 2023 on the topic “Firefox Profiler beyond the web: Using Firefox Profiler to view Java profiling data.”

I detailed in my last blog post how the Firefox Profiler can be used to view Java profiling data:

But I’m of course not the only one who uses Firefox Profiler beyond the web, because using it has many advantages: You’re essentially getting a prototypical visualization for your data in an afternoon.

Other tools that use Firefox Profiler

There are other tools that use Firefox Profiler as their front end. A great example is the Rust profiler samply by Markus Stange, the initial developer of Firefox Profiler:

samply is a command line CPU profiler which uses the Firefox profiler as its UI.

At the moment it runs on macOS and Linux. Windows support is planned. samply is still under development and far from finished, but works quite well already.

Give it a try:

% cargo install samply
% samply record ./your-command your-arguments
GitHub

Another example is the python profiler FunctionTrace:

A graphical Python profiler that provides a clear view of your application’s execution while being both low-overhead and easy to use.

FunctionTrace supports all of the useful profiling views you’re familiar with, including Stack Charts, Flame Graphs, and Call Trees, thanks to its integration with the Firefox Profiler.

There are also non-open-source uses of Firefox Profiler; Luís Oliveira, for example, works on an integration with Lisp:

We’re using the Firefox Profiler to debug performance issues at the Dutch Railways dispatching center.

Basic Structure

I hope I convinced you that the Firefox Profiler is really great for visualizing profiling data, even if this data comes from the world beyond web UIs. If not, please read my previous article. The main part of adapting to the Firefox Profiler is converting your data into the profiler format. The data is stored as JSON in an (optionally zipped) file and can be loaded into the Firefox Profiler. See Loading in profiles from various sources for more information.

The basic structure of a tool using Firefox Profiler can be as follows, using my plugin as an example:

You have a converter from your profile format to the Firefox Profiler format. The converted file is then passed to the Firefox Profiler, either from profiler.firefox.com or a custom fork. You typically then wrap your UI and the converter together, hosting both on a small web server that runs on e.g. localhost. Hosting your own Firefox Profiler instance has two main advantages: first, you always have a tested combination of Firefox Profiler and converter; second, it works offline. The web server can then be embedded into a larger application, showing the UI using an embedded browser.

You can find the type definitions in the types folder of the Firefox Profiler repository. All of the following is based on these definitions. This part of the article will not focus on all the details, like markers, but rather give a bird’s-eye view of the format, so it will probably still apply to future revisions. I’ll also omit parts that are not that useful for non-web use cases. If you have any questions about the file format, feel free to ask them in the matrix channel.

The type definitions are written in flow. It is helpful to read its documentation if you want to understand the intricacies. But for now, it suffices to know that x?: type means that the property x is optional and that | denotes a union of alternative types.
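For example, the following made-up type (my own illustration, not part of the actual format) uses both features:

type Example = {|
  optionalProp?: number,      // may be absent
  eitherProp: number | null,  // either a number or null
|};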

Layout

A short interlude: the layout of the Firefox Profiler basically consists of a timeline view and a methods and timing view:

The timeline allows you to select specific threads and a time slice to view the details in the detail section below the timeline.

Overview

The following shows the main components of the profile format, omitting and summarizing many properties. This diagram should give a rough overview of what comes next:

Profile

The topmost level of a profile is the Profile type:

type Profile = {|
  meta: ProfileMeta,    // meta information, like application name
  libs: Lib[],          // used shared native libraries
  ...
  counters?: Counter[], // CPU and memory counters
  ...
  threads: Thread[],    // samples per thread
  ...
|};

A profile consists of the metadata, shared libraries, CPU and memory counters, and the rest of the data per thread.

ProfileMeta

A profile can have lots of metadata shown in the UI. The ProfileMeta type specifies this:

type ProfileMeta = {|
  // The interval at which the threads are sampled.
  interval: Milliseconds,
  // The number of milliseconds since midnight January 1, 1970 GMT.
  startTime: Milliseconds,
  // The number of milliseconds since midnight January 1, 1970 GMT.
  endTime?: Milliseconds,
  ...
  // The list of categories as provided by the platform. The categories are present for
  // all Firefox profiles, but imported profiles may not include any category support.
  // The front-end will provide a default list of categories, but the saved profile
  // will not include them.
  categories?: CategoryList,
  // The name of the product, most likely "Firefox".
  product: 'Firefox' | string,
  ...
  // Arguments to the program (currently only used for imported profiles)
  arguments?: string,
  ...
  // The amount of logically available CPU cores for the program.
  logicalCPUs?: number,
  ...
  symbolicated?: boolean,  // usually false for imported profiles
  symbolicationNotSupported?: boolean, // usually true for imported profiles
  // symbolication is usually not important for imported and converted profiles
  ...
  // Profile importers can optionally add information about where they are imported from.
  // They also use the "product" field in the meta information, but this is somewhat
  // ambiguous. This field, if present, is unambiguous that it was imported.
  importedFrom?: string,

  // The following are settings that are used to configure the views for
  // imported profiles, as some features do not make sense for them

  // Do not distinguish between different stack types?
  usesOnlyOneStackType?: boolean, // true in our use case
  // Hide the "implementation" information in the UI (see #3709)?
  doesNotUseFrameImplementation?: boolean, // true in our use case
  // Hide the "Look up the function name on Searchfox" menu entry?
  sourceCodeIsNotOnSearchfox?: boolean,    // true in our use case
  // Indexes of the threads that are initially visible in the UI.
  // This is useful for imported profiles for which the internal visibility score
  // ranking does not make sense.
  initialVisibleThreads?: ThreadIndex[],
  // Indexes of the threads that are initially selected in the UI.
  // This is also most useful for imported profiles where just using the first thread
  // of each process might not make sense.
  initialSelectedThreads?: ThreadIndex[],
  // Keep the defined thread order
  keepProfileThreadOrder?: boolean,
|};

And there is more. It might feel overwhelming, but this data structure also allows you to tailor the profiler UI slightly to your needs.

Category

Many parts of the profile are associated with a Category and a subcategory. A category is defined as:

type Category = {|
  name: string,
  color: string,
  subcategories: string[],
|};

Categories are referenced by their index in the category list of the ProfileMeta data structure, and subcategories by their index in the subcategories field of their parent category.
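As a small, hedged example (the names and colors are my own choice, not a canonical list), a category list for a Java profile could look like this:

categories = [
  { name: 'Java',   color: 'blue', subcategories: ['Other'] },
  { name: 'Native', color: 'red',  subcategories: ['Other'] }
]

A stack referencing category 0 and subcategory 0 would then be displayed as “Java / Other”.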

The categories are used to assign a color to the squares in front of the method names and give more information on every call tree node in the sidebar:

Now to the individual threads:

Thread

The Thread data structure combines all information related to a single thread. There can be multiple threads per process id. The thread with the name GeckoMain is handled differently from the others: it is the main thread shown in the process timeline.

type Thread = {|
  ...
  processStartupTime: Milliseconds,
  processShutdownTime: Milliseconds | null,
  registerTime: Milliseconds,
  unregisterTime: Milliseconds | null,
  ...
  name: string,
  ...
  pid: Pid,
  tid: Tid,
  ...
  // Strings for profiles are collected into a single table, and are referred to by
  // their index by other tables.
  stringTable: UniqueStringArray,
  ...
  samples: SamplesTable,
  ...
  stackTable: StackTable,
  frameTable: FrameTable,
  funcTable: FuncTable,
  resourceTable: ResourceTable,
  ...
|};

The file format stores all stack traces in a space-efficient format that the front end can process quickly. It uses an array of strings (the stringTable) to store all strings that appear in the stack traces (like function names); the other data structures refer to strings only by their index in this array.
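As a hedged mini-example in the style of the snippets that follow: for three functions A, B, and C, the string table and the name column of the function table could look like this:

stringTable = ['A', 'B', 'C']
FuncTable = {
  name: [0, 1, 2], // indices into stringTable
  ...
}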

Samples Table

This data structure associates a captured stack with a capture time and an optional weight:

/**
 * The Gecko Profiler records samples of what function was currently being executed, and
 * the callstack that is associated with it. This is done at a fixed but configurable
 * rate, e.g. every 1 millisecond. This table represents the minimal amount of
 * information that is needed to represent that sampled function. Most of the entries
 * are indices into other tables.
 */
type SamplesTable = {|
  ...
  stack: Array<IndexIntoStackTable | null>,
  time: Milliseconds[],
  // An optional weight array. If not present, then the weight is assumed to be 1.
  // See the WeightType type for more information.
  weight: null | number[],
  weightType: WeightType,  // 'samples' or 'tracing-ms'
  // CPU usage value of the current thread. Its values are null only if the back-end
  // fails to get the CPU usage from operating system.
  threadCPUDelta?: Array<number | null>,
  length: number,
|};

Filling this with data from a sampling profiler is easy: just add references to the stacks and their occurrence times. For example, assume you sampled the stack A-B at time 0 and A-B-C at time 2; then the samples table is:

SamplesTable = {
  stack: [A-B, A-B-C], // references into the stack table
  time:  [0, 2],
  weight: null,
  weightType: 'samples',
  length: 2
}

Filling the threadCPUDelta property allows you to specify the CPU time a thread has used since the last sample. The Firefox Profiler uses this property to show the CPU usage curves in the timeline:

Stack Table

All stacks are stored in the stack table using a prefix array:

type StackTable = {|
  frame: IndexIntoFrameTable[],
  // Imported profiles may not have categories. In this case fill the array with 0s.
  category: IndexIntoCategoryList[],
  subcategory: IndexIntoSubcategoryListForCategory[],
  prefix: Array<IndexIntoStackTable | null>,
  length: number,
|};

The category and subcategory of a stack give information on the whole stack; the frame refers just to its topmost frame. The prefix references the stack without the topmost frame, or is null if the stack consists of only one frame. This allows the efficient storage of stacks.

Now consider our example from before. We could store the stacks A, A-B, and A-B-C as follows:

StackTable = {
  frame:  [A, B, C], // references into the frame table
  prefix: [null, 0, 1],
  ...
}

Frame Table

The frames themselves are stored in the frame table:

/**
 * Frames contain the context information about the function execution at the moment in
 * time. The caller/callee relationship between frames is defined by the StackTable.
 */
type FrameTable = {|
  // If this is a frame for native code, the address is the address of the frame's
  // assembly instruction,  relative to the native library that contains it.
  address: Array<Address | -1>,

  // The inline depth for this frame. If there is an inline stack at an address,
  // we create multiple frames with the same address, one for each depth.
  // The outermost frame always has depth 0.
  inlineDepth: number[],

  category: (IndexIntoCategoryList | null)[],
  subcategory: (IndexIntoSubcategoryListForCategory | null)[],
  func: IndexIntoFuncTable[],
  ...
  line: (number | null)[],
  column: (number | null)[],
  length: number,
|};

Each frame is related to a function, which is in turn stored in the FuncTable.

Func Table

The function table stores all functions with some metadata:

type FuncTable = {|
  // The function name.
  name: Array<IndexIntoStringTable>,

  // isJS and relevantForJS describe the function type. Non-JavaScript functions
  // can be marked as "relevant for JS" so that for example DOM API label functions
  // will show up in any JavaScript stack views.
  // It may be worth combining these two fields into one:
  // https://github.com/firefox-devtools/profiler/issues/2543
  isJS: Array<boolean>,
  relevantForJS: Array<boolean>,

  // The resource describes "Which bag of code did this function come from?".
  // For JS functions, the resource is of type addon, webhost, otherhost, or url.
  // For native functions, the resource is of type library.
  // For labels and for other unidentified functions, we set the resource to -1.
  resource: Array<IndexIntoResourceTable | -1>,

  // These are non-null for JS functions only. The line and column describe the
  // location of the *start* of the JS function. As for the information about which
  // which lines / columns inside the function were actually hit during execution,
  // that information is stored in the frameTable, not in the funcTable.
  fileName: Array<IndexIntoStringTable | null>,
  lineNumber: Array<number | null>,
  columnNumber: Array<number | null>,

  length: number,
|};

Resource Table

The last table I’ll show in this article is the resource table. What it contains depends on you and what you map to it:

/**
 * The ResourceTable holds additional information about functions. It tends to contain
 * sparse arrays. Multiple functions can point to the same resource.
 */
type ResourceTable = {|
  length: number,
  ...
  name: Array<IndexIntoStringTable>,
  ...
  // 0: unknown, library: 1, addon: 2, webhost: 3, otherhost: 4, url: 5
  type: resourceTypeEnum[],
|};

This was quite a technical article, so thanks for reading to the end. I hope it helps you when you target the Firefox Profiler, and see you at the next blog post.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Firefox Profiler beyond the web

This blog post is the base of the first half of my upcoming talk at FOSDEM 2023 on the topic “Firefox Profiler beyond the web: Using Firefox Profiler to view Java profiling data.”

For the impatient: Take a look at my Java JFR Profiler IntelliJ plugin to easily profile your application and view JFR files directly in your IDE.

I got involved in the Firefox Profiler development, spending significant portions of my time at SAP in the last half year on it. It has been an exciting ride. I learned a lot and contributed a few features. So I was essentially developing React code when I wasn’t working on ASGST or other OpenJDK-related tools. But you may well ask the most important of all questions: Why? Why did I spend so much time on a profiler for JavaScript and the Firefox browser? It all started in August 2022…

How did I end up there?

I developed code related to the Java debugging protocol in the second quarter of 2022. I had grand plans, but it turned out to be far more complicated than expected; that is a story for another blog post. You can read my short primer on Java debugging internals to get a glimpse of this work. During development, I encountered a problem: How can I profile my code, especially unit tests? I had many unit tests like the following, which tested specific aspects of my code:

@Test
public void testEvaluateSwitchStatement() {
    var program = Program.parse("((= x 1) (switch (const x)" +
        "(case 1 (= v (collect 2))) (case 2 (= v (collect 1)))" +
        "))");
    var funcs = new RecordingFunctions();
    new Evaluator(vm, funcs).evaluate(program);
    assertEquals(List.of(wrap(2L)), funcs.values);
}

I wanted to use an open-source profiler, as I had no access to the paid version of IntelliJ, which includes profiling support. Multiple tools are available, but the choice essentially comes down to async-profiler and JMC. Both have their advantages and disadvantages regarding their UI, and the tradeoff is ease of use vs. available features:

Async-profiler and its profiling visualizations are easy to use but do not have that many features. The only available visualization is flame graphs, with minimal interactivity: just zooming is supported. Flame graphs are the bread and butter of profiling:

If you want more visualizations, like a tree view, timelines, or a JFR event view, you can export your profile into JFR format (or use JFR to record your profile directly) and view it in JMC. But the difference in ease of use between the two is vast: whereas the flame graphs of async-profiler are usable by anyone after a short introduction, JMC has a steep learning curve; it is currently more a tool for experts to dig deep into the profiling data. This observation leads us to the first problem: there is, to my knowledge, no open-source tool that offers more visualizations than just flame graphs and is as easy to use.

Another problem with both async-profiler and JFR is the missing integration into IDEs. I would like to just click a button in a context menu to profile an individual test case:

Without the hassle of creating a main method that just calls this method; I want to be able to profile it just by modifying the JVM options of a run configuration.

I thought I was probably not the only one with this use case who had stumbled upon these two problems impeding profiling. I had some spare time in August, so I looked for ways to build such a tool myself.

Building a basic version of an IDE plugin that solves the second of my two problems is relatively easy: there is already the open-source profiling plugin Panda by Rastislav Papp, on which I based mine. Panda has only a simple tree-view visualization, so it does not cover my first problem. I still had to figure out how to implement the UI. Implementing it directly in the IDE in Java is cumbersome, so I decided early on to use an embedded browser. I considered implementing the UI myself, with the help of libraries like d3-flamegraph, or extending speedscope, but this proved too much work. And I had no experience building production-ready web applications or, for that matter, React.

Advantages of Firefox Profiler

Here comes the Firefox Profiler: it might seem illogical to use it for a use case its developers never thought about, but it has a lot going for it:

  • it has multiple visualizations and a timeline to select time slices
  • it is open-source but backed by a large company
  • it is actively developed by a small group of people
  • it has a matrix channel where you can ask a lot of questions and get great answers
  • it has a well-defined profile format which is rather extensively documented
  • its developers were open to collaborating with me, adapting Firefox Profiler for non-web use cases

It has still been a lot of work to add the necessary features for my use case, and it is an ongoing effort to integrate them into the mainline Firefox Profiler. But if the current Firefox Profiler meets all your needs UI-wise, then using it beyond the web is just a matter of writing a profiler.

Just keep in mind that you’ll have to map your data onto the profile data structure of Firefox Profiler.

My Java Profiler IntelliJ Plugin

My Java JFR profiler (GitHub) plugin is the result of all my efforts:

It uses my Firefox Profiler fork, which includes additions not yet in the upstream repository, and it has a modular implementation so that you can use the JFR to Firefox Profiler converter independently. The plugin supports gathering profiles using JFR and async-profiler (via ap-loader) and opening arbitrary JFR files (as long as they are not too large); the previous image with the different run configurations is from my plugin:

The plugin integrates with your IDE, navigating to a method in the source code when you double-click a method in the profile view. Shift double-click, and it shows you the code with the profiling information on the side:

Besides that, it has support for showing information on all JFR events:

The Firefox Profiler view contains a Function Table, Flame Graph, and Stack Chart view, combined with a timeline on top, so it truly solves the first problem of visualizations. And it solves the second problem, as profiling with JFR or async-profiler can’t be more accessible than clicking a single button.

The plugin is under active development and currently in the beta stage, so give it a try and feel free to create issues on GitHub. You can pick the Java JFR Profiler plugin from the JetBrains marketplace.

I will write an entire blog post covering the plugin usage in the next few weeks, so stay tuned for more information and possibly a screencast.

I’ll release another, more technical blog post in the next few days, which covers other tools that use Firefox Profiler as a visualization and the profile format. This will be the second part of my FOSDEM talk. I hope to see you there or in the stream.

This blog post and the IntelliJ Plugin are part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Writing a Profiler from Scratch: The Profiling Loop

This is the second post in the series on building a profiler from scratch using AsyncGetCallTrace: today, we’re covering wall-clock profiling and how to obtain the stack traces. If you’re unfamiliar with AsyncGetCallTrace, please check out my previous article in the series.

Our goal today is to essentially write the primary loop of a profiler and do the following every n milliseconds (where n is our chosen profiling interval):

Schema of a profiler loop

The question that comes up is: Which threads should we sample? If we only sample currently running threads, then we have cpu-time profiling, but if we sample all threads (running and idle), we have the far more popular wall-clock profiling. The difference between both types is best described by Richard Warburton and Sadiq Jaffer in their article on opsian.com:

CPU Time. There’s very little point in trying to measure what’s happening off-cpu if you’ve got a single very hot thread, as many event-loop style applications have. You may see plenty of time being spent in other threads that are waiting around performing background work but aren’t actually a bottleneck in terms of your application. So CPU Time is useful when you know you’re bottlenecked on a CPU problem.

Wallclock Time is useful for problems where CPU utilisation seems low but application throughput or latency doesn’t meet expectations. This is because threads are probably blocked on something, for example lock contention, times when you’re waiting for downstream blocking network IO to happen or running with too many threads so some of them don’t get scheduled. The trigger for looking at Wallclock time would be low CPU usage during your period of performance issues.

Solving JVM Performance Problems with Profilers: Wallclock vs CPU Time Edition

The choice of the sampling mode, therefore, depends on the actual use case.

How do we obtain the list of threads in the first place? We could use the aptly named GetAllThreads method, but we cannot use the returned thread objects to quickly get the thread ids we need for sending signals to the threads. There is another possibility: registering for ThreadStart and ThreadEnd events and collecting the threads as they are created:

void JNICALL
ThreadStart(jvmtiEnv *jvmti_env,
            JNIEnv* jni_env,
            jthread thread)

Thread start events are generated by a new thread before its initial method executes. A thread may be listed in the array returned by GetAllThreads before its thread start event is generated. It is possible for other events to be generated on a thread before its thread start event. The event is sent on the newly started thread.

JVMTI documentation

We create a class called ThreadMap for this purpose (stored in a global variable). It maps thread ids to internal ids (because OS thread ids might be reused during the lifetime of a process), to the jthread (for JVMTI operations), and to the pthread (for sending signals); these internal ids are in turn mapped to thread names, so we can display them later. It has to be thread-safe, as we’re accessing it from multiple threads in parallel, so we have to create a lock for its internal data structures. This is quite simple with modern C++:

struct ValidThreadInfo {
    jthread jthread;
    pthread_t pthread;
    bool is_running;
    long id;
};

class ThreadMap {
  std::recursive_mutex m;
  std::unordered_map<pid_t, ValidThreadInfo> map;
  std::vector<std::string> names;

  public:

    void add(pid_t pid, std::string name, jthread thread) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map[pid] = ValidThreadInfo{.jthread = thread, .pthread = pthread_self(), .id = (long)names.size()};
      names.emplace_back(name);
    }

    void remove(pid_t pid) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map.erase(pid);
    }
};
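Later, the sampler also needs to enumerate the collected threads and look up their info. Here is a minimal, hedged sketch of the two accessors used further below, placed inside the ThreadMap class (the actual implementation in the repository may differ; it requires the <optional> header):

    std::vector<pid_t> get_all_threads() {
      const std::lock_guard<std::recursive_mutex> lock(m);
      std::vector<pid_t> threads;
      for (const auto &entry : map) {
        threads.push_back(entry.first); // all currently known thread ids
      }
      return threads;
    }

    std::optional<ValidThreadInfo> get_info(pid_t pid) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      auto it = map.find(pid);
      if (it == map.end()) {
        return {}; // the thread is unknown or has already ended
      }
      return it->second;
    }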

Obtaining the thread id of the current thread leads to our first platform-dependent code: we use a syscall on Linux and a non-POSIX pthread method on Mac:

pid_t get_thread_id() {
  #if defined(__APPLE__) && defined(__MACH__)
  uint64_t tid;
  pthread_threadid_np(NULL, &tid);
  return (pid_t) tid;
  #else
  return syscall(SYS_gettid);
  #endif
}

Yes, we could quickly implement our profiler for other BSDs like FreeBSD, but MacOS is the only one I have at hand.

Remember to also add the main thread to the collection by calling the OnThreadStart method in the OnVMInit method, which handles the start of the VM.
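A hedged sketch of how the event wiring could look in the agent (error handling is omitted, and a real agent would Deallocate the thread name after use):

void JNICALL OnThreadStart(jvmtiEnv *jvmti_env, JNIEnv *jni_env, jthread thread) {
  jvmtiThreadInfo info;
  jvmti_env->GetThreadInfo(thread, &info);
  // runs on the newly started thread, so pthread_self() in add() is correct
  thread_map.add(get_thread_id(), info.name, thread);
}

void JNICALL OnThreadEnd(jvmtiEnv *jvmti_env, JNIEnv *jni_env, jthread thread) {
  thread_map.remove(get_thread_id());
}

static void registerThreadEvents(jvmtiEnv *jvmti) {
  jvmtiEventCallbacks callbacks = {};
  callbacks.ThreadStart = &OnThreadStart;
  callbacks.ThreadEnd = &OnThreadEnd;
  jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
  jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_THREAD_START, NULL);
  jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_THREAD_END, NULL);
}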

Now we have an up-to-date list of all threads. For wall-clock profiling, this is all we need, but for cpu-time profiling, we have to narrow it down to the running threads. We conveniently stored the jthread objects in our thread map, so we can now use the GetThreadState method:

bool is_thread_running(jthread thread) {
  jint state;
  auto err = jvmti->GetThreadState(thread, &state);
  // the state is a bit mask; handle errors as not runnable
  return err == 0 && (state & JVMTI_THREAD_STATE_RUNNABLE) != 0;
}

This leaves us with a list of threads that we can sample. It is usually not a good idea to sample all available threads: with wall-clock profiling, the list of threads can be so large that sampling all of them is too costly. Therefore, one typically takes a random subset of the available threads. Async-profiler, for example, takes 8, and we do the same.

Taking a random subset is quite cumbersome in C, but C++ has had a neat library function since C++11: std::shuffle. We can use it to implement the random selection:

std::vector<pid_t> get_shuffled_threads() {
  const std::lock_guard<std::recursive_mutex> lock(m);
  std::vector<pid_t> threads = get_all_threads();
  std::random_device rd;
  std::mt19937 g(rd());
  std::shuffle(threads.begin(), threads.end(), g);
  return std::vector(threads.begin(), threads.begin() + std::min(MAX_THREADS_PER_ITERATION, (int)threads.size()));
}

Be aware that we had to change the mutex to a mutex which allows recursive reentrance, as get_all_threads also acquires the mutex.

The next step is creating a sampling thread that runs the loop I described at the beginning of this article. This supersedes the itimer sampler from the previous post, as that mechanism does not support sending signals to only a random subset of threads. The sampling thread is started in the OnVMInit handler and joined in the OnAgentUnload handler. Its implementation is rather simple:

std::atomic<bool> shouldStop = false;
std::thread samplerThread;

static void sampleLoop() {
  initSampler();
  std::chrono::nanoseconds interval{interval_ns};
  while (!shouldStop) {
    auto start = std::chrono::system_clock::now();
    sampleThreads();
    auto duration = std::chrono::system_clock::now() - start;
    // takes into account that sampleThreads() takes some time
    auto sleep = interval - duration;
    if (std::chrono::seconds::zero() < sleep) {
      std::this_thread::sleep_for(sleep);
    }
  }
  endSampler(); // print the collected information
}

static void startSamplerThread() {
  samplerThread = std::thread(sampleLoop);
}
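The corresponding shutdown code, called in the OnAgentUnload handler, could look like this (a minimal sketch):

static void stopSamplerThread() {
  shouldStop = true;    // ask the loop to exit after the current iteration
  samplerThread.join(); // wait until endSampler() has finished
}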

The initSampler function preallocates the traces for N = 8 threads, as we decided to sample at most that many threads per iteration. The signal handlers can later use these preallocated traces to call AsyncGetCallTrace:

const int MAX_DEPTH = 512; // max number of frames to capture

static ASGCT_CallFrame global_frames[MAX_DEPTH * MAX_THREADS_PER_ITERATION];
static ASGCT_CallTrace global_traces[MAX_THREADS_PER_ITERATION];


static void initSampler() {
  for (int i = 0; i < MAX_THREADS_PER_ITERATION; i++) {
    global_traces[i].frames = global_frames + i * MAX_DEPTH;
    global_traces[i].num_frames = 0;
    global_traces[i].env_id = env;
  }
}

The sampleThreads function sends a signal to each selected thread using pthread_kill; this is why we store the pthread for every thread. But how does a signal handler know which of the traces to use? And how does the sampleThreads function know when all signal handlers have finished?

It uses two atomic integer variables for this purpose:

  • available_trace: the next unused available trace index in the global_traces array
  • stored_traces: the number of traces already stored

The signal handler is then quite simple:

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  asgct(&global_traces[available_trace++], MAX_DEPTH, ucontext);
  stored_traces++;
}
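One detail that is easy to miss: the SIGPROF handler has to be installed once (e.g., during sampler initialization) before the first pthread_kill. A hedged sketch of this setup; SA_SIGINFO selects the three-argument handler variant used above:

static bool installSignalHandler() {
  struct sigaction sa = {};
  sigemptyset(&sa.sa_mask);
  sa.sa_sigaction = signalHandler;
  sa.sa_flags = SA_RESTART | SA_SIGINFO;
  return sigaction(SIGPROF, &sa, nullptr) == 0;
}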

The sampleThreads function then only has to wait until stored_traces reaches the expected count:

static void sampleThreads() {
  // reset the global variables
  available_trace = 0;
  stored_traces = 0;
  // get the threads to sample
  auto threads = thread_map.get_shuffled_threads();
  for (pid_t thread : threads) {
    auto info = thread_map.get_info(thread);
    if (info) {
      // send a signal to the thread
      pthread_kill(info->pthread, SIGPROF);
    }
  }
  // wait till all handlers have stored their traces
  while (stored_traces < threads.size());
  // process the traces
  processTraces(threads.size());
}

We keep the processing of the traces from the last blog post and just store the total number of traces as well as the number of failed traces:

static void processTraces(size_t num_threads) {
  for (int i = 0; i < num_threads; i++) {
    auto& trace = global_traces[i];
    if (trace.num_frames <= 0) {
      failedTraces++;
    }
    totalTraces++;
  }
}

The two global variables don’t have to be atomic anymore, as only the sampler thread modifies them. A regular profiler would, of course, use the traces to obtain information on the executed methods, but this is a topic for the next blog post in this series, so stay tuned. You can find the source code for the project on GitHub.

This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Writing a Profiler from Scratch: Introduction

Async-profiler is undoubtedly one of the most used open-source Java profilers out there. Its most commonly used feature is sampling a given Java program’s stack traces and visualizing them as a flame graph. I would recommend reading the excellent async-profiler manual by Krzysztof Ślusarski or taking a look at my profiling playlist on YouTube if you’re new to this tool. After using the async-profiler for a while, you might wonder: How does this tool work? And, of course, if you’re someone like me: Could I write a stripped-down version to learn how it really works? I did this a few years back in university when I wanted to understand how LR(1) parser generators work: Turns out it is a lot of work, but my parser generator can output GIFs.

This blog series will take you with me on this journey, where we start with a small tool and see where curiosity leads us. It’s helpful if you know some C/C++ and have profiled Java using async-profiler before, as this helps you better understand my code snippets and the profiling jargon.

All the code will be present on GitHub, and I’m happy about any suggestions, bug reports, and comments, either under this blog post or on GitHub.

The aim of this series is

  • to understand how a sampling profiler works
  • to write a simple profiler, and possibly related tools, based on the AsyncGetCallTrace API
  • to have some good tooling which I use in OpenJDK tests, to make the used APIs safer and more reliable

The aim is not to write anything vaguely production ready: async-profiler has too many features to count and is battle-tested.

Come with me on this journey of knowledge, starting with a simple tool that just counts how many times AsyncGetCallTrace could not trace the stack.

The fundamental API that we will be relying on is the AsyncGetCallTrace API of the OpenJDK: This API can be used to obtain the stack trace (which methods are on the stack) for a given thread at any time. This makes it quite useful for writing accurate profilers. The API was introduced in November 2002 for Sun Studio. However, Sun removed it in January 2003 and demoted it to an internal API, but most profiling applications still use it. The following part of this post is inspired by Nitsan Wakart’s post, The Pros and Cons of AsyncGetCallTrace Profilers.

The AsyncGetCallTrace API is defined in the forte.cpp file:

typedef struct {
  jint lineno;         // BCI in the source file, or < 0 for native methods
  jmethodID method_id; // method executed in this frame
} ASGCT_CallFrame;

typedef struct {
  JNIEnv *env_id;   // Env where trace was recorded
  jint num_frames;  // number of frames in this trace, < 0 gives us an error code
  ASGCT_CallFrame *frames; // recorded frames 
} ASGCT_CallTrace; 

void AsyncGetCallTrace(ASGCT_CallTrace *trace, // pre-allocated trace to fill
                       jint depth,             // max number of frames to walk
                       void* ucontext);        // signal context

One typically uses this API by pinging a thread with a signal, stopping the thread and invoking the signal handler, which in turn calls AsyncGetCallTrace with the execution context of the stopped thread (the ucontext), so that AsyncGetCallTrace can walk the thread’s stack, skipping all C/C++ frames and storing only the native (from native methods) and Java frames in the frames array. The signal handler has to process the trace, but that is a topic for another post; for now, we will just store the number of times that AsyncGetCallTrace was successful and unsuccessful.

Be aware that we cannot allocate any memory outside the stack in a signal handler. For this reason, we have to preallocate the data structures for AsyncGetCallTrace. Only a few C library functions are guaranteed to be signal-safe. To learn more about signals, consider reading Introduction To Unix Signals Programming or similar sources. But don’t be confused by the terms thread and process: in Unix (Mac OS, Linux, BSD), threads are processes that share the same address space, and every (non-virtual/Loom) thread in Java is backed by an OS thread.

As an example of calling AsyncGetCallTrace, consider profiling the following Java code:

public class BasicSample {

    public void waitForever() throws InterruptedException {
        System.out.print("Waiting forever...");
        for (int i = 0; i < 100; i++) {
            Thread.sleep(10);
            System.out.print(".");
        }
        System.out.println("done");
    }

    public static void main(String[] args) throws InterruptedException {
        new BasicSample().waitForever();
    }
}

During profiling, we call AsyncGetCallTrace often, but let’s visualize a trace when the JVM runs one of the println lines.

AsyncGetCallTrace on a small example, using the demo code for JEP 435

Interrupting a thread at any point and running some code can have stability implications. If you’re interested in them in the context of AsyncGetCallTrace, head over to the async-profiler manual, where I co-wrote the section on this topic. This small sample tool seems to be quite good at triggering a specific fault in JDK 19+36; run the tool from this blog post yourself to find it.

The signal handler in our small example is the following:

std::atomic<size_t> failedTraces = 0;
std::atomic<size_t> totalTraces = 0;

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  const int MAX_DEPTH = 512; // max number of frames to capture
  static ASGCT_CallFrame frames[MAX_DEPTH];
  ASGCT_CallTrace trace;
  trace.frames = frames;
  trace.num_frames = 0;
  trace.env_id = env; // we obtained this via the OnVMInit hook
  
  // call AsyncGetCallTrace
  asgct(&trace, MAX_DEPTH, ucontext);

  // process the results
  totalTraces++;
  if (trace.num_frames < 0) {
    failedTraces++;
  }
}

We use atomic variables here to increment the two counting variables in parallel. We cannot use any locks as creating locks is not signal-safe.

As you can see in the signal handler, we cannot call AsyncGetCallTrace directly (we call it via the asgct function pointer), as it is not exported in any JVM header. So we have to obtain a pointer to this function via dlsym at startup, which is a bit brittle:

static void initASGCT() {
  asgct = reinterpret_cast<ASGCTType>(dlsym(RTLD_DEFAULT, "AsyncGetCallTrace"));
  if (asgct == NULL) {
    fprintf(stderr, "=== ASGCT not found ===\n");
    exit(1);
  }
}

Additionally, we have to copy the declarations for ASGCT_CallFrame, ASGCT_CallTrace, and AsyncGetCallTrace into our project.

After writing a signal handler, we need a mechanism to create the signals. There are multiple ways, like perf or a thread that signals all other threads every few milliseconds, but we’ll use the most straightforward option: a timer:

static bool startITimerSampler() {
  time_t sec = interval_ns / 1000000000;
  suseconds_t usec = (interval_ns % 1000000000) / 1000;
  struct itimerval tv = {{sec, usec}, {sec, usec}};

  // ...

  if (setitimer(ITIMER_PROF, &tv, NULL) != 0) {
    return false;
  }
  return true;
}

Our code uses the timers in PROF mode: “A profiling timer that counts both processor time used by the process, and processor time spent in system calls on behalf of the process. This timer sends a SIGPROF signal to the process when it expires.” (see gnu.org) The result is roughly similar to the CPU event of the async-profiler. It is inaccurate, but we’ll tackle this problem in another blog post.

You can find the final code in the GitHub repo as libSmallProfiler.cpp. It includes all the boilerplate code for JVMTI agents that I omitted in this blog post for brevity. Feel free to file issues or PRs with improvements or suggestions there. When we finally run the tool with a JVM and the example Java program via java -agentpath:cpp/libSmallProfiler.dylib=interval=0.001s -cp samples BasicSample, we get the following output:

Waiting forever.......................................................................................................done
Failed traces:          5
Total traces:          15
Failed ratio:       33.33%

This tool might seem rather useless, but you can adjust its sampling interval by specifying the interval option: this makes it quite helpful for testing the stack-walking code of AsyncGetCallTrace.

I hope I didn’t frighten you too much with all this Unix C, and I hope to see you again in around two weeks for my next blog post, in which we create a tool that outputs a method list.

The code of this blog post is based on the libAsyncGetCallTraceTest.cpp and the libAsyncGetStackTraceSampler.cpp of the OpenJDK. This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.