Writing a Profiler in 240 Lines of Pure Java

Posted on March 27, 2023 by Johannes Bechberger

A few months back, I started writing a profiler from scratch, and the code since became the base of my profiler validation tools. The only problem with this project: I wanted to write a proper non-safepoint-biased profiler from scratch. This is a noble effort, but it requires lots C/C++/Unix programming which is finicky, and not everyone can read C/C++ code.

For people unfamiliar with safepoint bias: A safepoint is a point in time where the JVM has a known defined state, and all threads have stopped. The JVM itself needs safepoints to do major garbage collections, Class definitions, method deoptimizations, and more. Threads are regularly checking whether they should get into a safepoint, for example, at method entry, exit, or loop backjumps. A profiler that only profiles at a safepoint have an inherent bias because it only includes frames from the locations inside methods where Threads check for a safepoint. The only advantage is that the stack-walking at safepoints is slightly less error-prone, as there are fewer mutations of heap and stack. For more information, consider reading the excellent article Java Safepoint and Async Profiling by Seetha Wenner, the more technical one by JP Bempel, or the classic article Safepoints: Meaning, Side Effects and Overheads by Nitsan Wakart. To conclude: Safepoint-biased profilers don’t give you a holistic view of your application, but can still be helpful to analyze major performance issues where you look at the bigger picture.

People on the hackernews thread on this blog post pointed out that the code has potentially some concurrency and publication issues. I’ll fixed the code in the GitHub repository, but kept the old code here. The modifications are minor.

This blog post aims to develop a tiny Java profiler in pure Java code that everyone can understand. Profilers are not rocket science, and ignoring safepoint-bias, we can write a usable profiler that outputs a flame graph in just 240 lines of code.

You can find the whole project on GitHub. Feel free to use it as a base for your adventures (and if you do, feel free to write me on Twitter, where I regularly post on profiling-related topics).

We implement the profiler in a daemon thread started by a Java agent. This allows us to start and run the profiler alongside the Java program we want to profile. The main parts of the profiler are:

Main: Entry point of the Java agent and starter of the profiling thread
Options: Parses and stores the agent options
Profiler: Contains the profiling loop
Store: Stores and outputs the collected results

Main Class

We start by implementing the agent entry points:

public class Main {
    public static void agentmain(String agentArgs) {
        premain(agentArgs);
    }

    public static void premain(String agentArgs) {
        Main main = new Main();
        main.run(new Options(agentArgs));
    }

    private void run(Options options) {
        Thread t = new Thread(new Profiler(options));
        t.setDaemon(true);
        t.setName("Profiler");
        t.start();
    }
}

The premain is called when the agent is attached to the JVM at the start. This is typical because the user passed the -javagent to the JVM. In our example, this means that the user runs Java with

java -javaagent:./target/tiny_profiler.jar=agentArgs ...

But there is also the possibility that the user attaches the agent at runtime. In this case, the JVM calls the method agentmain. To learn more about Java agent, visit the JDK documentation.

Please be aware that we have to set the Premain-Class and the Agent-Class attributes in the MANIFEST file of our resulting JAR file.

Our Java agent parses the agent arguments to get the options. The options are modeled and parsed by the Options class:

public class Options {
    /** interval option */
    private Duration interval = Duration.ofMillis(10);

    /** flamegraph option */
    private Optional<Path> flamePath;

    /** table option */
    private boolean printMethodTable = true;
    ...
}

The exciting part of the Main class is its run method: The Profiler class implements the Runnable interface so that we can create a thread directly:

Thread t = new Thread(new Profiler(options));

We then mark the profiler thread as a daemon thread; this means that the JVM does terminate at the end of the profiled application even when the profiler thread is running:

t.setDaemon(true);

No, we’re almost finished; we only have to start the thread. Before we do this, we name the thread, this is not required, but it makes debugging easier.

t.setName("Profiler");
t.start();

Profiler Class

The actual sampling takes place in the Profiler class:

public class Profiler implements Runnable {
    private final Options options;
    private final Store store;

    public Profiler(Options options) {
        this.options = options;
        this.store = new Store(options.getFlamePath());
        Runtime.getRuntime().addShutdownHook(new Thread(this::onEnd));
    }

    private static void sleep(Duration duration) {
        // ...
    }

    @Override
    public void run() {
        while (true) {
            Duration start = Duration.ofNanos(System.nanoTime());
            sample();
            Duration duration = Duration.ofNanos(System.nanoTime())
                                        .minus(start);
            Duration sleep = options.getInterval().minus(duration);
            sleep(sleep);
        }
    }

    private void sample() {
        Thread.getAllStackTraces().forEach(
          (thread, stackTraceElements) -> {
            if (!thread.isDaemon()) { 
                // exclude daemon threads
                store.addSample(stackTraceElements);
            }
        });
    }

    private void onEnd() {
        if (options.printMethodTable()) {
            store.printMethodTable();
        }
        store.storeFlameGraphIfNeeded();
    }

We start by looking at the constructor. The interesting part is

Runtime.getRuntime().addShutdownHook(new Thread(this::onEnd));

which causes the JVM to call the Profiler::onEnd when it shuts down. This is important as the profiler thread is silently aborted, and we still want to print the captured results. You can read more on shutdown hooks in the Java documentation.

After this, we take a look at the profiling loop in the run method:

while (true) {
    Duration start = Duration.ofNanos(System.nanoTime());
    sample();
    Duration duration = Duration.ofNanos(System.nanoTime())
                                .minus(start);
    Duration sleep = options.getInterval().minus(duration);
    sleep(sleep);
}

This calls the sample method and sleeps the required time afterward, to ensure that the sample method is called every interval (typically 10 ms).

The core sampling takes place in this sample method:

Thread.getAllStackTraces().forEach(
  (thread, stackTraceElements) -> {
    if (!thread.isDaemon()) { 
        // exclude daemon threads
        store.addSample(stackTraceElements);
    }
});

We use here the Thread::getAllStackTraces method to obtain the stack traces of all threads. This triggers a safepoint and is why this profiler is safepoint-biased. Taking the stack traces of a subset of threads would not make sense, as there is no method in the JDK for this. Calling Thread::getStackTrace on a subset of threads would trigger many safepoints, not just one, resulting in a more significant performance penalty than obtaining the traces for all threads.

The result of Thread::getAllStackTraces is filtered so that we don’t include daemon threads (like the Profiler thread or unused Fork-Join-Pool threads). We pass the appropriate traces to the Store, which deals with the post-processing.

Store Class

This is the last class of this profiler and also the by far most significant, post-processing, storing, and outputting of the collected information:

package me.bechberger;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

/**
 * store of the traces
 */
public class Store {

    /** too large and browsers can't display it anymore */
    private final int MAX_FLAMEGRAPH_DEPTH = 100;

    private static class Node {
        // ...
    }

    private final Optional<Path> flamePath;
    private final Map<String, Long> methodOnTopSampleCount = 
        new HashMap<>();
    private final Map<String, Long> methodSampleCount = 
        new HashMap<>();

    private long totalSampleCount = 0;

    /**
     * trace tree node, only populated if flamePath is present
     */
    private final Node rootNode = new Node("root");

    public Store(Optional<Path> flamePath) {
        this.flamePath = flamePath;
    }

    private String flattenStackTraceElement(
      StackTraceElement stackTraceElement) {
        // call intern to safe some memory
        return (stackTraceElement.getClassName() + "." + 
            stackTraceElement.getMethodName()).intern();
    }

    private void updateMethodTables(String method, boolean onTop) {
        methodSampleCount.put(method, 
            methodSampleCount.getOrDefault(method, 0L) + 1);
        if (onTop) {
            methodOnTopSampleCount.put(method, 
                methodOnTopSampleCount.getOrDefault(method, 0L) + 1);
        }
    }

    private void updateMethodTables(List<String> trace) {
        for (int i = 0; i < trace.size(); i++) {
            String method = trace.get(i);
            updateMethodTables(method, i == 0);
        }
    }

    public void addSample(StackTraceElement[] stackTraceElements) {
        List<String> trace = 
            Stream.of(stackTraceElements)
                   .map(this::flattenStackTraceElement)
                   .toList();
        updateMethodTables(trace);
        if (flamePath.isPresent()) {
            rootNode.addTrace(trace);
        }
        totalSampleCount++;
    }

    // the only reason this requires Java 17 :P
    private record MethodTableEntry(
        String method, 
        long sampleCount, 
        long onTopSampleCount) {
    }

    private void printMethodTable(PrintStream s, 
      List<MethodTableEntry> sortedEntries) {
        // ...
    }

    public void printMethodTable() {
        // sort methods by sample count
        // the print a table
        // ...
    }

    public void storeFlameGraphIfNeeded() {
        // ...
    }
}

The Profiler calls the addSample method which flattens the stack trace elements and stores them in the trace tree (for the flame graph) and counts the traces that any method is part of.

The interesting part is the trace tree modeled by the Node class. The idea is that every trace A -> B -> C (A calls B, B calls C, [C, B, A]) when returned by the JVM) can be represented as a root node with a child node A with child B with child C, so that every captured trace is a path from the root node to a leaf. We count how many times a node is part of the trace. This can then be used to output the tree data structure for d3-flame-graph which we use to create nice flamegraphs like:

Flame graph produced by the profiler for the renaissance dotty benchmark

Keep in my mind that the actual Node class is as follows:

private static class Node {                                                                                                                                                                                              
    private final String method;                                                                                                                                                                                         
    private final Map<String, Node> children = new HashMap<>();                                                                                                                                                          
    private long samples = 0;                                                                                                                                                                                            
                                                                                                                                                                                                                         
    public Node(String method) {                                                                                                                                                                                         
        this.method = method;                                                                                                                                                                                            
    }                                                                                                                                                                                                                    
                                                                                                                                                                                                                         
    private Node getChild(String method) {                                                                                                                                                                               
        return children.computeIfAbsent(method, Node::new);                                                                                                                                                              
    }                                                                                                                                                                                                                    
                                                                                                                                                                                                                         
    private void addTrace(List<String> trace, int end) {                                                                                                                                                                 
        samples++;                                                                                                                                                                                                       
        if (end > 0) {                                                                                                                                                                                      
            getChild(trace.get(end)).addTrace(trace, end - 1);                                                                                                                                                           
        }                                                                                                                                                                                                                
    }                                                                                                                                                                                                                    
                                                                                                                                                                                                                         
    public void addTrace(List<String> trace) {                                                                                                                                                                           
        addTrace(trace, trace.size() - 1);                                                                                                                                                                               
    }                                                                                                                                                                                                                    
                                                                                                                                                                                                                         
    /**                                                                                                                                                                                                                  
     * Write in d3-flamegraph format                                                                                                                                                                                     
     */                                                                                                                                                                                                                  
    private void writeAsJson(PrintStream s, int maxDepth) {                                                                                                                                                              
        s.printf("{ \"name\": \"%s\", \"value\": %d, \"children\": [", 
                 method, samples);                                                                                                                                 
        if (maxDepth > 1) {                                                                                                                                                                                              
            for (Node child : children.values()) {                                                                                                                                                                       
                child.writeAsJson(s, maxDepth - 1);                                                                                                                                                                      
                s.print(",");                                                                                                                                                                                            
            }                                                                                                                                                                                                            
        }                                                                                                                                                                                                                
        s.print("]}");                                                                                                                                                                                                   
    }                                                                                                                                                                                                                    
                                                                                                                                                                                                                         
    public void writeAsHTML(PrintStream s, int maxDepth) {                                                                                                                                                               
        s.print("""                                                                                                                                                                                                      
                <head>                                                                                                                                                                                                   
                  <link rel="stylesheet" 
                   type="text/css" 
                   href="https://cdn.jsdelivr.net/npm/d3-flame-graph@4.1.3/dist/d3-flamegraph.css">                                                                                
                </head>                                                                                                                                                                                                  
                <body>                                                                                                                                                                                                   
                  <div id="chart"></div>                                                                                                                                                                                 
                  <script type="text/javascript" 
                   src="https://d3js.org/d3.v7.js"></script>                                                                                                                               
                  <script type="text/javascript" 
                   src="https://cdn.jsdelivr.net/npm/d3-flame-graph@4.1.3/dist/d3-flamegraph.min.js"></script>                                                                             
                  <script type="text/javascript">                                                                                                                                                                        
                  var chart = flamegraph().width(window.innerWidth);                                                                                                                                                     
                  d3.select("#chart").datum(""");                                                                                                                                                                        
        writeAsJson(s, maxDepth);                                                                                                                                                                                        
        s.print("""                                                                                                                                                                                                      
                ).call(chart);                                                                                                                                                                                           
                  window.onresize = 
                      () => chart.width(window.innerWidth);                                                                                                                                                
                  </script>                                                                                                                                                                                              
                </body>                                                                                                                                                                                                  
                """);                                                                                                                                                                                                    
    }                                                                                                                                                                                                                    
}

Tiny-Profiler

I named the final profiler tiny-profiler and its sources are on GitHub (MIT licensed). The profiler should work on any platform with a JDK 17 or newer. The usage is fairly simple:

# build it
mvn package

# run your program and print the table of methods sorted by their sample count
# and the flame graph, taking a sample every 10ms
java -javaagent:target/tiny-profiler.jar=flamegraph=flame.html ...

You can easily run it on the renaissance benchmark and create the flame graph shown earlier:

# download a benchmark
> test -e renaissance.jar || wget https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.14.2/renaissance-gpl-0.14.2.jar -O renaissance.jar

> java -javaagent:./target/tiny_profiler.jar=flamegraph=flame.html -jar renaissance.jar dotty
...
===== method table ======
Total samples: 11217
Method                                      Samples Percentage  On top Percentage
dotty.tools.dotc.typer.Typer.typed            59499     530.44       2       0.02
dotty.tools.dotc.typer.Typer.typedUnadapted   31050     276.81       7       0.06
scala.runtime.function.JProcedure1.apply      24283     216.48      13       0.12
dotty.tools.dotc.Driver.process               19012     169.49       0       0.00
dotty.tools.dotc.typer.Typer.typedUnnamed$1   18774     167.37       7       0.06
dotty.tools.dotc.typer.Typer.typedExpr        18072     161.11       0       0.00
scala.collection.immutable.List.foreach       16271     145.06       3       0.03
...

The overhead for this example is around 2% on my MacBook Pro 13″ for a 10ms interval, which makes the profiler usable when you ignore the safepoint-bias.

Conclusion

Writing a Java profiler in 240 lines of pure Java is possible and the resulting profiler could even be used to analyze performance problems. This profiler is not designed to replace real profilers like async-profiler, but it demystifies the inner workings of simple profilers.

I hope you enjoyed this code-heavy blog post. As always I’m happy for any feedback, issue, or PR. Come back next week for my next blog post on profiling.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone. Significant parts of this post have been written below the English channel…

Writing a Profiler from Scratch: The Profiling Loop

Posted on January 6, 2023 by Johannes Bechberger

This is the second post in the series, building a profiler from scratch using AsyncGetCallTrace: Today, we’re covering wall-clock profiling and how to collect the obtain stack traces. If you’re unfamiliar with AsyncGetCallTrace, please check out my previous article in the series here.

Our goal today is to essentially write the primary loop of a profiler and do the following every n milliseconds (where n is our chosen profiling interval):

The question that comes up is: Which threads should we sample? If we only sample currently running threads, then we have cpu-time profiling, but if we sample all threads (running and idle), we have the far more popular wall-clock profiling. The difference between both types is best described by Richard Warburton and Sadiq Jaffer in their article on opsian.com:

CPU Time. There’s very little point in trying to measure what’s happening off-cpu if you’ve got a single very hot thread, as many event-loop style applications have. You may see plenty of time being spent in other threads that are waiting around performing background work but aren’t actually a bottleneck in terms of your application. So CPU Time is useful when you know you’re bottlenecked on a CPU problem.

Wallclock Time is useful for problems where CPU utilisation seems low but application throughput or latency doesn’t meet expectations. This is because threads are probably blocked on something, for example lock contention, times when you’re waiting for downstream blocking network IO to happen or running with too many threads so some of them don’t get scheduled. The trigger for looking at Wallclock time would be low CPU usage during your period of performance issues.
Solving JVM Performance Problems with Profilers: Wallclock vs CPU Time Edition

The choice of the sampling mode, therefore, depends on the actual use case.

How do we obtain the list of threads in the first place? We could use the aptly named GetAllThreads method, but we cannot use the returned thread objects to quickly get the thread ids we need for sending the threads signals. There is another possibility: Registering for ThreadStart and ThreadEnd events and collecting the threads as they are created:

void JNICALL
ThreadStart(jvmtiEnv *jvmti_env,
            JNIEnv* jni_env,
            jthread thread)
Thread start events are generated by a new thread before its initial method executes. A thread may be listed in the array returned by GetAllThreads before its thread start event is generated. It is possible for other events to be generated on a thread before its thread start event. The event is sent on the newly started thread.
JVMTI documentation

We create a class called ThreadMap for this purpose (stored in a global variable): It maps thread ids to the internal ids (because thread ids might be reused during the lifetime of a process), and the jthread (for JVMTI operations) and the pthread (for sending signals) and maps these internal ids in turn to names, so we can later display them. It has to be thread-safe, as we’re accessing it from multiple threads in parallel. So we have to create a lock for its internal data structures. This is quite simple with modern C++:

struct ValidThreadInfo {
    jthread jthread;
    pthread_t pthread;
    bool is_running;
    long id;
};

class ThreadMap {
  std::recursive_mutex m;
  std::unordered_map<pid_t, ValidThreadInfo> map;
  std::vector<std::string> names;

  public:

    void add(pid_t pid, std::string name, jthread thread) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map[pid] = ValidThreadInfo{.jthread = thread, .pthread = pthread_self(), .id = (long)names.size()};
      names.emplace_back(name);
    }

    void remove(pid_t pid) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map.erase(pid);
    }
};

Obtaining the thread id for the current thread leads to our first platform-dependent code, using a syscall on Linux and a non-posix pthread method on mac:

pid_t get_thread_id() {
  #if defined(__APPLE__) && defined(__MACH__)
  uint64_t tid;
  pthread_threadid_np(NULL, &tid);
  return (pid_t) tid;
  #else
  return syscall(SYS_gettid);
  #endif
}

Yes, we could quickly implement our profiler for other BSDs like FreeBSD, but MacOS is the only one I have at hand.

Keep in mind to also add the main thread to the collection by calling the OnThreadStart method in the OnVMInit method, which handles the start of the VM.

Now we have an up-to-date list of all threads. For wall-clock profiling, this is fine, but for cpu-time profiling, we have to obtain a list of running threads. We conveniently stored the jthread objects in our thread map, so we can now use the GetThreadState method:

bool is_thread_running(jthread thread) {
  jint state;
  auto err = jvmti->GetThreadState(thread, &state);
  // handle errors as not runnable
  return err == 0 && state != JVMTI_THREAD_STATE_RUNNABLE;
}

This leads us to a list of threads that we can sample. It is usually not a good idea to sample all available threads: With wall-clock profiling, the list of threads can be so large that sampling all threads is too costly. Therefore one typically takes a random subset of the list of available threads. Async-profiler, for example, takes 8, which we use too.

Taking a random subset is quite cumbersome in C, but C++ has a neat library function since C++11: std:shuffle. We can use it to implement the random selection:

std::vector<pid_t> get_shuffled_threads() {
  const std::lock_guard<std::recursive_mutex> lock(m);
  std::vector<pid_t> threads = get_all_threads();
  std::random_device rd;
  std::mt19937 g(rd());
  std::shuffle(threads.begin(), threads.end(), g);
  return std::vector(threads.begin(), threads.begin() + std::min(MAX_THREADS_PER_ITERATION, (int)threads.size()));
}

Be aware that we had to change the mutex to a mutex which allows recursive reentrance, as get_all_threads also acquires the mutex.

The next step is creating a sampling thread that runs the loop I described at the beginning of this article. This superseedes the itimer sampler from the previous post, as the mechanism does not support sending signals to a random subset only. The sampling thread is started in the OnVMInit handler and joined the OnAgentUnload handler. Its implementation is rather simple:

std::atomic<bool> shouldStop = false;
std::thread samplerThread;

static void sampleLoop() {
  initSampler();
  std::chrono::nanoseconds interval{interval_ns};
  while (!shouldStop) {
    auto start = std::chrono::system_clock::now();
    sampleThreads();
    auto duration = std::chrono::system_clock::now() - start;
    // takes into account that sampleThreads() takes some time
    auto sleep = interval - duration;
    if (std::chrono::seconds::zero() < sleep) {
      std::this_thread::sleep_for(sleep);
    }
  }
  endSampler(); // print the collected information
}

static void startSamplerThread() {
  samplerThread = std::thread(sampleLoop);
}

The initSampler function preallocates the traces for N=8 threads, as we decided only to sample as many at every loop. The signal handlers can later use these preallocated traces to call AsyncGetCallTrace:

const int MAX_DEPTH = 512; // max number of frames to capture

static ASGCT_CallFrame global_frames[MAX_DEPTH * MAX_THREADS_PER_ITERATION];
static ASGCT_CallTrace global_traces[MAX_THREADS_PER_ITERATION];


static void initSampler() {
  for (int i = 0; i < MAX_THREADS_PER_ITERATION; i++) {
    global_traces[i].frames = global_frames + i * MAX_DEPTH;
    global_traces[i].num_frames = 0;
    global_traces[i].env_id = env;
  }
}

The sampleThreads the function sends the signals to each thread using pthread_signal. This is why store the pthread for every thread. But how does a signal handler know which of the traces to use? And how does the sampleThreads function know when all signal handlers are finished?

It uses two atomic integer variables for this purpose:

available_trace: the next unused available trace index in the global_traces array
stored_traces: the number of traces already stored

The signal handler is then quite simple:

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  asgct(&global_traces[available_trace++], MAX_DEPTH, ucontext);
  stored_traces++;
}

The sampleThreads function has to only wait till stored_traces is as expected:

static void sampleThreads() {
  // reset the global variables
  available_trace = 0;
  stored_traces = 0;
  // get the threads to sample
  auto threads = thread_map.get_shuffled_threads();
  for (pid_t thread : threads) {
    auto info = thread_map.get_info(thread);
    if (info) {
      // send a signal to the thread
      pthread_kill(info->pthread, SIGPROF);
    }
  }
  // wait till all handlers obtained there traces
  while (stored_traces < threads.size());
  // process the traces
  processTraces(threads.size());
}

We keep the processing of the traces from the last blog post and just store the total number of traces as well as the number of failed traces:

static void processTraces(size_t num_threads) {
  for (int i = 0; i < num_threads; i++) {
    auto& trace = global_traces[i];
    if (trace.num_frames <= 0) {
      failedTraces++;
    }
    totalTraces++;
  }
}

The two global variables don’t have to be atomic anymore, as only the sampler thread modifies them. A regular profiler would of course use the traces to obtain information on the run methods, but this is a topic for the next blog post in this series, so stay tuned. You can find the source code for the project on GitHub.

This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Writing a Profiler from Scratch: Introduction

Posted on December 20, 2022 by Johannes Bechberger

Async-profiler is undoubtedly one of the most used open-source Java profilers out there. Its most commonly used feature is sampling a given Java program’s stack traces and visualizing them as a flame graph. I would recommend reading the excellent async-profiler manual by Krzysztof Ślusarski or taking a look at my profiling playlist on YouTube if you’re new to this tool. After using the async-profiler for a while, you might wonder: How does this tool work? And, of course, if you’re someone like me: Could I write a stripped-down version to learn how it really works? I did this a few years back in university when I wanted to understand how LR(1) parser generators work: Turns out it is a lot of work, but my parser generator can output GIFs.

This blog series will take you with me on this journey, where we start with a small tool and see where curiosity leads us. It’s helpful if you know some C/C++ and have profiled Java using async-profiler before, which helps you better understand my code snippets and the profiling Jargon.

All the code will present on GitHub, and I’m happy for any suggestions, bug reports, and comments either under this blog post or on GitHub.

The aim of this series is

to understand how a sampling profiler works
to write a simple profiler, and possibly related tools, based on the AsyncGetCallTrace API
- and later on, my proposed AsyncGetStackTrace API
to have some good tooling which I use in OpenJDK tests, to make the used APIs safer and more reliable

The aim is not to write anything vaguely production ready: async-profiler has too many features to count and is battle-tested.

Come with me on this journey of knowledge, starting with a simple tool that just counts how many times AsyncGetCallTrace could not trace the stack.

The fundamental API that we will be relying on is the AsyncGetCallTrace API of the OpenJDK: This API can be used to obtain the stack trace (which methods are on the stack) for a given thread at any time. This makes it quite useful for writing accurate profilers. The API was introduced in November 2002 for Sun Studio. However, Sun removed it in January 2003 and demoted it to an internal API, but most profiling applications still use it. The following part of this post is inspired by Nitsan Wakart’s post, The Pros and Cons of AsyncGetCallTrace Profilers.

The AsyncGetCallTrace API is defined in the forte.cpp file:

typedef struct {
  jint lineno;         // BCI in the source file, or < 0 for native methods
  jmethodID method_id; // method executed in this frame
} ASGCT_CallFrame;

typedef struct {
  JNIEnv *env_id;   // Env where trace was recorded
  jint num_frames;  // number of frames in this trace, < 0 gives us an error code
  ASGCT_CallFrame *frames; // recorded frames 
} ASGCT_CallTrace; 

void AsyncGetCallTrace(ASGCT_CallTrace *trace, // pre-allocated trace to fill
                       jint depth,             // max number of frames to walk
                       void* ucontext);        // signal context

One typically uses this API by pinging a thread using a signal, stopping the thread, and invoking the signal handler, which in turn calls AsyncGetCallTrace with the execution context of the stopped thread (the ucontext) so that AsyncGetCallTrace can walk the thread, skipping all C/C++ frames on the stack and only storing the native (from native methods) and Java frames in the frames array. The signal handler has to process the trace, but this is for another post. We will just store the number of times that AsyncGetCallTrace was successful and unsuccessful.

Be aware that we cannot allocate any memory outside the stack in a signal handler. For this reason, we have to preallocate the data structures for AsyncGetCallTrace. There are a few C library functions that are guaranteed to be signal safe. To learn more about signals, consider reading Introduction To Unix Signals Programming, or similar sources. But don’t be confused with the terms thread and process. In Unix (Mac OS, Linux, BSD), threads are processes that share the same address space. Every (non-virtual/Loom) thread in Java is backed by an OS thread.

As an example of calling AsyncGetCallTrace, consider profiling the following Java code:

public class BasicSample {

    public void waitForever() throws InterruptedException {
        System.out.print("Waiting forever...");
        for (int i = 0; i < 100; i++) {
            Thread.sleep(10);
            System.out.print(".");
        }
        System.out.println("done");
    }

    public static void main(String[] args) throws InterruptedException {
        new BasicSample().waitForever();
    }
}

During profiling, we call AsyncGetCallTrace often, but let’s visualize a trace when the JVM runs one of the println lines.

AsyncGetCallTrace on a small example, using the demo code for JEP 435

Interrupting a thread at any point and running some code can have stability implications. If you’re interested in these when using AsyncGetCallTrace, head over to the async-profiler manual, where I co-wrote the section on this topic. This small sample tool seems to be quite good at triggering a specific fault in JDK 19+36, run the tool of this blog post yourself to find it.

The signal handler in our small example is the following:

std::atomic<size_t> failedTraces = 0;
std::atomic<size_t> totalTraces = 0;

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  const int MAX_DEPTH = 512; // max number of frames to capture
  static ASGCT_CallFrame frames[MAX_DEPTH];
  ASGCT_CallTrace trace;
  trace.frames = frames;
  trace.num_frames = 0;
  trace.env_id = env; // we obtained this via the OnVMInit hook
  
  // call AsyncGetCallTrace
  asgct(&trace, MAX_DEPTH, ucontext);

  // process the results
  totalTraces++;
  if (trace.num_frames < 0) {
    failedTraces++;
  }
}

We use atomic variables here to increment the two counting variables in parallel. We cannot use any locks as creating locks is not signal-safe.

You see in line 13 that we cannot call AsyncGetCallTrace directly, as it is not exported in any JVM header. So we have to obtain the pointer to this function via dlsym at the beginning, which is a bit brittle:

static void initASGCT() {
  asgct = reinterpret_cast<ASGCTType>(dlsym(RTLD_DEFAULT, "AsyncGetCallTrace"));
  if (asgct == NULL) {
    fprintf(stderr, "=== ASGCT not found ===\n");
    exit(1);
  }
}

Additionally, we have to copy the declarations for ASGCT_CallFrame, ASGCT_CallTrace, and AsyncGetCallTrace into our project.

After writing a signal handler, we must use some mechanism to create signals. There are multiple ways, like perf or using a thread that signals all threads every few milliseconds, but we’ll use the most straightforward option which is a timer:

static bool startITimerSampler() {
  time_t sec = interval_ns / 1000000000;
  suseconds_t usec = (interval_ns % 1000000000) / 1000;
  struct itimerval tv = {{sec, usec}, {sec, usec}};

  // ...

  if (setitimer(ITIMER_PROF, &tv, NULL) != 0) {
    return false;
  }
  return true;
}

Our code uses the timers in PROF mode: “A profiling timer that counts both processor time used by the process, and processor time spent in system calls on behalf of the process. This timer sends a SIGPROF signal to the process when it expires.” (see gnu.org) The result is roughly similar to the CPU event of the async-profiler. It is inaccurate, but we’ll tackle this problem in another blog post.

You can find the final code in the GitHub repo as libSmallProfiler.cpp. It includes all the boiler-plate code for JVMTI agents that I omitted in this blog post for brevity. Feel free to file issues or PRs with improvements or suggestions there. When we finally run the tool with a JVM and the example Java program, we get the following output via java -agentpath:cpp/libSmallProfiler.dylib=interval=0.001s -cp samples BasicSample:

Waiting forever.......................................................................................................done
Failed traces:          5
Total traces:          15
Failed ratio:       33.33%

This tool might seem to be rather useless, but one can adjust its sampling interval by specifying the interval option: This makes it quite helpful in testing the stack walking code of AsyncGetCallTrace.

I hope I didn’t frighten you too much with all this Unix C, and hopefully see you again in around two weeks for my next blog post in which we create a tool that outputs a method list.

The code of this blog post is based on the libAsyncGetCallTraceTest.cpp and the libAsyncGetStackTraceSampler.cpp of the OpenJDK. This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Mostly nerdless

Every two weeks a text on profiling, debugging or eBPF

Tag Archives: writing a profiler from scratch