Couldn’t we just Use AsyncGetCallTrace in a Separate Thread?

I’m keenly interested in everything related to profiling on the JVM, especially if it is related to AsyncGetCallTrace, this tiny unofficial API that powers most profilers out there; heck, I’m even in the process of adding an improved version, AsyncGetStackTrace, to the OpenJDK.

During the discussions on the related JDK enhancement proposal and the PRs fixing AsyncGetCallTrace bugs, one question often arises: Why is AsyncGetCallTrace always called in the signal handler, on top of the stack that we want to walk (like in my Writing a Profiler from Scratch series)?

Interaction between the wall-clock sampler thread and the different signal handlers, as currently implemented in async-profiler.

JDK Flight Recorder (JFR) does not do this; it instead walks the stack in the sampler thread while pausing the sampled thread (implemented with a SuspendedThreadTask).

Interaction between the sampler thread and the signal handlers, as currently implemented in JFR.

Update after talks on the JEP: The recommended way to use AsyncGetStackTrace will be to call it in a separate thread.

Advantages

Walking the stack in a sampler thread has multiple advantages: Only a few instructions run in the signal handler: the handler is either just busy waiting for the stack walking to finish, or the thread is stopped entirely. Most of the code runs in the sampler thread, walking one thread after another. This makes the code easier to debug and reason about, and the stack-walking code is less likely to mess up the stack of the sampled thread when something goes terribly wrong. This is part of the reason why the JFR code silently ignores segmentation faults during stack walking:

One important difference to consider is that in JFR, in contrast to AGCT, there is only a single thread, the ThreadSampler thread, that is wrapped in the CrashProtection. Stack walking is different in JFR compared to AGCT, in that it is done by a different thread, during a point where the target is suspended. Originally, this thread sampler thread was not even part of the VM, although now it is a NonJavaThread. It has been trimmed to not involve malloc(), raii, and other hard-to-recover-from constructs, from the moment it has another thread suspended. Over the years, some transitive malloc() calls has snuck in, but it was eventually found due to rare deadlocking. Thomas brings a good point about crashes needing to be recoverable.

Markus Grönlund in a comment on OpenJDK PR 8225

I digress here from the main topic of this article, but I think that the next comment of Markus Grönlund on the PR is interesting because it shows how pressures from the outside can lead to band-aid fixes that are never removed:

For additional context, I should add that the CrashProtection mechanism was mainly put in place as a result of having to deliver JFR from JRockit into Hotspot under a deadline, upholding feature-parity. The stack walking code was in really bad shape back then. Over the years, it has been hardened and improved much, and I have not seen any reported issues about JFR crashes in many years (we log when crashing in production).

An important difference is that AGCT allows more thread states compared to JFR, so there can be issues in that area that are not seen in JFR.

Markus Grönlund in a comment on OpenJDK PR 8225

Back to the main topic: It is important to note that even when we walk a thread in a separate thread, we still have to make sure that we only use signal-safe methods while the sampled thread is waiting (thanks to Lukas Werling for pointing this out). The sampled thread might, for example, hold locks for malloc, so our sampler thread cannot use malloc without risking a deadlock.
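
To make the constraint concrete, here is a hedged illustration (not actual async-profiler or JFR code; pause_thread, resume_thread and walk_stack are placeholders):

static const int MAX_DEPTH = 256;
// pre-allocated outside the critical section, so no allocation is needed
// while the target is paused
static ASGCT_CallFrame frames[MAX_DEPTH];

void sample_one_thread(int target_tid) {
    pause_thread(target_tid);   // the target might be stopped inside malloc()
    // BAD: malloc() here may block forever on a lock held by the paused thread
    // char* scratch = (char*) malloc(1024);
    walk_stack(target_tid, frames, MAX_DEPTH);  // OK: only pre-allocated memory
    resume_thread(target_tid);  // only now is it safe to allocate, log, ...
}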

Disadvantages

There are, of course, disadvantages: Sampling in a signal handler is more straightforward, as we’re running in the context of the sampled thread and get passed the ucontext (with stack pointer, …) directly. It is more accurate, as we can trigger the sampling of the threads precisely at the time that we want (disregarding thread scheduling), and faster, as we do not busy wait in any thread.

We’re also running on the same CPU core, which benefits caching, especially on NUMA CPUs (thanks to Francesco Nigro for pointing this out). That said, performance is rarely an issue with the stack walking itself, as its runtime is in the tens of microseconds, even if we include the whole signal processing.

Another major disadvantage is related to CPU-time and perf-event-related profiling: The commonly used itimer API (which has major problems, according to Felix Geisendörfer) and the perf API send signals to threads at certain intervals. When we walk the stack in a separate thread, the triggered signal handlers must ask the sampler thread to sample the specific thread.

This can be implemented either by pushing the current thread id onto a queue, with the sampler thread stopping the sampled thread when it is ready and walking the stack as before (see the sketch below), or by waiting in the signal handler until the sampler thread has finished walking the stack. The former is less performant because it sends an additional signal; the latter is only significant if the walk requests of all threads are evenly distributed.
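
A hedged sketch of the queue variant (all names and sizes here are assumptions; colliding writes simply drop a sample, which is acceptable for a sampling profiler):

#include <atomic>
#include <sys/syscall.h>
#include <unistd.h>

static const int QUEUE_SIZE = 1024;
static std::atomic<int> pending[QUEUE_SIZE];     // 0 means "empty slot"
static std::atomic<size_t> write_index{0};

// called in the itimer/perf signal handler of the sampled thread:
// it only enqueues its own thread id, which is async-signal-safe
void enqueue_current_thread() {
    size_t slot = write_index.fetch_add(1) % QUEUE_SIZE;
    pending[slot].store((int) syscall(SYS_gettid));
}

// called in a loop by the sampler thread: drain the queue and sample
// each recorded thread with the wall-clock technique shown later
void drain_queue() {
    for (int i = 0; i < QUEUE_SIZE; i++) {
        int tid = pending[i].exchange(0);
        if (tid != 0) {
            // walkStack(tid);  // stop the thread and walk its stack
        }
    }
}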

This problem can be lessened when we choose a different way of accessing the perf data: We can read the perf events in a loop and then just use the technique from wall-clock profiling. This, however, is a significant modification of the profiler’s inner workings, and it is not possible with itimer-based profiling.

What is the real reason?

Walking in a separate thread has more advantages than disadvantages, especially for wall-clock profiling or when valuing stability over slight performance gains. So why don’t tools like async-profiler implement their sampling this way? It’s because AsyncGetCallTrace currently doesn’t support it. This is the starting point of my small experiment: Could I modify the OpenJDK with just a few changes to add support for out-of-thread walking with AsyncGetCallTrace (subsequently proposing this for AsyncGetStackTrace too)?

Modifying AsyncGetCallTrace

Let us first take a look at the API to refresh our knowledge:

void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
                       void* ucontext)
// Arguments:
//
//   trace    - trace data structure to be filled by the VM.
//   depth    - depth of the call stack trace.
//   ucontext - ucontext_t of the LWP
//
// ASGCT_CallTrace:
//   typedef struct {
//       JNIEnv *env_id;
//       jint num_frames;
//       ASGCT_CallFrame *frames;
//   } ASGCT_CallTrace;
//
// Fields:
//   env_id     - ID of thread which executed this trace.
//   num_frames - number of frames in the trace.
//                (< 0 indicates the frame is not walkable).
//   frames     - the ASGCT_CallFrames that make up this trace. 
//                Callee followed by callers.
//
//  ASGCT_CallFrame:
//    typedef struct {
//        jint lineno;
//        jmethodID method_id;
//    } ASGCT_CallFrame;

If you’re new to AsyncGetCallTrace (and my blog), consider reading my Writing a Profiler from Scratch: Introduction article.

So we already pass an identifier of the current thread (env_id) to the API, which should point to the walked thread:

// This is safe now as the thread has not terminated 
// and so no VM exit check occurs.
assert(thread == 
         JavaThread::thread_from_jni_environment(trace->env_id),
       "AsyncGetCallTrace must be called by "
       "the current interrupted thread");
 

This is the only usage of the passed thread identifier, and why I considered removing it in AsyncGetStackTrace altogether. AsyncGetCallTrace uses the current thread instead:

Thread* raw_thread = Thread::current_or_null_safe();

The assertion above is only enabled in debug builds of the OpenJDK, which are rarely profiled. Therefore, the thread identifier is often ignored and is probably a historic relic. We can use this identifier to obtain the thread that the API user wants to profile and only use the current thread when the thread identifier is null (source):

Thread* raw_thread;
if (trace->env_id == nullptr) {
  raw_thread = Thread::current_or_null_safe();
} else {
  raw_thread = 
    JavaThread::thread_from_jni_environment_raw(trace->env_id);
}

We can thereby support the new feature without modifying the API itself, only changing the behavior if the thread identifier does not reference the current thread.
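
A hypothetical call from the sampler thread could then look like this (target_env and target_ucontext are assumed to have been captured in the target thread’s signal handler, as in the async-profiler modification below):

const int MAX_DEPTH = 256;
ASGCT_CallFrame frames[MAX_DEPTH];
ASGCT_CallTrace trace;
trace.env_id = target_env;    // JNIEnv of the sampled thread, not of the caller
trace.frames = frames;
trace.num_frames = 0;
// runs on the sampler thread while the sampled thread waits in its handler
AsyncGetCallTrace(&trace, MAX_DEPTH, target_ucontext);
if (trace.num_frames > 0) {
    // process frames[0 .. trace.num_frames - 1], callee first
}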

The implementation can be found in my OpenJDK fork. This is still a prototype, but it works well enough for testing and benchmarking.

Modifying async-profiler

At the beginning of the article, I already told you how JFR walks the stack in a different thread. We now implement similar code in async-profiler, restricting ourselves to wall-clock profiling, as its implementation requires fewer modifications.

Before our changes, async-profiler would signal selected threads in a loop via

OS::sendSignalToThread(thread_id, SIGVTALRM)

(source) and record the sample directly in the signal handler (source):

void WallClock::signalHandler(
  int signo, 
  siginfo_t* siginfo, 
  void* ucontext) {
    ExecutionEvent event;
    event._thread_state = _sample_idle_threads ? 
      getThreadState(ucontext) : THREAD_UNKNOWN;
    Profiler::instance()->recordSample(ucontext, _interval, 
                                       EXECUTION_SAMPLE, &event);
}

The Profiler::recordSample method does more than just call AsyncGetCallTrace; it also obtains C/C++ frames. However, this is insignificant for our modifications, as the additional stack walking is only related to the ucontext, not the thread.

We now modify this code so that we still send a signal to the sampled thread, but the signal handler only sets a global ucontext and JNIEnv (struct Data) and blocks until the sampler thread has finished walking the stack; the actual stack walking happens in the sampler thread (source):

struct Data {
    void* ucontext;
    JNIEnv* jni;
};

std::atomic<int> _thread_id;
std::atomic<Data*> _thread_data;

bool WallClock::walkStack(int thread_id) {
    // set the current thread
    _thread_id = thread_id;
    _thread_data = nullptr;

    // send the signal to the sampled thread
    if (!OS::sendSignalToThread(thread_id, SIGVTALRM)) {
        _thread_id = -1;
        return false;
    }
    // wait till the signal handler has set the ucontext and jni
    if (!waitWhile([&](){ return _thread_data == nullptr;}, 
                   10 * 1000 * 1000)) {
        _thread_id = -1;
        return false;
    }
    Data *data = _thread_data.load();
    // walk the stack
    ExecutionEvent event;
    event._thread_state = _sample_idle_threads ?
      getThreadState(data->ucontext) : THREAD_UNKNOWN;
    u64 ret = Profiler::instance()->recordSample(data->ucontext,
      _interval, EXECUTION_SAMPLE, &event, data->jni);

    // reset the thread_data, unblocking the signal handler
    _thread_data = nullptr;
    return ret != 0;
}

void WallClock::signalHandler(
  int signo,
  siginfo_t* siginfo,
  void* ucontext) {
    // check that we are in the thread we are supposed to be
    if (OS::threadId() != _thread_id) {
        return;
    }
    
    Data data{
       ucontext,
       // Get a JNIEnv if it is deemed to be safe
       VMThread::current() == nullptr ? nullptr : VM::jni()
    };

    Data* expected = nullptr;
    if (!_thread_data.compare_exchange_strong(expected, &data)) {
        // another signal handler invocation 
        // is already in progress
        return;
    }
    // wait for the stack to be walked, and block the thread 
    // from executing
    // we do not timeout here, as this leads to difficult bugs
    waitWhile([&](){ return _thread_data != nullptr;});
}

The signal handler only stores the ucontext and JNIEnv if it runs in the thread currently being walked, and it uses compare_exchange_strong to ensure that _thread_data is only set once. This prevents stalled signal handlers from concurrently modifying the global variables.

_thread_data.compare_exchange_strong(expected, &data) is equivalent to atomically executing:

if (_thread_data == expected) {
    _thread_data = &data;
    return true;
} else {
    expected = _thread_data;
    return false;
}

This ensures that the _thread_data is only set if it is null. Such operations are the base of many lock-free data structures; you can find more on this topic in the Wikipedia article on Compare-and-Swap (a synonym for compare-and-exchange).

Coming back to the signal handler implementation: The waitWhile method is a helper that busy waits until the passed predicate returns false or the optional timeout is exhausted, ensuring that the profiler does not hang if something goes wrong.
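
The fork contains the real implementation; a minimal sketch of such a helper could look like the following (the timeout unit, nanoseconds, is an assumption):

#include <chrono>

// Busy-wait until the predicate returns false or the optional timeout
// (-1 = no timeout) expires; returns false on timeout.
template <typename Predicate>
bool waitWhile(Predicate predicate, long timeout_ns = -1) {
    auto start = std::chrono::steady_clock::now();
    while (predicate()) {
        if (timeout_ns >= 0 &&
            std::chrono::steady_clock::now() - start >
                std::chrono::nanoseconds(timeout_ns)) {
            return false;
        }
    }
    return true;
}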

The implementation uses the _thread_data variable to implement its synchronization protocol:

Interaction between the sampler thread and the signal handler.

You can find the implementation in my async-profiler fork, but as with my OpenJDK fork: It’s only a rough implementation.

The implemented approach works fine with async-profiler, but it has a minor flaw: We depend on an implementation detail of the current iteration of OpenJDK. It is only safe to get the JNIEnv in a signal handler if the JVM has allocated a thread-local Thread object for the signaled thread:

JDK-8132510: it’s not safe to call GetEnv() inside a signal handler since JDK 9, so we do it only for threads already registered in ThreadLocalStorage

async-profiler source code

This issue was first discovered when Krzysztof Ślusarski (of “Async-Profiler – manual by use cases” fame) reported a related issue in the async-profiler bug tracker.

For a deeper dive, consider reading the comment of David Holmes on the referenced JDK issue:

The code underpinning __thread use is not async-signal-safe, which is not really a surprise as pthread_get/setspecific are not designated async-signal-safe either.

The problem, in glibc, is that first access of a TLS variable can trigger allocation [1]. This contrasts with using pthread_getspecific which is benign and so effectively async-signal-safe.

So if a thread is executing in malloc and it takes a signal, and the signal handler tries to use TLS (it shouldn’t but it does and has gotten away with it with pthread_getspecific), then we can crash or get a deadlock.

Excerpt from David Holmes’ comment on issue JDK-8132510

We check this condition in our signal handler implementation with the line

VMThread::current() == nullptr ? nullptr : VM::jni()

with VMThread::current() being implemented as:

VMThread* VMThread::current() {
    return (VMThread*)pthread_getspecific(
      (pthread_key_t)_tls_index /* -1 */);
}

This implementation detail is not an issue for async-profiler, as it may make such assumptions. Still, it is undoubtedly a problem for the general approach I want to propose for my new AsyncGetStackTrace API.

Modifying AsyncGetCallTrace (2nd approach)

We want to identify the thread using something different from JNIEnv. The OS thread id seems to be a good fit. It has three significant advantages:

  • It can be obtained independently of the JVM, depending on the OS rather than the JVM.
  • Our walkStack method already gets passed the thread id, so we don’t have to pass it from the signal handler to the sampler thread.
  • The mapping from thread id to Thread happens outside the signal handler in the AsyncGetCallTrace call, and the API sets the env_id field to the appropriate JNIEnv.

We have to add a new parameter os_thread_id to the API to facilitate this change (source):

// ...
//   os_thread_id - OS thread id of the thread which executed 
//                  this trace, or -1 if the current thread 
//                  should be used.
// ...
// Fields:
//   env_id     - ID of thread which executed this trace, 
//                the API sets this field if it is NULL.
// ...
void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
  void* ucontext, jlong os_thread_id)

The implementation can be found in my OpenJDK fork, but be aware that it is not yet optimized for performance as it iterates over the whole thread list for every call to find the Thread which matches the passed OS thread id.
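
To give an idea of what this lookup involves, here is a rough sketch; treat the HotSpot names (JavaThreadIteratorWithHandle, osthread()->thread_id()) as assumptions about current internals rather than as the actual patch:

// Linear scan over the thread list; a production version would need a
// faster lookup, e.g. a map from OS thread id to JavaThread.
static JavaThread* find_thread_by_os_id(jlong os_thread_id) {
  for (JavaThreadIteratorWithHandle jtiwh; JavaThread* thread = jtiwh.next(); ) {
    if ((jlong) thread->osthread()->thread_id() == os_thread_id) {
      return thread;
    }
  }
  return nullptr;  // the caller then reports an unwalkable trace
}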

Modifying async-profiler (2nd approach)

The modification to async-profiler is quite similar to the first approach. The only difference is that we’re not dealing with JNIEnv anymore. This makes the signal handler implementation slightly simpler (source):

void WallClock::signalHandler(
  int signo, 
  siginfo_t* siginfo, 
  void* ucontext) {
    // check that we are in the thread we are supposed to be
    if (OS::threadId() != _thread_id) {
        return;
    }
    void* expected = nullptr;
    if (!_ucontext.compare_exchange_strong(expected, ucontext)) {
        // another signal handler invocation 
        // is already in progress
        return;
    }
    // wait for the stack to be walked, and block the thread 
    // from executing
    // we do not timeout here, as this leads to difficult bugs
    waitWhile([&](){ return _ucontext != nullptr;});
}

You can find the full implementation in my async-profiler fork.

Now to the fun part (the experiment): Two drawbacks of the two previously discussed approaches are that one thread busy waits, and the other cannot execute any non-signal-safe code during that period. So the obvious next question is:

Could we walk a thread without stopping it?

In other words: Could we omit the busy waiting? An unnamed person suggested this.

The short answer is: It’s a terrible idea, at least if you don’t take many precautions. The sampled thread modifies the stack while we’re walking it and might even terminate while we’re in the middle of its previously valid stack.

The only advantage is that we can use non-signal-safe methods during stack walking. Profiling performance will not improve significantly, as the overhead of sending and handling the signal is an order of magnitude larger than the stack walking itself for small traces. Performance-wise, it could only make sense for huge traces (1000 or more frames).

Our central assumption is that the sampled thread takes some time to transition out of the signal handler, possibly longer than AsyncGetCallTrace takes to walk the topmost frames, which are the ones most likely to change during execution.

But: Timing with signals is hard to predict (see this answer on StackExchange), and if the assumption fails, the resulting trace is either bogus or the stack walking leads to “interesting” segmentation faults. I accidentally tested this when I initially implemented the signal handler in my async-profiler and made an error. I saw error messages in places that I had not seen before.

So the results could be imprecise / sometimes incorrect. But we’re already sampling, so approximations are good enough.

The JVM might crash during the stack walking because the ucontext might be invalid and the thread stack changes (so that the stack pointer in the ucontext points to an invalid value and more), but we should be able to reduce the crashes by using enough precautions in AsyncGetCallTrace and testing it properly (I already implemented tests with random ucontexts in the draft for AsyncGetStackTrace).

The other option is to catch any segmentation faults that occur inside AsyncGetCallTrace. We can do this because we walk the stack in a separate thread (and JFR does it as well, as I’ve written at the beginning of this post). We can implement this by leveraging the ThreadCrashProtection class, which has, quite rightfully, some disclaimers:

/*
 * Crash protection for the JfrSampler thread. Wrap the callback
 * with a sigsetjmp and in case of a SIGSEGV/SIGBUS we siglongjmp
 * back.
 * To be able to use this - don't take locks, don't rely on 
 * destructors, don't make OS library calls, don't allocate 
 * memory, don't print, don't call code that could leave
 * the heap / memory in an inconsistent state, or anything 
 * else where we are not in control if we suddenly jump out.
 */
class ThreadCrashProtection : public StackObj {
public:
  // ...
  bool call(CrashProtectionCallback& cb);
  // ...
};

We wrap the call to the actual AsyncGetCallTrace implementation of our second approach in this handler (source):

void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, 
 void* ucontext, jlong os_thread_id) {
  trace->num_frames = ticks_unknown_state;
  AsyncGetCallTraceCallBack cb(trace, depth, ucontext, 
                               os_thread_id);
  ThreadCrashProtection crash_protection;
  if (!crash_protection.call(cb)) {
    fprintf(stderr, "AsyncGetCallTrace: catched crash\n");
    if (trace->num_frames >= 0) {
      trace->num_frames = ticks_unknown_state;
    }
  }
}

This prevents all crashes related to walking the stack from crashing the JVM, which is also helpful for the AsyncGetCallTrace usage of the previous part of this article. The only difference is that crashes in the stack walking are considered a bug in a normal use case but are expected in this use case where we don’t stop the sampled thread.

Back to this peculiar case: The implementation in async-profiler is slightly more complex than just removing the busy waiting at the end. First, we must copy the ucontext in the signal handler because the ucontext pointer only points to a valid ucontext while the thread is stopped. Furthermore, we have to disable the native stack walking in the async-profiler, as it isn’t wrapped in code that catches crashes. We also have, for unknown reasons, to set the safemode option of async-profiler to 0.
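
The core of the change is copying the ucontext in the handler instead of keeping a pointer to it; roughly like this (names are assumptions, and the naive copy is one source of the concurrency problems mentioned below):

#include <atomic>
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static ucontext_t copied_ucontext;
static std::atomic<bool> ucontext_ready{false};

void signal_handler_without_wait(int signo, siginfo_t* info, void* ucontext) {
    // memcpy is async-signal-safe, but it does not deep-copy everything the
    // ucontext references (e.g. floating-point state)
    memcpy(&copied_ucontext, ucontext, sizeof(ucontext_t));
    ucontext_ready.store(true);
    // return immediately: the sampler thread walks the stack using
    // &copied_ucontext while this thread keeps running
}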

The implementation of the signal handler is simple (just remove the wait from the previous version). It results in the following sequence diagram:

Interaction between the sampler thread and the signal handlers when not blocking the sampled thread during the stack walking.

You can find the implementation on GitHub, albeit with known concurrency problems, but these are out-of-scope for this blog post and related to copying the ucontext atomically.

And now to the important question: How often did AsyncGetCallTrace crash? In the renaissance finagle-http benchmark (with a sampling interval of 10ms), it crashed in 592 of around 808000 calls, a crash rate of 0.07% and far better than expected.

The main problem can be seen when we look at the flame graphs (set the environment variable SKIP_WAIT to enable the modification):

Which looks not too dissimilar to the flame graph with busy waiting:

Many traces (the left part of the graph) are broken and do not appear in the second flame graph. Many of these traces seem to be aborted:

But this was an interesting experiment, and the implementation seems possible, albeit creating a safe and accurate profiler this way would be hard and probably not worthwhile. Catching the segmentation faults seems to be quite expensive: The runtime for the renaissance finagle-http benchmark is 83 seconds for the version with busy waiting and 84 seconds without, despite the latter producing worse results.

Evaluation

We can now compare the performance of the original implementation with the two prototypical implementations and the experimental implementation in a preliminary evaluation. I like using the benchmarks of the renaissance suite (version 0.14.2). For this example, I used the primarily single-core dotty benchmark with intervals of 1ms and 10ms:

java -agentpath:./build/lib/libasyncProfiler.so=start,\
interval=INTERVAL,event=wall,flamegraph,file=flame.html \
     -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints \
     -jar renaissance.jar BENCHMARK

The shorter interval makes the performance impact of changes to the profiling more visible. I’m profiling with my Threadripper 3995WX on Ubuntu using hyperfine (one warm-up run and ten measured runs each). The standard deviation is less than 0.4% in the following diagram, which shows the wall-clock time:

The number of obtained samples is roughly the same over all profiler runs, except for the experimental implementation, which produces around 12% fewer samples. All approaches seem to have a comparable overhead when considering wall-clock time. It’s different when considering user time:

This shows that there is a significant user-time performance penalty when not using the original approach. This is expected, as we’re engaging two threads instead of one during the sampling of a specific thread.

The wall-clock timings might therefore be affected by my CPU having enough cores so that the sampler and all other threads run fully concurrently.

I tried to evaluate all approaches with a benchmark that utilizes all CPU cores (finagle-http), but my two new approaches apparently have severe shortcomings, as they produced only around a quarter of the samples compared to the original async-profiler and OpenJDK combination. This is worth fixing, but out of scope for this blog post, which already took more than a week to write.

Conclusion

This was the serious part of the experiment: Using AsyncGetCallTrace in a separate thread is possible with minor modifications and offers many advantages (as discussed before). It especially provides a safer approach to profiling while not affecting performance if your system is not yet saturated: a typical trade-off between safety and performance. I think it should be up to the experienced performance engineer to decide, and profilers should offer both options when my JEP eventually makes out-of-thread walking available on stock OpenJDKs.

The implementations in both the OpenJDK and async-profiler also show how to quickly implement, test and evaluate different approaches with widely used benchmarks.

The initial question, “Couldn’t we just use AsyncGetCallTrace in a separate thread?” can be answered with a resounding “Yes!”. Sampling in separate threads has advantages, but we have to block the sampled thread during stack walking; omitting this leads to broken traces.

If you have anything to add or found a flaw in my approaches or my implementations, or any other suggestions, please let me know 🙂

I hope this article gave you a glimpse into my current work and the future of low-level Java profiling APIs.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Instrumenting Java Code to Find and Handle Unused Classes

This blog post is about writing a Java agent and instrumentation code to find unused classes and dependencies in your project. Knowing which classes and dependencies are not used in your application lets you remove them, so you no longer have to worry about the bugs and problems in those dependencies and classes.

There are multiple tools out there for Gradle and Maven (thanks, Marit) that do this statically or dynamically (like the one described in the paper Coverage-Based Debloating for Java Bytecode, thanks, Wolfram). Static tools are based on static program analysis and are usually safer, as they only remove classes that can statically be proven never to be used. But these tools generally struggle with reflection and code generation, which frameworks like Spring use heavily. Dynamic tools typically instrument the bytecode of the Java application and run it to see which parts of the application are used in practice. These tools can deal with reflection and are more precise, removing larger portions of the code.

The currently available tools may suffice for your use case, but they are complex pieces of software, hard to reason about, and hard to understand. Therefore, this post aims to write a prototypical dynamic tool to detect unused classes. This is like the profiler of my Writing a Profiler in 240 Lines of Pure Java blog post, done mainly for educational purposes, although the tool might be helpful in certain real-world use cases. As always, you can find the final MIT-licensed code on GitHub in my dead-code-agent repository.

Main Idea

I make one simplification compared to many of the more academic tools: I only deal with code with class-level granularity. This makes it far more straightforward, as it suffices to automatically instrument the static initializers of every class (and interface), turning

class A {
    private int field;
    public void method() {...}
}

into

class A {
    static {
       Store.getInstance().processClassUsage("A");
    }
    private int field;
    public void method() {...}
}

to record the first usage of the class A in a global store. Another advantage is that there is minimal overhead when recording the class usage information, as only the first usage of every class has the recording overhead.

Static initializers are called whenever a class is initialized, which happens in the following circumstances:

A class or interface T will be initialized immediately before the first occurrence of any one of the following:

  • T is a class and an instance of T is created.
  • A static method declared by T is invoked.
  • A static field declared by T is assigned.
  • A static field declared by T is used and the field is not a constant variable (§4.12.4).

When a class is initialized, its superclasses are initialized (if they have not been previously initialized), as well as any superinterfaces (§8.1.5) that declare any default methods (§9.4.3) (if they have not been previously initialized). Initialization of an interface does not, of itself, cause initialization of any of its superinterfaces.

When Initialization Occurs – Java Language Specification

Adding code at the beginning of every class’s static initializers lets us obtain knowledge on all used classes and interfaces. Interfaces don’t have static initializers in Java source code, but the bytecode supports this nonetheless, and we’re only working with bytecode here.

We can then use this information to either remove all classes that are not used from the application’s JAR or log an error message whenever such a class is instantiated:

class UnusedClass {
    static {
       System.err.println("Class UnusedClass is used " + 
                          "which is not allowed");
    }
    private int field;
    public void method() {...}
}

This has the advantage that we still log when our assumption on class usage is broken, but the program doesn’t crash, making it more suitable in production settings.

Structure

The tool consists of two main parts:

  • Instrumenter: Instruments the JAR and removes classes, used both for modifying the JAR to obtain the used classes and to remove unused classes or add error messages (as shown above)
  • Instrumenting Agent: This agent is similar to the Instrumenter but is implemented as an instrumenting Java agent. Both instrumentation methods have advantages and disadvantages, which I will explain later.

This leads us to the following workflow:

Workflow of the dead-code analyzer

Usage

Before I dive into the actual code, I’ll present you with how to use the tool. Skip this section if you’re only here to see how to implement an instrumenting agent 🙂

You first have to download and build the tool:

git clone https://github.com/parttimenerd/dead-code-agent
cd dead-code-agent
mvn package

# and as demo application the spring petclinic
git clone https://github.com/spring-projects/spring-petclinic
cd spring-petclinic
mvn package
# make the following examples more concise
cp spring-petclinic/target/spring-petclinic-3.0.0-SNAPSHOT.jar \
   petclinic.jar

The tool is written in Java 17 (you should be using this version anyways), which is the only system requirement.

Using the Instrumenting Agent to Obtain the Used Classes

The instrumenting agent can be started at JVM startup:

java -javaagent:./target/dead-code.jar=output=classes.txt \
     -jar petclinic.jar

This will record all loaded and used classes in the classes.txt file, which includes lines like:

u ch.qos.logback.classic.encoder.PatternLayoutEncoder 
l ch.qos.logback.classic.joran.JoranConfigurator 
u ch.qos.logback.classic.jul.JULHelper 
u ch.qos.logback.classic.jul.LevelChangePropagator

This tells you that the PatternLayoutEncoder class has been used and that the JoranConfigurator class has only been loaded but not used. Loaded means, in our context, that the instrumenting agent instrumented this class.

Not all classes can be instrumented. It is impossible to, for example, add static initializers to classes that were loaded before the instrumentation agent started; this is not a problem, as we can start the agent just after all JDK classes have been loaded. Removing JDK classes is possible with jlink, but instrumenting these classes is out of scope for this article, as they are far harder to instrument and most people don’t consider these classes.

The instrumentation agent is not called for some Spring Boot classes for reasons unknown to me. This makes the agent approach unsuitable for Spring Boot applications and led me to the development of the main instrumenter:

Using the Instrumenter to Obtain the Used Classes

The instrumenter lets you create an instrumented JAR that records all used classes:

java -jar target/dead-code.jar classes.txt \
          instrument petclinic.jar instrumented.jar

This will throw a few errors, but remember: it’s still a prototype.

You can then run the resulting JAR to obtain the list of used classes (like above). Just use the instrumented.jar like your application JAR:

java -jar instrumented.jar

The resulting classes.txt is similar to the file produced by the instrumenting agent. The two differences are that we cannot observe classes that are only loaded but never used, and that we don’t miss any Spring-related classes. Hopefully, I will find time to investigate the issue related to Spring’s classloaders.

Using the Instrumenter to Log Usages of Unused Classes

The list of used classes can be used to log the usage of classes not used in the recording runs:

java -jar target/dead-code.jar classes.txt \
          instrumentUnusedClasses petclinic.jar logging.jar

This will log the usage of all classes not marked as used in classes.txt on standard error, or exit the program if you pass the --exit option to the instrumenter.

If you, for example, recorded the used classes of a run where you did not access the petclinic on localhost:8080, then executing the modified logging.jar and accessing the petclinic results in output like:

Class org.apache.tomcat.util.net.SocketBufferHandler is used which is not allowed
Class org.apache.tomcat.util.net.SocketBufferHandler$1 is used which is not allowed
Class org.apache.tomcat.util.net.NioChannel is used which is not allowed
Class org.apache.tomcat.util.net.NioChannel$1 is used which is not allowed
...

An exciting feature of the instrumenter is that the format of the used-classes file is not restricted to what the instrumented JARs produce. It also supports wildcards:

u org.apache.tomcat.*

This tells the instrumenter that all classes whose fully-qualified name starts with org.apache.tomcat. should be considered used.

r org.apache.* used apache

This tells the instrumenter to instrument the JAR to report all usages of Apache classes, adding the (optional) message “used apache.”

These two additions make the tool quite versatile.
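
A small sketch of how such pattern lines could be turned into a predicate (illustrative only, not the tool’s actual parser):

import java.util.function.Predicate;

public class Patterns {
    // maps "org.apache.tomcat.*" or a plain class name to a predicate
    // over fully-qualified class names
    public static Predicate<String> forPattern(String pattern) {
        if (pattern.endsWith(".*")) {
            // keep the trailing dot so "org.apache.tomcat.*" does not
            // match "org.apache.tomcatx.Foo"
            String prefix = pattern.substring(0, pattern.length() - 1);
            return className -> className.startsWith(prefix);
        }
        return pattern::equals;
    }
}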

Writing the Instrumentation Agent

We start with the instrumentation agent and later go into the details of the Instrumenter.

The agent itself consists of three major parts:

  • Main class: Entry point for the agent, registers the ClassTransformer as a transformer
  • ClassTransformer class: Instruments all classes as described before
  • Store class: Deals with handling and storing the information on used and loaded classes

A challenge here is that all instrumented classes will use the Store. We, therefore, have to put the store onto the bootstrap classpath, making it visible to all classes. There are multiple ways to do this:

  • Building a runtime JAR directly in the agent using the JarFile API, including the bytecode of the Store and its inner classes.
  • Building an additional dead-code-runtime.jar using a second maven configuration, including this JAR as a resource in our agent JAR, and copying it into a temporary file in the agent.

Both approaches are valid, but the second approach seems more widely used, and the build system includes all required classes and warns of missing ones.

We build the runtime JAR by creating a new maven configuration that only includes the me.bechberger.runtime package where the Store resides:

<build>
  ...
  <sourceDirectory>
    ${project.basedir}/src/main/java/me/bechberger/runtime
  </sourceDirectory>
  ...
</build>

Main Class

The main class consists mainly of the premain method which deletes the used classes file, loads the runtime JAR, and registers the ClassTransformer:

public class Main {

    public static void premain(String agentArgs, 
      Instrumentation inst) {
        AgentOptions options = new AgentOptions(agentArgs);
        // clear the file
        options.getOutput().ifPresent(out -> {
            try {
                Files.deleteIfExists(out);
                Files.createFile(out);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        try {
            inst.appendToBootstrapClassLoaderSearch(
                new JarFile(getExtractedJARPath().toFile()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        inst.addTransformer(new ClassTransformer(options), true);
    }
    // ...
}

I’m omitting the AgentOptions class, which parses the options passed to the agent (like the output file).

The premain method uses the getExtractedJARPath method to extract the runtime JAR. This extracts the JAR from the resources:

    private static Path getExtractedJARPath() throws IOException {
        try (InputStream in = Main.class.getClassLoader()
                 .getResourceAsStream("dead-code-runtime.jar")){
            if (in == null) {
                throw new RuntimeException("Could not find " + 
                    "dead-code-runtime.jar");
            }
            File file = File.createTempFile("runtime", ".jar");
            file.deleteOnExit();
            Files.copy(in, file.toPath(), 
                       StandardCopyOption.REPLACE_EXISTING);
            return file.toPath().toAbsolutePath();
        }
    }

ClassTransformer Class

This transformer implements the ClassFileTransformer interface to transform all loaded classes.

A transformer of class files. An agent registers an implementation of this interface using the addTransformer method so that the transformer’s transform method is invoked when classes are loaded, redefined, or retransformed. The implementation should override one of the transform methods defined here. Transformers are invoked before the class is defined by the Java virtual machine.

ClassFileTransformer documentation

We could do all the bytecode modification ourselves. This is error-prone and complex, so we use the Javassist library, which provides a neat API to insert code into various class parts.

Our ClassTransformer has to implement the transform method:

public byte[] transform(Module module, 
                        ClassLoader loader, 
                        String className, 
                        Class<?> classBeingRedefined,
                        ProtectionDomain protectionDomain, 
                        byte[] classfileBuffer)

Transforms the given class file and returns a new replacement class file.

Parameters:

  • module – the module of the class to be transformed
  • loader – the defining loader of the class to be transformed, may be null if the bootstrap loader
  • className – the name of the class in the internal form of fully qualified class and interface names as defined in The Java Virtual Machine Specification. For example, "java/util/List".
  • classBeingRedefined – if this is triggered by a redefine or retransform, the class being redefined or retransformed; if this is a class load, null
  • protectionDomain – the protection domain of the class being defined or redefined
  • classfileBuffer – the input byte buffer in class file format – must not be modified

ClassFileTransformer documentation

Our implementation first checks that we’re not instrumenting our agent or some JDK code:

if (className.startsWith("me/bechberger/runtime/Store") || 
    className.startsWith("me/bechberger/ClassTransformer") || 
    className.startsWith("java/") || 
    className.startsWith("jdk/internal") || 
    className.startsWith("sun/")) {
            return classfileBuffer;
}

This prevents instrumentation problems and keeps the list of used classes clean. We then use a statically defined ScopedClassPoolFactory to create a class pool for the given class loader, parse the bytecode using javassist and transform it using our transform(String className, CtClass cc) method:

        try {
            ClassPool cp = scopedClassPoolFactory
                 .create(loader, ClassPool.getDefault(),
                         ScopedClassPoolRepositoryImpl
                             .getInstance());
            CtClass cc = cp.makeClass(
                 new ByteArrayInputStream(classfileBuffer));
            if (cc.isFrozen()) {
                // frozen classes cannot be modified
                return classfileBuffer;
            }
            // classBeingRedefined is null in our case
            transform(className, cc);
            return cc.toBytecode();
        } catch (CannotCompileException | IOException | 
                 RuntimeException | NotFoundException e) {
            e.printStackTrace();
            return classfileBuffer;
        }

The actual instrumentation is now done with the javassist API:

    private void transform(String className, CtClass cc) 
      throws CannotCompileException, NotFoundException {
        // replace "/" with "." in the className
        String cn = formatClassName(className);
        // handle the class load
        Store.getInstance().processClassLoad(cn, 
            cc.getClassFile().getInterfaces());
        // insert the call to processClassUsage at the beginning
        // of the static initializer
        cc.makeClassInitializer().insertBefore(
             String.format("me.bechberger.runtime.Store" +
                 ".getInstance().processClassUsage(\"%s\");", 
                 cn));
    }

You might wonder why we’re also recording the interfaces of every class. This is because the static initializers of interfaces are not called when a class implementing them is initialized. We, therefore, have to walk the interface tree ourselves, as sketched below. Static initializers of parent classes are called, so we don’t have to handle parent classes ourselves.
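
To make this concrete, here is a simplified sketch of how the Store could propagate usage to interfaces (the real Store in the repository also writes the output file and differs in its details):

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class Store {
    private static final Store INSTANCE = new Store();
    // class name -> names of its directly implemented interfaces,
    // recorded by the transformer at load time
    private final Map<String, List<String>> interfacesOfClass =
        new ConcurrentHashMap<>();
    private final Set<String> usedClasses = ConcurrentHashMap.newKeySet();

    public static Store getInstance() { return INSTANCE; }

    public void processClassLoad(String className, List<String> interfaces) {
        interfacesOfClass.put(className, interfaces);
    }

    public void processClassUsage(String className) {
        if (usedClasses.add(className)) {
            // walk the interface tree: interfaces of interfaces are recorded
            // the same way, so the recursion covers them as well
            for (String itf : interfacesOfClass.getOrDefault(className, List.of())) {
                processClassUsage(itf);
            }
        }
    }
}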

Instrumenter

The main difference to the agent is that the instrumenter transforms the bytecode ahead of time, processing all files in the JAR and writing a new JAR back. This new JAR is then executed, which has the advantage that we can instrument all classes in the JAR (even with Spring’s classloader magic). The central part of the Instrumenter is the ClassAndLibraryTransformer, which can be targeted to a specific class transformation use case by setting its different fields:

public class ClassAndLibraryTransformer {
    /** Source JAR */
    private final Path sourceFile;
    /** 
     * Include a library in the output JAR.
     * A library is a JAR inside this JAR and 
     * its name is the file name without version identifier 
     * and suffix.
     */
    private Predicate<String> isLibraryIncluded;
    /** Include a class in the output JAR */
    private Predicate<String> isClassIncluded;
    /** 
     * Transforms the class file, might be null.
     * Implemented using the javassist library as shown before.
     */
    private BiConsumer<ClassPool, CtClass> classTransformer;

    record JarEntryPair(String name, InputStream data) {
        static JarEntryPair of(Class<?> klass, String path)
          throws IOException {
            // obtain the bytecode from the dead-code JAR
            return new JarEntryPair(path, 
                klass.getClassLoader().getResourceAsStream(path));
        }
    }
    /** 
     * Supplies a list of class files that should 
     * be added to the JAR, like the Store related classes
     */
    private Supplier<List<JarEntryPair>> miscFilesSupplier = 
         List::of;
    /** Output JAR */
    private final OutputStream target;
    // ...
}

This class is used for instrumentation and removing classes and nested JARs/libraries, sharing most of the code between both.

The central entry point of this class is the process method, which iterates over all entries of the sourceFile JAR using the JarFile and JarOutputStream APIs:

    void process(boolean outer) throws IOException {
        try (JarOutputStream jarOutputStream = 
             new JarOutputStream(target); 
            JarFile jarFile = new JarFile(sourceFile.toFile())) {
            jarFile.stream().forEach(jarEntry -> {
                try {
                    String name = jarEntry.getName();
                    if (name.endsWith(".class")) {
                        processClassEntry(jarOutputStream, 
                            jarFile, jarEntry);
                    } else if (name.endsWith(".jar")) {
                        processJAREntry(jarOutputStream, 
                            jarFile, jarEntry);
                    } else {
                        processMiscEntry(jarOutputStream, 
                            jarFile, jarEntry);
                    }
                } catch (IOException e) { 
                    // .forEach forces us to wrap exceptions
                    throw new RuntimeException(e);
                }
            });
            if (outer) { // add miscellaneous class files
                for (JarEntryPair miscFile : 
                        miscFilesSupplier.get()) {
                    // create a new entry
                    JarEntry jarEntry = 
                        new JarEntry(miscFile.name());
                    jarOutputStream.putNextEntry(jarEntry);
                    // add the file contents
                    miscFile.data().transferTo(jarOutputStream);
                }
            }
        }
    }

Processing entries of the JAR file that are neither class files nor JARs consists only of copying the entry directly to the new file:

    private static void processMiscEntry(
      JarOutputStream jarOutputStream, 
      JarFile jarFile, JarEntry jarEntry) throws IOException {
        jarOutputStream.putNextEntry(jarEntry);
        jarFile.getInputStream(jarEntry)
               .transferTo(jarOutputStream);
    }

Such files are typically resources like XML configuration files.

Transforming class file entries is slightly more involved: We check whether we should include the class defined in the class file and transform it if necessary:

    private void processClassEntry(
      JarOutputStream jarOutputStream, 
      JarFile jarFile, JarEntry jarEntry) throws IOException {
        String className = classNameForJarEntry(jarEntry);
        if (isClassIncluded.test(className) || 
              isIgnoredClassName(className)) {
            jarOutputStream.putNextEntry(jarEntry);
            InputStream classStream = 
                jarFile.getInputStream(jarEntry);
            if (classTransformer != null && 
                  !isIgnoredClassName(className)) {
                // transform if possible and required
                classStream = transform(classStream);
            }
            classStream.transferTo(jarOutputStream);
        } else {
            System.out.println("Skipping class " + className);
        }
    }

We ignore class files related to package-info or module-info here, as they don’t contain valid classes. This is encapsulated in the isIgnoredClassName method. The implementation of the transform method is similar to the transform method of the instrumenting agent, using the classTransformer consumer for the actual class modification (see the sketch below).
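
The transform step could look roughly like this (a simplified sketch of a ClassAndLibraryTransformer method, assuming the javassist types shown earlier; the real code in the repository differs, e.g. in its error handling):

    private InputStream transform(InputStream classStream) throws IOException {
        try {
            // parse the class file with javassist, let the configured
            // consumer modify it, and return the new bytecode as a stream
            ClassPool cp = ClassPool.getDefault();
            CtClass cc = cp.makeClass(classStream);
            classTransformer.accept(cp, cc);
            return new ByteArrayInputStream(cc.toBytecode());
        } catch (CannotCompileException e) {
            throw new IOException(e);
        }
    }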

A transforming consumer to log the usage of every unused class looks as follows, assuming that isClassUsed is a predicate that returns true if the passed class is used and that messageSupplier supplies specific messages that are output additionally:

(ClassPool cp, CtClass cc) -> {
    String className = cc.getName();
    if (isClassUsed.test(className)) {
        return;
    }
    try {
        String message = messageSupplier.apply(className);
        cc.makeClassInitializer().insertBefore(
            String.format("System.err.println(\"Class %s " + 
                          "is used which is not allowed%s\");" +
                          "if (%s) { System.exit(1); }", 
                className, 
                message.isBlank() ? "" : (": " + message), 
                exit));
    } catch (CannotCompileException e) {
        throw new RuntimeException(e);
    }
};

The last thing that I want to cover is the handling of nested JARs in the processJAREntry(JarOutputStream jarOutputStream, JarFile jarFile, JarEntry jarEntry) method. Nested JARs are pretty standard with Spring and bundle libraries with your application. To quote the Spring documentation:

Java does not provide any standard way to load nested jar files (that is, jar files that are themselves contained within a jar). This can be problematic if you need to distribute a self-contained application that can be run from the command line without unpacking.

To solve this problem, many developers use “shaded” jars. A shaded jar packages all classes, from all jars, into a single “uber jar”. The problem with shaded jars is that it becomes hard to see which libraries are actually in your application. It can also be problematic if the same filename is used (but with different content) in multiple jars. Spring Boot takes a different approach and lets you actually nest jars directly.

The Executable JAR Format – Spring Documentation

Our method first checks that we should include the nested JAR and, if so, extracts it into a temporary file. We extract the JAR because the JarFile API can only work with files. We then use the ClassAndLibraryTransformer recursively:

    private void processJAREntry(JarOutputStream jarOutputStream, 
      JarFile jarFile, JarEntry jarEntry) throws IOException {
        String name = jarEntry.getName();
        String libraryName = Util.libraryNameForPath(name);
        if (!isLibraryIncluded.test(libraryName)) {
            System.out.println("Skipping library " + libraryName);
            return;
        }
        Path tempFile = Files.createTempFile("nested-jar", ".jar");
        tempFile.toFile().deleteOnExit();
        // copy entry over
        InputStream in = jarFile.getInputStream(jarEntry);
        Files.copy(in, tempFile, 
                   StandardCopyOption.REPLACE_EXISTING);
        ClassAndLibraryTransformer nestedJarProcessor;
        // create new JAR file
        Path newJarFile = Files.createTempFile("new-jar", 
                                               ".jar");
        newJarFile.toFile().deleteOnExit();
        try (OutputStream newOutputStream = 
              Files.newOutputStream(newJarFile)) {
            nestedJarProcessor = 
                new ClassAndLibraryTransformer(tempFile, 
                    isLibraryIncluded, isClassIncluded, 
                    classTransformer,
                    newOutputStream);
                    nestedJarProcessor.process(false);
        }
        // create an uncompressed entry
        JarEntry newJarEntry = new JarEntry(jarEntry.getName());
        newJarEntry.setMethod(JarEntry.STORED);
        newJarEntry.setCompressedSize(Files.size(newJarFile));
        CRC32 crc32 = new CRC32();
        crc32.update(Files.readAllBytes(newJarFile));
        newJarEntry.setCrc(crc32.getValue());
        jarOutputStream.putNextEntry(newJarEntry);
        Files.copy(newJarFile, jarOutputStream);
    }

Nested JAR files come with a few restrictions; the most notable is the limitation on ZIP compression:

The ZipEntry for a nested jar must be saved by using the ZipEntry.STORED method. This is required so that we can seek directly to individual content within the nested jar. The content of the nested jar file itself can still be compressed, as can any other entries in the outer jar.

The Executable JAR Format – Spring Documentation

Therefore, the code creates a JarEntry that is just stored and not compressed. But this requires us to compute and set the CRC and file size ourselves; this is done automatically for compressed entries.

All other code can be found in the GitHub repository of the project. Feel free to adapt the code and use it in your own projects.

Conclusion

Dynamic dead-code analyses are great for finding unused code and classes, helping to reduce the attack surface. Implementing such tools in a few lines of Java code is possible, creating an understandable tool that offers less potential for surprises for its users. The tool developed in this blog post is a prototype of a dead-code analysis that could be run in production to find all used classes in a real-world setting.

Writing instrumentation agents using the JDK instrumentation APIs combined with the javassist library allows us to write a somewhat functioning agent in hours.

I hope this blog post helped you to understand the basics of finding unused classes dynamically and implementing your own instrumentation agent.

Thanks to Wolfram Fischer from SAP Security Research Germany for nerd-sniping me, leading me to write the tool and this blog post. This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.