Welcome back to my series on the new CPU-time profiler in Java 25. In the previous blog post, I covered the implementation of the new profiler. In this week’s blog post, I’ll dive deep into the central request queue, focusing on deciding its proper size.
The JfrCPUTimeTraceQueue allows the signal handler to record sample requests, which the out-of-thread sampler and the safepoint handler then process. It is thus the central data structure of the profiler:

This queue is thread-local and pre-allocated, as it’s used in the signal handler, so the correct sizing is critical:
- If the size is too small, you’ll lose many samples because the signal handler can’t record sample requests.
- If you size it too large, you waste lots of memory. A sampling request is 48 bytes, so a queue with 500 elements (currently the default) requires 24kB. This adds up fast if you have more than a few threads.
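To make the memory numbers concrete, here is the back-of-the-envelope arithmetic as a small sketch (the 48-byte request size and the default capacity of 500 are from above; the thread count of 1000 is just an example):

```java
public class QueueMemory {
    static final long REQUEST_BYTES = 48;      // size of one sample request
    static final long DEFAULT_CAPACITY = 500;  // current default queue size

    // memory used by the pre-allocated queues of `threads` threads, in bytes
    static long queueBytes(long threads) {
        return threads * DEFAULT_CAPACITY * REQUEST_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(queueBytes(1));    // 24000 bytes = 24 kB per thread
        System.out.println(queueBytes(1000)); // 24000000 bytes = 24 MB for 1000 threads
    }
}
```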
So, in this blog post, we're mainly concerned with setting the correct default size and with discussing a potential solution to the whole problem.
Current Queue Sizing
As discussed in my previous blog post, the size of the queue is currently determined as follows:
Whenever the JVM creates a new thread, we create a queue of size 500 if the current sampling interval is 10 milliseconds or larger, and of size “500 * 10 / interval” for anything lower. For example, an interval of 1 millisecond leads to a queue size of 5000.
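In code, the current sizing rule looks roughly like this (a sketch; names are mine, and the actual HotSpot code differs in naming and types):

```java
public class CurrentQueueSizing {
    static final int BASE_CAPACITY = 500;    // default queue size
    static final int BASE_INTERVAL_MS = 10;  // interval the default is tuned for

    // queue size for a newly created thread, given the sampling interval in ms
    static int queueSize(int intervalMs) {
        if (intervalMs >= BASE_INTERVAL_MS) {
            return BASE_CAPACITY;
        }
        return BASE_CAPACITY * BASE_INTERVAL_MS / intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(queueSize(20)); // 500
        System.out.println(queueSize(10)); // 500
        System.out.println(queueSize(1));  // 5000, i.e. ~240 kB per thread
    }
}
```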
The reason for this is some back-of-the-envelope math, with the one significant advantage of being conservative. But it has a few problems:
- What happens when the interval is lowered during profiling? Shouldn’t this also resize all queues? And if not, why are we making the queue size dependent on the interval in the first place?
- It potentially wastes a lot of memory for 1ms intervals: We allocate around a quarter megabyte of memory for every thread.
So, let’s replace the guesstimate of the queue size with a proper analysis. For this, I devised two test cases: one a probable worst case, and the other the well-known Renaissance benchmark. We use these benchmarks combined with a JDK branch with added logging to analyse the relationship between queue size and sample loss.
Our guard rail is that, even in the worst case, we want a sample loss of less than 1%, if possible. But we mostly care about intervals of 10ms and larger.
The Worst Case Benchmark
When we run normal Java code, like the Renaissance benchmark mentioned before, we mostly hit Java top frames and get safepoint polls pretty frequently, as often as every few milliseconds. So can we do worse? Because the profiler relies on safepoints, it has difficulties with long-running native methods, during which the safepoint handler is never called. To remedy this, an out-of-thread sampler thread regularly walks over all threads to check whether a thread is in native code and needs its queue drained. But this thread is, of course, a bottleneck. Let’s try to exploit that. We therefore want a test case that runs only in native code and in which the stack walking itself takes a long time. So here is my approach:
I call it the native benchmark. It
- spins up as many threads as there are CPU cores, each of which
- calls a method m1, which directly calls a method m2, …, until we reach a given stack depth
- the methods are distinct for every thread
- the last-level Java method just calls a native method that wastes a given number of seconds on the CPU
This should give us somewhat of a worst-case benchmark for the current sampler implementation. Of course, you could probably come up with something worse, but I think my benchmark is almost certainly worse than anything the profiler has to deal with in reality.
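A sketch of this benchmark in plain Java (the real version uses distinct methods per thread and an actual native method; here recursion stands in for the call chain, and `burnCpu` is a hypothetical Java stand-in for the native CPU-wasting method):

```java
public class NativeBenchmark {
    // recurse until the requested stack depth is reached, then burn CPU;
    // the real benchmark uses distinct methods per thread instead of recursion
    static long callChain(int depth, long burnMillis) {
        if (depth > 1) {
            return callChain(depth - 1, burnMillis);
        }
        return burnCpu(burnMillis);
    }

    // hypothetical stand-in for the native method that wastes CPU time
    static long burnCpu(long millis) {
        long end = System.nanoTime() + millis * 1_000_000;
        long iterations = 0;
        while (System.nanoTime() < end) {
            iterations++; // busy loop, stays on-CPU
        }
        return iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            // stack depth 100, 100 ms of CPU burning per thread (short, for the sketch)
            workers[i] = new Thread(() -> callChain(100, 100));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}
```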
We’ll run this on a 16-core ThinkPad and a 128-core Threadripper tower. On the latter, a single thread has to deal with the requests made by 128 threads (well, we are nice and run only 120 parallel native threads to keep the machine usable, but it’s still probably one of the worst benchmarks).
Worst Case vs Loss
The following is a logarithmic plot of the sample loss versus the configured queue size (1 up to 1000) for a given sampling interval. It shows the loss for four data series each:
- S100, D5s: Runs the native benchmark with stack depth 100 and a native method duration (it’s called in a loop) of 5 seconds for 250 seconds in a loop.
- S100, D250s: Native method runs for the whole time
- …
This benchmark on our large machine shows that with 10ms and larger intervals, a queue size of only 10 is sufficient. A queue size of 50 allows for proper handling of 5ms intervals, and a queue size of 200 handles 2ms intervals and is also passable for 1ms intervals:

So, it’s hard to argue that we need a queue of 500 in 10ms+ cases. Setting an interval-independent queue size of 100 should be good enough for all but the worst cases.
This would essentially save us around 19 kB for every thread. By the way, these are the only test cases where the out-of-thread request handling runs substantially. The following test case is dominated by drains at safepoints, which increases the stability and safety of the stack sampling.
Renaissance vs Loss
Let’s look at the Renaissance benchmark, run on the same tower:

This shows us that the sample loss is essentially independent of the queue size, even for a queue size as small as one. Note that the P99 of the number of elements in the queue is small even for 1ms intervals:

So essentially, we can choose a queue size of two and be fine.
But where does the loss come from when it is not from a full queue? It mostly comes from threads in the wrong (not-sampleable) state, like when the thread is currently processing a safepoint to drain the queue.
On the Laptop
The plots look similar when running on the laptop. The only difference is that in the native worst case, a queue size of 50 is sufficient (please be aware that I didn’t benchmark the 2ms case).

The Renaissance benchmark also shows us that a queue size of 2 is sufficient for all intervals:

Preliminary Conclusion
Our findings show that the queue size should be at most 200 and at least 2, with a trade-off to be made between memory consumption and worst-case loss.
If I’m honest, I didn’t expect this. I thought that we’d find that we need a queue size of 20 or so to handle the Renaissance benchmark. But the findings show that in many cases, we don’t actually need any queue. Just one single request entry would be sufficient for many use cases (and standard sampling intervals like 10 or 20 milliseconds).
So, I could stop here, pick the safe queue size of 200, remove the interval-based size computation from the OpenJDK code, and submit the change. But what if we could do something more intelligent, risking a few lost samples in the worst case in exchange for a lot of saved memory in the real world? This is precisely what I propose with dynamic queue sizing:
Dynamic Queue Sizing
Consider now that we pick a starting queue size (the x-axis in the plot) and then scale it up as needed, resulting in a reasonable loss rate even for small starting queue sizes.

In any real-world use case, the queue size is almost never going to exceed a value like 2, so we’re saving a lot of memory and can still reasonably handle the worst-case scenario.
I propose the following algorithm, which is executed after every queue drainage loop; at that point, the queue is guaranteed to contain no data and to be already locked.
#define CPU_TIME_QUEUE_CAPACITY_MAX 2000 // only used in the worst case

float ratio = (float) lost_samples_due_to_queue_full / (float) _capacity;
int factor = 1;
if (ratio > 8) {           // idea: scale the queue quickly in the worst case
  factor = (int) ratio;
} else if (ratio > 2) {    // more than 66% of all samples are lost,
  factor = 8;              // so increase drastically
} else if (ratio > 0.5) {
  factor = 4;
} else if (ratio > 0.01) { // at most half of the samples are lost
  factor = 2;
}
if (factor > 1) {
  u4 new_size = _capacity * factor > CPU_TIME_QUEUE_CAPACITY_MAX
                    ? CPU_TIME_QUEUE_CAPACITY_MAX
                    : _capacity * factor;
  resize(new_size);
}
This algorithm allows our sampler to scale up the queue size quickly. Here is an example log of a queue in the native benchmark:
- Increasing queue size to 56 (factor: 56, ratio: 56.00) for thread 60544
- Increasing queue size to 224 (factor: 4, ratio: 0.82) for thread 60544
- Increasing queue size to 448 (factor: 2, ratio: 0.43) for thread 60544
- Increasing queue size to 896 (factor: 2, ratio: 0.21) for thread 60544
- Increasing queue size to 1792 (factor: 2, ratio: 0.08) for thread 60544
- Increasing queue size to 2000 (factor: 2, ratio: 0.10) for thread 60544
This shows that the queue reaches its maximum size in only a few iterations.
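The scaling schedule from the log can be reproduced with a small simulation, a direct Java transcription of the algorithm above (the `lost` values below are chosen to match the ratios in the example log):

```java
public class DynamicSizing {
    static final int CAPACITY_MAX = 2000; // only reached in the worst case

    // next capacity after a drain, given the current capacity and the
    // number of samples lost because the queue was full
    static int nextCapacity(int capacity, int lost) {
        float ratio = (float) lost / (float) capacity;
        int factor = 1;
        if (ratio > 8) {           // scale quickly in the worst case
            factor = (int) ratio;
        } else if (ratio > 2) {
            factor = 8;
        } else if (ratio > 0.5) {
            factor = 4;
        } else if (ratio > 0.01) {
            factor = 2;
        }
        return Math.min(capacity * factor, CAPACITY_MAX);
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(1, 56));     // 56   (ratio 56, factor 56)
        System.out.println(nextCapacity(56, 46));    // 224  (ratio ~0.82, factor 4)
        System.out.println(nextCapacity(224, 96));   // 448  (ratio ~0.43, factor 2)
        System.out.println(nextCapacity(1792, 179)); // 2000 (factor 2, capped at max)
    }
}
```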
But, of course, this has the drawback of potentially losing samples at the beginning, while the queue finds its size. Therefore, we
- choose a queue size that is still fairly conservative, like 20
- and never reduce the queue size after increasing it.
The latter also prevents the queue size from being constantly modified. The algorithm is, by design, somewhat conservative and coarse, usually increasing the queue more than needed. It was developed by analyzing the actual worst-case increment behaviour.
Of course, there is a worst-case scenario it can’t properly handle: when you constantly create new threads that only run in native for as long as necessary to overflow the queue and then stop.

Interestingly, the high sample loss with the 1ms interval is (apparently) not only due to the queue overflowing:

It is also due to the JVM being in the VM state (which we seem to hit more often) and to the thread being blocked (we block threads while sampling):

Please take these numbers with more than a grain of salt. They are preliminary findings that illustrate the current issues.
More Plots
Now for some plots related to the dynamic queue sizing. First, a plot that shows the maximum memory consumption of all queues for the Renaissance benchmark on my laptop:

This shows, as expected, that the loss rate is not really correlated with the amount of queue memory and that memory consumption grows in direct proportion to the minimum size of each queue. The plot of the queue-size increments shows that we don’t see any significant number of them at 10 or 20ms intervals, only with the 1ms interval and an initial queue size below 20. And even then, the number is not significant for a benchmark that runs for a few minutes:

But the cause for sample loss is also interesting here:

In contrast to the native worst-case benchmark, the dominant cause of loss is (apparently) the thread being in the VM state (i.e., sampling), but the loss is far lower than in the worst-case benchmark:

One explanation is that in the native worst case, the p99.9 of the drainage duration is 220ms, around ten times as much as in the Renaissance laptop case; the same holds for the average duration. This could explain the difference.
Also interesting is the distribution of the average time to drain the queue and to create a sample request in the signal handler (please ignore the connecting lines):

The data fluctuates, but shows that draining the queue at safepoints and creating a sample request is fast.

Please don’t ask me why both are in the same ballpark. It could be because in the signal handler, we have to walk the native stack until we find the first Java frame. Renaissance should have an average, moderately deep Java stack, so the number of native frames on top can be comparable. Another factor is that the average queue size is between one and two.
As before: Take this data with a few grains of salt.
Conclusion
Proper queue sizing is tricky. This blog post shows that my back-of-the-envelope guesstimate of 500 as the default queue size is highly wasteful. But the new dynamic queue sizing ensures that both the worst-case and real-world use cases work reasonably well while still reducing memory usage. You can find the proposed change to OpenJDK on GitHub.
Do you have any opinions on my proposal or my blog post? Feel free to comment here or on social media.
P.S.: I cannot recommend creating such log-analysis code using AI agents. It’s nice to quickly get a first prototype, but its lack of proper data structures makes the code really hard to understand. I probably would have been faster coding it by hand (and only using AI to generate parts of it).

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.




