JFR Timestamps and System.nanoTime

Did you ever wonder whether JFR timestamps use the same time source as System.nanoTime? This is important when you have miscellaneous logging besides JFR events; otherwise, you could not match JFR events and your logging properly. We assume here that you use System.nanoTime and not less-suited timing information from System.currentTimeMillis.

The journey into this started with a question on the JDK Mission Control slack channel, which led me into a rabbit hole:

Could I have a question regarding JFR timestamps? (working with Linux) Is there any difference between JFR timestamp implementation and System#nanoTime (any optimization)?

Petr Bouda

This question essentially boils down to comparing both methods’ OS time sources. We’re only considering Unix systems in the following.

Source of JFR timestamps

The JFR event time stamps are set in the JFR event constructor, which is defined in jfrEvent.hpp (and not in the Java code, as one might expect):

  JfrEvent(EventStartTime timing=TIMED) : 
    _start_time(0), _end_time(0),
    _untimed(timing == UNTIMED),
    _should_commit(false), _evaluated(false)
#ifdef ASSERT
  , _verifier()
#endif
  {
    if (!T::isInstant && !_untimed && is_enabled()) {
      set_starttime(JfrTicks::now());
    }
  }

Looking further reveals that JFRTicks calls FastUnorderedElapsedCounterSource which uses two different time sources:

FastUnorderedElapsedCounterSource::Type 
 FastUnorderedElapsedCounterSource::now() {
#if defined(X86) && !defined(ZERO)
  static bool valid_rdtsc = Rdtsc::initialize();
  if (valid_rdtsc) {
    return Rdtsc::elapsed_counter();
  }
#endif
  return os::elapsed_counter();
}

The RDTSC instruction reads the time stamp counter on x86 processors:

The time stamp counter (TSC) is a hardware counter found in all contemporary x86 processors. The counter is implemented as a 64-bit model-specific register (MSR) that is incremented at every clock cycle. The RDTSC (“read time stamp counter”) register has been present since the original Pentium.

Already because of the access method, TSC provides a low-overhead and high-resolution way to obtain CPU timing information. This traditional premise was violated when such factors as system sleep states, CPU “hotplugging”, “hibernation”, and CPU frequency scaling were introduced to the x86 lineage. This was however mainly a short abruption: in many new x86 CPUs the time stamp counter is again invariant with respect to the stability of the clock frequency. Care should be however taken in implementations that rely on this assumption.

NETBSD MANUAL

This instruction allows the OS to implement a monotonic real-time clock.

On non-x86 systems os::elapsed_counter is used, which, surprise, calls os::javaTimeNanos:

jlong os::elapsed_counter() {
  return os::javaTimeNanos() - initial_time_count;
}

Source of System.nanoTime

Now the remaining question is: Does System.nanoTime also call os::javaTimeNanos? The method is defined in the jvm.cpp:

JVM_LEAF(jlong, JVM_NanoTime(JNIEnv *env, jclass ignored))
  return os::javaTimeNanos();
JVM_END

So System.nanoTime is just a tiny wrapper around os::javaTimeNanos. So this solves the original question on non-x86 CPUs. But what about x86 CPUs?

First for Mac OS: It boils down to calling mach_absolute_time:

Returns current value of a clock that increments monotonically in tick units (starting at an arbitrary point), this clock does not increment while the system is asleep.

ApplE DEVELOPER DOCUMENTATION

Information on the implementation of this method is scarce, but source code from 2007 suggests that mach_absolute_time is RDTSC based. So there is (probably) no difference between JFR timestamps and System.nanoTime on Mac OS, regardless of the CPU architecture.

Now on Linux: Here, the used os::javaTimeNanos is implemented using clock_gettime(CLOCK_MONOTONIC, ...):

CLOCK_MONOTONIC Clock that cannot be set and represents monotonic time since some unspecified starting point.

Linux MAN PAGE

I tried to find something in the Linux Kernel sources, but they are slightly too complicated to find the solution quickly, so I had to look elsewhere. Someone asked a question on clock_gettime on StackOverflow. The answers essentially answer our question too: clock_gettime(CLOCK_MONOTONIC, ...) seems to use RDTSC.

Conclusion

JFR timestamps and System.nanoTime seem to use the same time source on all Unix systems on all platforms, as far as I understand it.

You can stop the JVM from using RDTSC by using the -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps JVM flags (thanks to Richard Startin for pointing this out). You can read Markus Grönlunds Mail on Timing Differences Between JFR and GC Logs for another take on JFR timestamps (or skip ahead):

JFR performance as it relates to event generation, which is also functional for JFR, reduce to a large extent to how quickly a timestamp can be acquired. Since everything in JFR is an event, and an event will have at least one timestamp, and two timestamps for events that represent durations, the event generation leans heavily on the clock source. Clock access latencies is therefore of central importance for JFR, maybe even more so than correctness. And historically, operating systems have varied quite a bit when it comes to access latency and resolution for the performance timers they expose.

What you see in your example is that os::elapsed_counter() (which on Windows maps to QueryPerformanceCounter() with a JVM relative epoch offset) and the rdtsc() counter are disjoint epochs, and they are treated as such in Hotspot. Therefore, attempting to compare the raw counter values is not semantically valid.

Relying on and using rdtsc() come with disclaimers and problems and is generally not recommended. Apart from the historical and performance related aspects already detailed, here is a short description of how it is treated in JFR:

JFR will only attempt to use this source if it has the InvariantTSC property, with timestamp values only treated relative to some other, more stable, clock source. Each “chunk” (file) in JFR reifies a relative epoch, with the chunk start time anchored to a stable timestamp (on Windows this is UTC nanoseconds). rdtsc() timestamps for events generated during that epoch are only treated relative to this start time during post-processing, which gives very high resolution to JFR events. As JFR runs, new “chunks”, and therefore new time epochs, are constructed, continuously, each anchored anew to a stable timestamp.

The nature of rdtsc() querying different cores / sockets with no guarantee of them having been synchronized is of course a problem using this mechanism. However, over the years, possible skews have proven not as problematic as one might think in JFR. In general, the relative relations between the recorded JFR events give enough information to understand a situation and to solve a problem. Of course, there are exceptions, for example, when analyzing low-level aspects expecting high accuracy, usually involving some correlation to some other non-JFR related component. For these situations, an alternative is to turn off rdtsc() usages in JFR using the flags: -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps. JFR will now use os::elapsed_counter() as the time source. This comes with higher overhead, but if this overhead is not deemed problematic in an environment, then this is of course a better solution.

As other have already pointed out, there have been evolution in recent years in how operating systems provide performance counter information to user mode. It might very well be that now the access latencies are within acceptable overhead, combined with high timer resolution. If that is the case, the rdtsc() usages should be phased out due to its inherent problems. This requires a systematic investigation and some policy on how to handle older HW/SW combinations – if there needs to be a fallback to continue to use rdtsc(), it follows it is not feasible to phase it out completely.

Markus Grönlund

Difference between System.currentTimeMillis and System.nanoTime

This is not directly related to the original question, but nonetheless interesting. System.currentTimeMillis is implemented using clock_gettime(CLOCK_REALTIME, ...) on all CPU architectures:

CLOCK_REALTIME System-wide realtime clock. Setting this clock requires appropriate privileges.

Linux MAN PAGE

This clock is not guaranteed to be monotonic:

CLOCK_REALTIME represents the machine’s best-guess as to the current wall-clock, time-of-day time. […] this means that CLOCK_REALTIME can jump forwards and backwards as the system time-of-day clock is changed, including by NTP.

CLOCK_MONOTONIC represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn’t affected by changes in the system time-of-day clock.

Ciro Santilli on STACKOVERFLOW

So does it make a difference? Probably only slightly, especially if you’re running shorter profiling runs. For longer runs, consider using System.nanoTime.

I hope you enjoyed coming down this rabbit hole with me and learned something about JFR internals along the way.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Using Firefox Profiler beyond the web

This blog post is the base for the second half of my upcoming talk at FOSDEM 2023 on the topic “Firefox Profiler beyond the web: Using Firefox Profiler to view Java profiling data.”

I detailed in my last blog post how the Firefox Profiler can be used to view Java profiling data:

But I’m of course not the only one who uses Firefox Profiler beyond the web, because using it has many advantages: You’re essentially getting a prototypical visualization for your data in an afternoon.

Other tools that use Firefox Profiler

There are other tools that output use Firefox Profiler for their front end. A great example is the Rust profiler samply by Markus Stange, the initial developer of Firefox Profiler:

samply is a command line CPU profiler which uses the Firefox profiler as its UI.

At the moment it runs on macOS and Linux. Windows support is planned. samply is still under development and far from finished, but works quite well already.

Give it a try:

% cargo install samply
% samply record ./your-command your-arguments
GiTHUB

Another example is the python profiler FunctionTrace:

A graphical Python profiler that provides a clear view of your application’s execution while being both low-overhead and easy to use.

FunctionTrace supports all of the useful profiling views you’re familiar with, including Stack Charts, Flame Graphs, and Call Trees, thanks to its integration with the Firefox Profiler.

There are also non-open source uses of Firefox Profiler, Luís Oliveira, for example, works on integration with Lisp:

We’re using the Firefox Profiler to debug performance issues at the Dutch Railways dispatching center.

Basic Structure

I hope I convinced you that the Firefox Profiler is really great for visualizing profiling data, even if this data comes from the world beyond web UIs. If not, please read my previous article. The main part of adapting to Firefox Profiler is to convert your data into the profiler format. The data is stored as JSON in a (optionally zipped) file and can be loaded into Firefox Profiler. See Loading in profiles from various sources for more information.

The basic structure of a tool using Firefox Profiler can be as follows, using my plugin as an example:

You have a converter from your profile format to the Firefox Profiler format. The converted file is then passed to the Firefox Profiler, either from profiler.firefox.com or a custom fork. You typically then wrap your UI and the converter, hosting both on a small webserver. This web server runs then on e.g. localhost. Hosting your own Firefox Profiler instance has two main advantages: First, you have always a tested combination of Firefox Profiler and Converter. Second, it works offline. The web server can then be embedded into a larger application, showing the UI using an embedded browser.

You can find the type definitions in the types folder of the Firefox Profiler repository. All of the following will be based on this. This part of the article will not focus on all the details, like markers, but more on a bird’s eye view of the format, so it will probably still apply with future revisions. I’ll also omit parts that are not that useful for non-web use cases. If you have any questions on the file format, feel free to ask them in the matrix channel.

The type definitions are written with flow. It is helpful to read its documentation if you want to understand the intricacies. But for now, it should suffice to know that x?: type means that the property x is optional and that | denotes either types.

Layout

A short interlude: The layout of Firefox Profiler consists basically of a timeline view and a methods and timing view:

The timeline allows you to select specific threads and a time slice to view the details in the detail section below the timeline.

Overview

The following shows the main components of the profile format, omitting and summarizing many properties. This diagram should give a rough overview of what comes next:

Profile

The topmost level of a profile is the Profile type:

type Profile = {|
  meta: ProfileMeta,    // meta information, like application name
  libs: Lib[],          // used shared native libraries
  ...
  counters?: Counter[], // CPU and memory counters
  ...
  threads: Thread[],    // samples per thread
  ...
|};

A profile consists of the metadata, shared libraries, CPU and memory counters, and the rest of the data per thread.

ProfileMeta

A profile can have lots of metadata shown in the UI. The ProfileMeta type specifies this:

type ProfileMeta = {|
  // The interval at which the threads are sampled.
  interval: Milliseconds,
  // The number of milliseconds since midnight January 1, 1970 GMT.
  startTime: Milliseconds,
  // The number of milliseconds since midnight January 1, 1970 GMT.
  endTime?: Milliseconds,
  ...
  // The list of categories as provided by the platform. The categories are present for
  // all Firefox profiles, but imported profiles may not include any category support.
  // The front-end will provide a default list of categories, but the saved profile
  // will not include them.
  categories?: CategoryList,
  // The name of the product, most likely "Firefox".
  product: 'Firefox' | string,
  ...
  // Arguments to the program (currently only used for imported profiles)
  arguments?: string,
  ...
  // The amount of logically available CPU cores for the program.
  logicalCPUs?: number,
  ...
  symbolicated?: boolean,  // usually false for imported profiles
  symbolicationNotSupported?: boolean, // usually true for imported profiles
  // symbolication is usually not important for imported and converted profiles
  ...
  // Profile importers can optionally add information about where they are imported from.
  // They also use the "product" field in the meta information, but this is somewhat
  // ambiguous. This field, if present, is unambiguous that it was imported.
  importedFrom?: string,

  // The following are settings that are used to configure the views for
  // imported profiles, as some features do not make sense for them

  // Do not distinguish between different stack types?
  usesOnlyOneStackType?: boolean, // true in our use case
  // Hide the "implementation" information in the UI (see #3709)?
  doesNotUseFrameImplementation?: boolean, // true in our use case
  // Hide the "Look up the function name on Searchfox" menu entry?
  sourceCodeIsNotOnSearchfox?: boolean,    // true in our use case
  // Indexes of the threads that are initially visible in the UI.
  // This is useful for imported profiles for which the internal visibility score
  // ranking does not make sense.
  initialVisibleThreads?: ThreadIndex[],
  // Indexes of the threads that are initially selected in the UI.
  // This is also most useful for imported profiles where just using the first thread
  // of each process might not make sense.
  initialSelectedThreads?: ThreadIndex[],
  // Keep the defined thread order
  keepProfileThreadOrder?: boolean,
|};

And there is more. It might feel overwhelming, but this data structure also allows you to tailor the profiler UI slightly to your needs.

Category

Many parts of the profile are associated with a Category and a subcategory. A category is defined as:

type Category = {|
  name: string,
  color: string,
  subcategories: string[],
|};

Categories are referenced by their index in the category list of the ProfileMeta data structure and subcategories by their index in the field of their parent category.

The categories are used to assign a color to the squares in front of the method names and give more information on every call tree node in the sidebar:

Now to the individual threads:

Thread

The thread data structure combines all information related to a single thread. There can be multiple threads per process Id. The thread with the name GeckoMain is handled differently than the others. It is the main thread that is shown in the process timeline.

type Thread = {|
  ...
  processStartupTime: Milliseconds,
  processShutdownTime: Milliseconds | null,
  registerTime: Milliseconds,
  unregisterTime: Milliseconds | null,
  ...
  name: string,
  ...
  pid: Pid,
  tid: Tid,
  ...
  // Strings for profiles are collected into a single table, and are referred to by
  // their index by other tables.
  stringTable: UniqueStringArray,
  ...
  samples: SamplesTable,
  ...
  stackTable: StackTable,
  frameTable: FrameTable,
  funcTable: FuncTable,
  resourceTable: ResourceTable,
  ...
|};

The file format stores all stack traces in a space-efficient format which the front end can handle fast. It uses an array of strings (stringTable) to store all strings that appear in the stack traces (like function names), the other data structures only refer to strings by their index in this array.

SampleS Table

This data structure associates a captured stack with a capture time and an optional weight:

/**
 * The Gecko Profiler records samples of what function was currently being executed, and
 * the callstack that is associated with it. This is done at a fixed but configurable
 * rate, e.g. every 1 millisecond. This table represents the minimal amount of
 * information that is needed to represent that sampled function. Most of the entries
 * are indices into other tables.
 */
type SamplesTable = {|
  ...
  stack: Array<IndexIntoStackTable | null>,
  time: Milliseconds[],
  // An optional weight array. If not present, then the weight is assumed to be 1.
  // See the WeightType type for more information.
  weight: null | number[],
  weightType: WeightType,  // 'samples' or 'tracing-ms'
  // CPU usage value of the current thread. Its values are null only if the back-end
  // fails to get the CPU usage from operating system.
  threadCPUDelta?: Array<number | null>,
  length: number,
|};

Filling this with data from a sampling profiler is easy, just add references to the stacks and their occurrence time. For example consider you sampled the stack A-B at 0 and A-B-C at 2, then the samples table is:

SamplesTable = {
  stack: [A-B, A-B-C], // references into the stack table
  time:  [0, 2],
  weigth: null,
  weigthType: 'samples',
  length: 2
}

Filling the threadCPUDelta property allows you to specify the CPU time a thread has used since the last sample. The Firefox Profiler uses this property to show the CPU usage curves in the timeline:

Stack Table

All stacks are stored in the stack table using a prefix array:

type StackTable = {|
  frame: IndexIntoFrameTable[],
  // Imported profiles may not have categories. In this case fill the array with 0s.
  category: IndexIntoCategoryList[],
  subcategory: IndexIntoSubcategoryListForCategory[],
  prefix: Array<IndexIntoStackTable | null>,
  length: number,
|};

Category and subcategory of a stack n gives information on the whole stack, the frame just on its topmost frame. The prefix denotes the stack related to the second-top-most frame or that this stack only has one frame if null. This allows the efficient storage of stacks.

Now consider our example from before. We could store the stack A-B-C as follows:

StackTable = {
  frame:  [A, B, C], // references into the frame table
  prefix: [1, 2, 0],
  ...
}

Frame Table

The frames themselves are stored in the frame table:

/**
 * Frames contain the context information about the function execution at the moment in
 * time. The caller/callee relationship between frames is defined by the StackTable.
 */
type FrameTable = {|
  // If this is a frame for native code, the address is the address of the frame's
  // assembly instruction,  relative to the native library that contains it.
  address: Array<Address | -1>,

  // The inline depth for this frame. If there is an inline stack at an address,
  // we create multiple frames with the same address, one for each depth.
  // The outermost frame always has depth 0.
  inlineDepth: number[],

  category: (IndexIntoCategoryList | null)[],
  subcategory: (IndexIntoSubcategoryListForCategory | null)[],
  func: IndexIntoFuncTable[],
  ...
  line: (number | null)[],
  column: (number | null)[],
  length: number,
|};

Each frame is related to a function, which is in turn stored in the FuncTable.

Func Table

The function table stores all functions with some metadata:

type FuncTable = {|
  // The function name.
  name: Array<IndexIntoStringTable>,

  // isJS and relevantForJS describe the function type. Non-JavaScript functions
  // can be marked as "relevant for JS" so that for example DOM API label functions
  // will show up in any JavaScript stack views.
  // It may be worth combining these two fields into one:
  // https://github.com/firefox-devtools/profiler/issues/2543
  isJS: Array<boolean>,
  relevantForJS: Array<boolean>,

  // The resource describes "Which bag of code did this function come from?".
  // For JS functions, the resource is of type addon, webhost, otherhost, or url.
  // For native functions, the resource is of type library.
  // For labels and for other unidentified functions, we set the resource to -1.
  resource: Array<IndexIntoResourceTable | -1>,

  // These are non-null for JS functions only. The line and column describe the
  // location of the *start* of the JS function. As for the information about which
  // which lines / columns inside the function were actually hit during execution,
  // that information is stored in the frameTable, not in the funcTable.
  fileName: Array<IndexIntoStringTable | null>,
  lineNumber: Array<number | null>,
  columnNumber: Array<number | null>,

  length: number,
|};

Resource Table

The last table I’ll show in this article is the resource table. It depends on you and what you map to it:

/**
 * The ResourceTable holds additional information about functions. It tends to contain
 * sparse arrays. Multiple functions can point to the same resource.
 */
type ResourceTable = {|
  length: number,
  ...
  name: Array<IndexIntoStringTable>,
  ...
  // 0: unknown, library: 1, addon: 2, webhost: 3, otherhost: 4, url: 5
  type: resourceTypeEnum[],
|};

This was quite a technical article, so thanks for reading till the end. I hope it helps you when you try to target the Firefox Profiler, and see you for the next blog post.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Firefox Profiler beyond the web

This blog post is the base of the first half of my upcoming talk at FOSDEM 2023 on the topic “Firefox Profiler beyond the web: Using Firefox Profiler to view Java profiling data.”

For the impatient: Take a look at my Java JFR Profiler IntelliJ plugin to easily profile your application and view JFR files directly in your IDE.

I got involved in the Firefox Profiler development, spending significant portions of my time at SAP in the last half year on it. It has been an exciting ride. I learned a lot and contributed a few features. So I was essentially developing React code when I wasn’t working on ASGST or other OpenJDK-related tools. But you may well ask the most important of all questions: Why? Why did I spend so much time on a profiler for JavaScript and the Firefox browser? It all started in August 2022…

How did I end up there?

I developed code related to the Java debugging protocol in the second quarter of 2022. I had grand plans, but it eventually did turn out to be far more complicated than expected, but that is a story for another blog post. You can read my short primer on Java debugging internals to get a glimpse of this work. During the development, I encountered a problem: How can I profile my code, especially unit tests? I had many unit tests like the following, which tested specific aspects of my code:

@Test
public void testEvaluateSwitchStatement() {
    var program = Program.parse("((= x 1) (switch (const x)" +
        "(case 1 (= v (collect 2))) (case 2 (= v (collect 1)))" +
        "))");
    var funcs = new RecordingFunctions();
    new Evaluator(vm, funcs).evaluate(program);
    assertEquals(List.of(wrap(2L)), funcs.values);
}

I wanted to use an open-source profiler, as I had no access to the paid version of IntelliJ, which includes profiling support. Multiple tools are available, but it essentially boils down to async-profiler and JMC. Both tools have their advantages and disadvantages regarding their UI, but it essentially boils down to ease of use vs. available features:

Async-profiler and its profiling visualizations are easy to use but do not have that many features. The only available visualization is flamegraphs with minimal interactivity, just zooming is supported. Flamegraphs are the bread-and-butter of profiling:

If you want more visualizations, like a tree view, timelines, or a JFR event view, you can export your profile into JFR format (or use JFR to record your profile directly) and view it in JMC. But the difference between the ease of use of both is vast: Whereas the flamegraphs of async-profiler are usable by anyone with a short introduction, using JMC has a steep learning curve, it is currently more a tool for experts to dig deep into the profiling data. This observation leads us to the first problem: There is, to my knowledge, no open-source tool that offers more visualizations than just flamegraphs and is as easy to use.

Another problem with both async-profiler and JFR is the missing integration into IDEs. I would like to just click on a button in a context menu to profile an individual test case:

Without the hassle of creating a main method that just calls this method: I want to be able to profile it by modifying the JVM options of a run configuration.

I thought I was probably not the only one with this use case who stumbled upon the two problems impeding profiling. I had some spare time in August, so I looked for ways to build a tool myself.

Building a basic version of an IDE plugin that solves the second of my two problems is relatively easy. There is already the open-source profiling plugin Panda by Rastislav Papp, on which I based mine. Panda has only a simple tree view visualization, so it does not cover my first problem with visualizations. So I still had to figure out how I could implement the UI. Implementing it directly in the IDE in Java is cumbersome, so I decided early on to use an embedded browser. I considered implementing it myself, with the help of libraries like d3-flamegraph or extending speedscope, but this proved too much work. And I had no experience in building production-ready web applications, or for that matter, React.

Advantages of Firefox Profiler

Here comes the Firefox Profiler: It might seem illogical to use it in a use case that its developers never thought about, but it has a lot going for it:

  • it has multiple visualizations and a timeline to select time slices
  • it is open-source but backed by a large company
  • it is actively developed by a small group of people
  • it has a matrix channel where you can ask a lot of questions and get great answers
  • it has a well-defined profile format which is rather extensively documented
  • its developers were open to collaborating with me, adapting Firefox Profiler for non-web use cases

It has still been a lot of work to add the necessary features for my use case, and it is an ongoing effort to integrate them into the mainline Firefox Profiler. But if the current Firefox Profiler meets all your needs UI-wise, then using it beyond the web is just a matter of writing a profiler.

Just keep in mind that you’ll have to map your data onto the profile data structure of Firefox Profiler.

My Java Profiler IntelliJ Plugin

My Java JFR profiler (GitHub) plugin is the result of all my efforts:

It uses my Firefox Profiler fork, which includes additions not yet in the upstream repository and has a modular implementation so that you can use the JFR to Firefox Profiler converter independently. The plugin supports gathering profiles using JFR and async-profiler (via ap-loader), the previous image with the different run configurations is from my plugin, and opening arbitrary JFR files (as long as they are not too large):

The plugin integrates with your IDE, navigating to a method in the source code when you double-click a method in the profile view. Shift double-click, and it shows you the code with the profiling information on the side:

Besides that, it has support for showing information on all JFR events:

The Firefox Profiler view contains a Function Table, Flame Graph, and Stack Chart view, combined with a timeline on top, so it truly solves the first problem of visualizations. And it solves the second problem, as profiling with JFR or async-profiler can’t be more accessible than clicking a single button.

The plugin is under active development and currently in the beta stage, so give it a try and feel free to create issues on GitHub. You can pick the Java JFR Profiler plugin from the JetBrains marketplace.

I will write an entire blog post covering the plugin usage in the next few weeks, so stay tuned for more information and possibly a screencast.

I’ll release another, more technical blog post in the next few days, which covers other tools that use Firefox Profiler as a visualization and the profile format. This will be the second part of my FOSDEM talk. I hope to see you there or in the stream.

This blog post and the IntelliJ Plugin are part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Writing a Profiler from Scratch: The Profiling Loop

This is the second post in the series, building a profiler from scratch using AsyncGetCallTrace: Today, we’re covering wall-clock profiling and how to collect the obtain stack traces. If you’re unfamiliar with AsyncGetCallTrace, please check out my previous article in the series here.

Our goal today is to essentially write the primary loop of a profiler and do the following every n milliseconds (where n is our chosen profiling interval):

Schema of a profiler loop

The question that comes up is: Which threads should we sample? If we only sample currently running threads, then we have cpu-time profiling, but if we sample all threads (running and idle), we have the far more popular wall-clock profiling. The difference between both types is best described by Richard Warburton and Sadiq Jaffer in their article on opsian.com:

CPU Time. There’s very little point in trying to measure what’s happening off-cpu if you’ve got a single very hot thread, as many event-loop style applications have. You may see plenty of time being spent in other threads that are waiting around performing background work but aren’t actually a bottleneck in terms of your application. So CPU Time is useful when you know you’re bottlenecked on a CPU problem.

Wallclock Time is useful for problems where CPU utilisation seems low but application throughput or latency doesn’t meet expectations. This is because threads are probably blocked on something, for example lock contention, times when you’re waiting for downstream blocking network IO to happen or running with too many threads so some of them don’t get scheduled. The trigger for looking at Wallclock time would be low CPU usage during your period of performance issues.

Solving JVM Performance Problems with Profilers: Wallclock vs CPU Time Edition

The choice of the sampling mode, therefore, depends on the actual use case.

How do we obtain the list of threads in the first place? We could use the aptly named GetAllThreads method, but we cannot use the returned thread objects to quickly get the thread ids we need for sending the threads signals. There is another possibility: Registering for ThreadStart and ThreadEnd events and collecting the threads as they are created:

void JNICALL
ThreadStart(jvmtiEnv *jvmti_env,
            JNIEnv* jni_env,
            jthread thread)

Thread start events are generated by a new thread before its initial method executes. A thread may be listed in the array returned by GetAllThreads before its thread start event is generated. It is possible for other events to be generated on a thread before its thread start event. The event is sent on the newly started thread.

JVMTI documentation

We create a class called ThreadMap for this purpose (stored in a global variable): It maps thread ids to the internal ids (because thread ids might be reused during the lifetime of a process), and the jthread (for JVMTI operations) and the pthread (for sending signals) and maps these internal ids in turn to names, so we can later display them. It has to be thread-safe, as we’re accessing it from multiple threads in parallel. So we have to create a lock for its internal data structures. This is quite simple with modern C++:

struct ValidThreadInfo {
    jthread jthread;
    pthread_t pthread;
    bool is_running;
    long id;
};

class ThreadMap {
  std::recursive_mutex m;
  std::unordered_map<pid_t, ValidThreadInfo> map;
  std::vector<std::string> names;

  public:

    void add(pid_t pid, std::string name, jthread thread) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map[pid] = ValidThreadInfo{.jthread = thread, .pthread = pthread_self(), .id = (long)names.size()};
      names.emplace_back(name);
    }

    void remove(pid_t pid) {
      const std::lock_guard<std::recursive_mutex> lock(m);
      map.erase(pid);
    }
};

Obtaining the thread id for the current thread leads to our first platform-dependent code, using a syscall on Linux and a non-posix pthread method on mac:

pid_t get_thread_id() {
  #if defined(__APPLE__) && defined(__MACH__)
  uint64_t tid;
  pthread_threadid_np(NULL, &tid);
  return (pid_t) tid;
  #else
  return syscall(SYS_gettid);
  #endif
}

Yes, we could quickly implement our profiler for other BSDs like FreeBSD, but MacOS is the only one I have at hand.

Keep in mind to also add the main thread to the collection by calling the OnThreadStart method in the OnVMInit method, which handles the start of the VM.

Now we have an up-to-date list of all threads. For wall-clock profiling, this is fine, but for cpu-time profiling, we have to obtain a list of running threads. We conveniently stored the jthread objects in our thread map, so we can now use the GetThreadState method:

bool is_thread_running(jthread thread) {
  jint state;
  auto err = jvmti->GetThreadState(thread, &state);
  // handle errors as not runnable
  return err == 0 && state != JVMTI_THREAD_STATE_RUNNABLE;
}

This leads us to a list of threads that we can sample. It is usually not a good idea to sample all available threads: With wall-clock profiling, the list of threads can be so large that sampling all threads is too costly. Therefore one typically takes a random subset of the list of available threads. Async-profiler, for example, takes 8, which we use too.

Taking a random subset is quite cumbersome in C, but C++ has a neat library function since C++11: std:shuffle. We can use it to implement the random selection:

std::vector<pid_t> get_shuffled_threads() {
  const std::lock_guard<std::recursive_mutex> lock(m);
  std::vector<pid_t> threads = get_all_threads();
  std::random_device rd;
  std::mt19937 g(rd());
  std::shuffle(threads.begin(), threads.end(), g);
  return std::vector(threads.begin(), threads.begin() + std::min(MAX_THREADS_PER_ITERATION, (int)threads.size()));
}

Be aware that we had to change the mutex to a mutex which allows recursive reentrance, as get_all_threads also acquires the mutex.

The next step is creating a sampling thread that runs the loop I described at the beginning of this article. This superseedes the itimer sampler from the previous post, as the mechanism does not support sending signals to a random subset only. The sampling thread is started in the OnVMInit handler and joined the OnAgentUnload handler. Its implementation is rather simple:

std::atomic<bool> shouldStop = false;
std::thread samplerThread;

static void sampleLoop() {
  initSampler();
  std::chrono::nanoseconds interval{interval_ns};
  while (!shouldStop) {
    auto start = std::chrono::system_clock::now();
    sampleThreads();
    auto duration = std::chrono::system_clock::now() - start;
    // takes into account that sampleThreads() takes some time
    auto sleep = interval - duration;
    if (std::chrono::seconds::zero() < sleep) {
      std::this_thread::sleep_for(sleep);
    }
  }
  endSampler(); // print the collected information
}

static void startSamplerThread() {
  samplerThread = std::thread(sampleLoop);
}

The initSampler function preallocates the traces for N=8 threads, as we decided only to sample as many at every loop. The signal handlers can later use these preallocated traces to call AsyncGetCallTrace:

const int MAX_DEPTH = 512; // max number of frames to capture

static ASGCT_CallFrame global_frames[MAX_DEPTH * MAX_THREADS_PER_ITERATION];
static ASGCT_CallTrace global_traces[MAX_THREADS_PER_ITERATION];


static void initSampler() {
  for (int i = 0; i < MAX_THREADS_PER_ITERATION; i++) {
    global_traces[i].frames = global_frames + i * MAX_DEPTH;
    global_traces[i].num_frames = 0;
    global_traces[i].env_id = env;
  }
}

The sampleThreads the function sends the signals to each thread using pthread_signal. This is why store the pthread for every thread. But how does a signal handler know which of the traces to use? And how does the sampleThreads function know when all signal handlers are finished?

It uses two atomic integer variables for this purpose:

  • available_trace: the next unused available trace index in the global_traces array
  • stored_traces: the number of traces already stored

The signal handler is then quite simple:

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  asgct(&global_traces[available_trace++], MAX_DEPTH, ucontext);
  stored_traces++;
}

The sampleThreads function has to only wait till stored_traces is as expected:

static void sampleThreads() {
  // reset the global variables
  available_trace = 0;
  stored_traces = 0;
  // get the threads to sample
  auto threads = thread_map.get_shuffled_threads();
  for (pid_t thread : threads) {
    auto info = thread_map.get_info(thread);
    if (info) {
      // send a signal to the thread
      pthread_kill(info->pthread, SIGPROF);
    }
  }
  // wait till all handlers obtained there traces
  while (stored_traces < threads.size());
  // process the traces
  processTraces(threads.size());
}

We keep the processing of the traces from the last blog post and just store the total number of traces as well as the number of failed traces:

static void processTraces(size_t num_threads) {
  for (int i = 0; i < num_threads; i++) {
    auto& trace = global_traces[i];
    if (trace.num_frames <= 0) {
      failedTraces++;
    }
    totalTraces++;
  }
}

The two global variables don’t have to be atomic anymore, as only the sampler thread modifies them. A regular profiler would of course use the traces to obtain information on the run methods, but this is a topic for the next blog post in this series, so stay tuned. You can find the source code for the project on GitHub.

This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Writing a Profiler from Scratch: Introduction

Async-profiler is undoubtedly one of the most used open-source Java profilers out there. Its most commonly used feature is sampling a given Java program’s stack traces and visualizing them as a flame graph. I would recommend reading the excellent async-profiler manual by Krzysztof Ślusarski or taking a look at my profiling playlist on YouTube if you’re new to this tool. After using the async-profiler for a while, you might wonder: How does this tool work? And, of course, if you’re someone like me: Could I write a stripped-down version to learn how it really works? I did this a few years back in university when I wanted to understand how LR(1) parser generators work: Turns out it is a lot of work, but my parser generator can output GIFs.

This blog series will take you with me on this journey, where we start with a small tool and see where curiosity leads us. It’s helpful if you know some C/C++ and have profiled Java using async-profiler before, which helps you better understand my code snippets and the profiling Jargon.

All the code will present on GitHub, and I’m happy for any suggestions, bug reports, and comments either under this blog post or on GitHub.

The aim of this series is

  • to understand how a sampling profiler works
  • to write a simple profiler, and possibly related tools, based on the AsyncGetCallTrace API
  • to have some good tooling which I use in OpenJDK tests, to make the used APIs safer and more reliable

The aim is not to write anything vaguely production ready: async-profiler has too many features to count and is battle-tested.

Come with me on this journey of knowledge, starting with a simple tool that just counts how many times AsyncGetCallTrace could not trace the stack.

The fundamental API that we will be relying on is the AsyncGetCallTrace API of the OpenJDK: This API can be used to obtain the stack trace (which methods are on the stack) for a given thread at any time. This makes it quite useful for writing accurate profilers. The API was introduced in November 2002 for Sun Studio. However, Sun removed it in January 2003 and demoted it to an internal API, but most profiling applications still use it. The following part of this post is inspired by Nitsan Wakart’s post, The Pros and Cons of AsyncGetCallTrace Profilers.

The AsyncGetCallTrace API is defined in the forte.cpp file:

typedef struct {
  jint lineno;         // BCI in the source file, or < 0 for native methods
  jmethodID method_id; // method executed in this frame
} ASGCT_CallFrame;

typedef struct {
  JNIEnv *env_id;   // Env where trace was recorded
  jint num_frames;  // number of frames in this trace, < 0 gives us an error code
  ASGCT_CallFrame *frames; // recorded frames 
} ASGCT_CallTrace; 

void AsyncGetCallTrace(ASGCT_CallTrace *trace, // pre-allocated trace to fill
                       jint depth,             // max number of frames to walk
                       void* ucontext);        // signal context

One typically uses this API by pinging a thread using a signal, stopping the thread, and invoking the signal handler, which in turn calls AsyncGetCallTrace with the execution context of the stopped thread (the ucontext) so that AsyncGetCallTrace can walk the thread, skipping all C/C++ frames on the stack and only storing the native (from native methods) and Java frames in the frames array. The signal handler has to process the trace, but this is for another post. We will just store the number of times that AsyncGetCallTrace was successful and unsuccessful.

Be aware that we cannot allocate any memory outside the stack in a signal handler. For this reason, we have to preallocate the data structures for AsyncGetCallTrace. There are a few C library functions that are guaranteed to be signal safe. To learn more about signals, consider reading Introduction To Unix Signals Programming, or similar sources. But don’t be confused with the terms thread and process. In Unix (Mac OS, Linux, BSD), threads are processes that share the same address space. Every (non-virtual/Loom) thread in Java is backed by an OS thread.

As an example of calling AsyncGetCallTrace, consider profiling the following Java code:

public class BasicSample {

    public void waitForever() throws InterruptedException {
        System.out.print("Waiting forever...");
        for (int i = 0; i < 100; i++) {
            Thread.sleep(10);
            System.out.print(".");
        }
        System.out.println("done");
    }

    public static void main(String[] args) throws InterruptedException {
        new BasicSample().waitForever();
    }
}

During profiling, we call AsyncGetCallTrace often, but let’s visualize a trace when the JVM runs one of the println lines.

AsyncGetCallTrace on a small example, using the demo code for JEP 435

Interrupting a thread at any point and running some code can have stability implications. If you’re interested in these when using AsyncGetCallTrace, head over to the async-profiler manual, where I co-wrote the section on this topic. This small sample tool seems to be quite good at triggering a specific fault in JDK 19+36, run the tool of this blog post yourself to find it.

The signal handler in our small example is the following:

std::atomic<size_t> failedTraces = 0;
std::atomic<size_t> totalTraces = 0;

static void signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
  const int MAX_DEPTH = 512; // max number of frames to capture
  static ASGCT_CallFrame frames[MAX_DEPTH];
  ASGCT_CallTrace trace;
  trace.frames = frames;
  trace.num_frames = 0;
  trace.env_id = env; // we obtained this via the OnVMInit hook
  
  // call AsyncGetCallTrace
  asgct(&trace, MAX_DEPTH, ucontext);

  // process the results
  totalTraces++;
  if (trace.num_frames < 0) {
    failedTraces++;
  }
}

We use atomic variables here to increment the two counting variables in parallel. We cannot use any locks as creating locks is not signal-safe.

You see in line 13 that we cannot call AsyncGetCallTrace directly, as it is not exported in any JVM header. So we have to obtain the pointer to this function via dlsym at the beginning, which is a bit brittle:

static void initASGCT() {
  asgct = reinterpret_cast<ASGCTType>(dlsym(RTLD_DEFAULT, "AsyncGetCallTrace"));
  if (asgct == NULL) {
    fprintf(stderr, "=== ASGCT not found ===\n");
    exit(1);
  }
}

Additionally, we have to copy the declarations for ASGCT_CallFrame, ASGCT_CallTrace, and AsyncGetCallTrace into our project.

After writing a signal handler, we must use some mechanism to create signals. There are multiple ways, like perf or using a thread that signals all threads every few milliseconds, but we’ll use the most straightforward option which is a timer:

static bool startITimerSampler() {
  time_t sec = interval_ns / 1000000000;
  suseconds_t usec = (interval_ns % 1000000000) / 1000;
  struct itimerval tv = {{sec, usec}, {sec, usec}};

  // ...

  if (setitimer(ITIMER_PROF, &tv, NULL) != 0) {
    return false;
  }
  return true;
}

Our code uses the timers in PROF mode: “A profiling timer that counts both processor time used by the process, and processor time spent in system calls on behalf of the process. This timer sends a SIGPROF signal to the process when it expires.” (see gnu.org) The result is roughly similar to the CPU event of the async-profiler. It is inaccurate, but we’ll tackle this problem in another blog post.

You can find the final code in the GitHub repo as libSmallProfiler.cpp. It includes all the boiler-plate code for JVMTI agents that I omitted in this blog post for brevity. Feel free to file issues or PRs with improvements or suggestions there. When we finally run the tool with a JVM and the example Java program, we get the following output via java -agentpath:cpp/libSmallProfiler.dylib=interval=0.001s -cp samples BasicSample:

Waiting forever.......................................................................................................done
Failed traces:          5
Total traces:          15
Failed ratio:       33.33%

This tool might seem to be rather useless, but one can adjust its sampling interval by specifying the interval option: This makes it quite helpful in testing the stack walking code of AsyncGetCallTrace.

I hope I didn’t frighten you too much with all this Unix C, and hopefully see you again in around two weeks for my next blog post in which we create a tool that outputs a method list.

The code of this blog post is based on the libAsyncGetCallTraceTest.cpp and the libAsyncGetStackTraceSampler.cpp of the OpenJDK. This blog series is part of my work in the SapMachine team at SAP, making profiling easier for everyone.