Writing a tiny JSON Parser

JSON is one of the foundational data formats, especially with modern REST APIs, but many existing libraries are either large or small and hard to comprehend. In this blog post, I’ll show you how to implement a tiny JSON parser that follows the official JSON grammar directly. This parser will not be the fastest or the most featureful, but it will be good enough for many use cases, where you need to parse simple JSON structures returned by a REST API. The resulting library is femtojson.

TLDR: I built femtojson, a tiny JSON library.

Usage

This library is as minimal as possible, only emitting existing List, Map, and primitive types, instead of custom wrapper classes:

import me.bechberger.util.json.JSONParser;
import java.io.IOException;
import java.util.Map;
import java.util.List;

public class Example {
    public static void main(String[] args) throws IOException {
        // Parse a simple object
        String jsonObject = "{\"name\": \"Alice\", \"age\": 30}";
        Map<String, Object> obj = (Map<String, Object>) JSONParser.parse(jsonObject);
        
        System.out.println(obj.get("name"));  // Output: Alice
        System.out.println(obj.get("age"));   // Output: 30
        
        // Parse an array
        String jsonArray = "[1, 2, 3, 4, 5]";
        List<Object> numbers = (List<Object>) JSONParser.parse(jsonArray);
        
        System.out.println(numbers.get(0)); // Output: 1
        
        // Parse nested structures
        String complexJson = "{\"items\": [1, \"two\", 3.14], \"active\": true}";
        Map<String, Object> complex = (Map<String, Object>) JSONParser.parse(complexJson);
        
        List<Object> items = (List<Object>) complex.get("items");
        System.out.println(items.get(1)); // Output: two
        
        Boolean active = (Boolean) complex.get("active");
        System.out.println(active); // Output: true
    }
}

The library also contains a pretty printer and tests to check that all edge cases are handled properly.

We start this blog post by looking at the grammar:

JSON Grammar

The grammar in McKeeman form with my own annotations and slightly reordered to make it more readable:

json   # this is the entry point to the grammar
   element

element # a JSON element is a JSON with some whitespace
        # e.g. '  1  ' is also valid JSON
    ws value ws

value   # a JSON value is either a primitive, 
        # an array, or an object
   object
   array
   string
   number
   "true"
   "false"
   "null"

object  # a JSON object is either empty ('{ }') or has members
    '{' ws '}'
    '{' members '}'

members # members is a comma separated list
    member
    member ',' members

member  # a member is '"key": value', with arbitrary whitespace 
    ws string ws ':' element

array   # an array is either empty or has elements
    '[' ws ']'
    '[' elements ']'

elements # elements is like members
    element
    element ',' elements

string  # a string is characters inside '"'
    '"' characters '"'

characters
    ""
    character characters

character
    '0020' . '10FFFF' - '"' - '\'
    '\' escape

escape
    '"'
    '\'
    '/'
    'b'
    'f'
    'n'
    'r'
    't'
    'u' hex hex hex hex

hex
    digit
    'A' . 'F'
    'a' . 'f'

number
    integer fraction exponent

integer
    digit
    onenine digits
    '-' digit
    '-' onenine digits

digits
    digit*

digit
    '0'
    onenine

onenine
    '1' . '9'

fraction
    ""
    '.' digits

exponent
    ""
    'E' sign digits
    'e' sign digits

sign
    ""
    '+'
    '-'

ws   # all supported whitespace characters (as hex codepoints)
    ""
    '0020' ws
    '000A' ws
    '000D' ws
    '0009' ws

Transforming the Grammar

We could start using this grammar directly to implement the parser, but we want to simplify it. Ideally, we want a grammar where, for every rule, we can decide between the options using as few characters as possible. With JSON, we can achieve this with only one character of so-called lookahead.

The only problematic bits are related to the concatenation of the same rule, sometimes with a character in between. Let’s rewrite the grammar, introducing the many rules into our previously pure McKeenan form:

json
   element

element
    ws value ws

value
   object
   array
   string
   number
   "true"
   "false"
   "null"

object  # a JSON object is either empty ('{ }') or has members
    '{' ws '}'
    '{' member (',' member)* '}'

member  # a member is '"key": value', with arbitrary whitespace 
    ws string ws ':' element

array   # an array is either empty or has elements
    '[' ws ']'
    '[' element (',' elements)* ']'

string  # a string is characters inside '"'
    '"' character* '"

character # essentially all non control characters excluding '"' and '\'
    '0020' . '10FFFF' - '"' - '\'
    '\' escape

escape   # the characters that can be escaped + special characters
    '"'
    '\'
    '/'
    'b'
    'f'
    'n'
    'r'
    't'
    'u' hex hex hex hex

hex     # valid hexadecimal character
    digit
    'A' . 'F'
    'a' . 'f'

number  # numbers a floating points with optional exponents
    integer fraction exponent

integer
    digit
    onenine digits
    '-' digit
    '-' onenine digits

digits
    digit
    digit digits

digit
    '0'
    onenine

onenine
    '1' . '9'

fraction
    ""
    '.' digits

exponent
    ""
    'E' sign digits
    'e' sign digits

sign
    ""
    '+'
    '-'

ws   # all supported whitespace characters (as hex codepoints)
    ""
    '0020' ws
    '000A' ws
    '000D' ws
    '0009' ws

Implementing the Rules

This grammar can now be easily implemented by creating a method for every rule:

/*
R
  'a' A
  'b' B
*/
void parseR() {
  if (current == 'a') {
    advance()
    parseA();
  } else if (current == 'b') {
    advance();
    parseB();
  }
  throw new ParseException(...);
}

The star expression can be expressed as loops:

/*
R
  A (',' A)*
*/
... parseR() {
  parseA();
  while (current == ',') {
    parseA();
  }
}

Of course, one would collect the results of the parse steps. You can read more about these so-called recursive descent parsers on Wikipedia.

A great thing about a parser that exposes individual rules is that we can use the same parser to parse subsets of the JSON grammar, e.g., only allow objects at the top level. Making the API simpler to use.

Conclusion

With some grammar engineering, we can build a grammar that is straightforward to implement as a recursive descent parser. The resulting parser library with a pretty printer is only 12KB big.

See you next week with something JDK specific on JCmd and JMX Diagnostic Beans.

This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.

Author

  • Johannes Bechberger

    Johannes Bechberger is a JVM developer working on profilers and their underlying technology in the SapMachine team at SAP. This includes improvements to async-profiler and its ecosystem, a website to view the different JFR event types, and improvements to the FirefoxProfiler, making it usable in the Java world. His work today comprises many open-source contributions and his blog, where he regularly writes on in-depth profiling and debugging topics. He also works on hello-ebpf, the first eBPF library for Java. His most recent contribution is the new CPU Time Profiler in JDK 25.

    View all posts

New posts like these come out at least every two weeks, to get notified about new posts, follow me on BlueSky, Twitter, Mastodon, or LinkedIn, or join the newsletter:

Leave a Reply

Your email address will not be published. Required fields are marked *