JSON is one of the foundational data formats, especially with modern REST APIs, but many existing libraries are either large or small and hard to comprehend. In this blog post, I’ll show you how to implement a tiny JSON parser that follows the official JSON grammar directly. This parser will not be the fastest or the most featureful, but it will be good enough for many use cases, where you need to parse simple JSON structures returned by a REST API. The resulting library is femtojson.
TLDR: I built femtojson, a tiny JSON library.
Usage
This library is as minimal as possible, only emitting existing List, Map, and primitive types, instead of custom wrapper classes:
import me.bechberger.util.json.JSONParser;
import java.io.IOException;
import java.util.Map;
import java.util.List;
public class Example {
public static void main(String[] args) throws IOException {
// Parse a simple object
String jsonObject = "{\"name\": \"Alice\", \"age\": 30}";
Map<String, Object> obj = (Map<String, Object>) JSONParser.parse(jsonObject);
System.out.println(obj.get("name")); // Output: Alice
System.out.println(obj.get("age")); // Output: 30
// Parse an array
String jsonArray = "[1, 2, 3, 4, 5]";
List<Object> numbers = (List<Object>) JSONParser.parse(jsonArray);
System.out.println(numbers.get(0)); // Output: 1
// Parse nested structures
String complexJson = "{\"items\": [1, \"two\", 3.14], \"active\": true}";
Map<String, Object> complex = (Map<String, Object>) JSONParser.parse(complexJson);
List<Object> items = (List<Object>) complex.get("items");
System.out.println(items.get(1)); // Output: two
Boolean active = (Boolean) complex.get("active");
System.out.println(active); // Output: true
}
}
The library also contains a pretty printer and tests to check that all edge cases are handled properly.
We start this blog post by looking at the grammar:
JSON Grammar
The grammar in McKeeman form with my own annotations and slightly reordered to make it more readable:
json # this is the entry point to the grammar
element
element # a JSON element is a JSON with some whitespace
# e.g. ' 1 ' is also valid JSON
ws value ws
value # a JSON value is either a primitive,
# an array, or an object
object
array
string
number
"true"
"false"
"null"
object # a JSON object is either empty ('{ }') or has members
'{' ws '}'
'{' members '}'
members # members is a comma separated list
member
member ',' members
member # a member is '"key": value', with arbitrary whitespace
ws string ws ':' element
array # an array is either empty or has elements
'[' ws ']'
'[' elements ']'
elements # elements is like members
element
element ',' elements
string # a string is characters inside '"'
'"' characters '"'
characters
""
character characters
character
'0020' . '10FFFF' - '"' - '\'
'\' escape
escape
'"'
'\'
'/'
'b'
'f'
'n'
'r'
't'
'u' hex hex hex hex
hex
digit
'A' . 'F'
'a' . 'f'
number
integer fraction exponent
integer
digit
onenine digits
'-' digit
'-' onenine digits
digits
digit*
digit
'0'
onenine
onenine
'1' . '9'
fraction
""
'.' digits
exponent
""
'E' sign digits
'e' sign digits
sign
""
'+'
'-'
ws # all supported whitespace characters (as hex codepoints)
""
'0020' ws
'000A' ws
'000D' ws
'0009' ws
Transforming the Grammar
We could start using this grammar directly to implement the parser, but we want to simplify it. Ideally, we want a grammar where, for every rule, we can decide between the options using as few characters as possible. With JSON, we can achieve this with only one character of so-called lookahead.
The only problematic bits are related to the concatenation of the same rule, sometimes with a character in between. Let’s rewrite the grammar, introducing the many rules into our previously pure McKeenan form:
json
element
element
ws value ws
value
object
array
string
number
"true"
"false"
"null"
object # a JSON object is either empty ('{ }') or has members
'{' ws '}'
'{' member (',' member)* '}'
member # a member is '"key": value', with arbitrary whitespace
ws string ws ':' element
array # an array is either empty or has elements
'[' ws ']'
'[' element (',' elements)* ']'
string # a string is characters inside '"'
'"' character* '"
character # essentially all non control characters excluding '"' and '\'
'0020' . '10FFFF' - '"' - '\'
'\' escape
escape # the characters that can be escaped + special characters
'"'
'\'
'/'
'b'
'f'
'n'
'r'
't'
'u' hex hex hex hex
hex # valid hexadecimal character
digit
'A' . 'F'
'a' . 'f'
number # numbers a floating points with optional exponents
integer fraction exponent
integer
digit
onenine digits
'-' digit
'-' onenine digits
digits
digit
digit digits
digit
'0'
onenine
onenine
'1' . '9'
fraction
""
'.' digits
exponent
""
'E' sign digits
'e' sign digits
sign
""
'+'
'-'
ws # all supported whitespace characters (as hex codepoints)
""
'0020' ws
'000A' ws
'000D' ws
'0009' ws
Implementing the Rules
This grammar can now be easily implemented by creating a method for every rule:
/*
R
'a' A
'b' B
*/
void parseR() {
if (current == 'a') {
advance()
parseA();
} else if (current == 'b') {
advance();
parseB();
}
throw new ParseException(...);
}
The star expression can be expressed as loops:
/*
R
A (',' A)*
*/
... parseR() {
parseA();
while (current == ',') {
parseA();
}
}
Of course, one would collect the results of the parse steps. You can read more about these so-called recursive descent parsers on Wikipedia.
A great thing about a parser that exposes individual rules is that we can use the same parser to parse subsets of the JSON grammar, e.g., only allow objects at the top level. Making the API simpler to use.
Conclusion
With some grammar engineering, we can build a grammar that is straightforward to implement as a recursive descent parser. The resulting parser library with a pretty printer is only 12KB big.
See you next week with something JDK specific on JCmd and JMX Diagnostic Beans.
This blog post is part of my work in the SapMachine team at SAP, making profiling easier for everyone.




