Let’s create a Python Debugger together: FOSDEM Talk

A small addendum to the previous six parts of my journey down the Python debugger rabbit hole (part 1, part 2, part 3, part 4, part 5, and part 6).

I gave a talk on the topic of Python 3.12’s new monitoring and debugging API at FOSDEM’s Python Devroom:

Furthermore, I’m excited to announce that my talk has been accepted at PyCon Berlin this year. When I started my blog series last year, I would’ve never dreamed of speaking at a large Python conference. I’ll probably be the only OpenJDK developer there, but I’m looking forward to meeting many new people from a different community.

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone.

Finding all used Classes, Methods and Functions of a Python Module

Another blog post in which I use sys.settrace, this time to solve a real problem.

When working with new modules, it is sometimes beneficial to get a glimpse of which entities of a module are actually used. I wrote something comparable in my blog post Instrumenting Java Code to Find and Handle Unused Classes, but this time, I need it in Python and with method-level granularity.

TL;DR

Download trace.py from GitHub and use it to print a call tree and a list of used methods and classes to the error output:

import trace
trace.setup(r"MODULE_REGEX", print_location_=True)

Implementation

This could be a hard problem, but it isn’t when we use sys.settrace to set a handler for every method and function call, reapplying the knowledge we gained in my Let’s create a debugger together series to develop a small utility.

There are essentially six different types of functions (this sample code is on GitHub):

def log(message: str):
    print(message)


class TestClass:
    # static initializer of the class
    x = 100

    def __init__(self):
        # constructor
        log("instance initializer")

    def instance_method(self):
        # instance method, self is bound to an instance
        log("instance method")

    @staticmethod
    def static_method():
        log("static method")

    @classmethod
    def class_method(cls):
        log("class method")


def free_function():
    log("free function")

This is important because we have to handle each of these cases differently in the following sections. But first, let’s define a few helpers and configuration variables:

indent = 0
module_matcher: str = ".*"
print_location: bool = False

We also want to print a method call tree, so we use indent to track the current indentation level. The module_matcher is the regular expression that we use to decide whether to consider a module, its classes, and its methods; it could, e.g., be __main__ to only consider the main module. The print_location flag tells us whether to print the path and line location for every element in the call tree.
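
For example, assuming a hypothetical package called mypackage, we could trace it together with all of its submodules like this:

import trace

# hypothetical package name: match "mypackage" and any "mypackage.*" submodule
trace.setup(r"mypackage(\..*)?$", print_location_=True)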

Now to the main helper class:

# imports used throughout trace.py
import atexit
import inspect
import re
import sys
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from types import CodeType, FrameType
from typing import Dict, Set


def log(message: str):
    print(message, file=sys.stderr)


STATIC_INIT = "<static init>"

@dataclass
class ClassInfo:
    """ Used methods of a class """
    name: str
    used_methods: Set[str] = field(default_factory=set)

    def print(self, indent_: str):
        log(indent_ + self.name)
        for method in sorted(self.used_methods):
            log(indent_ + "  " + method)

    def has_only_static_init(self) -> bool:
        # compare without mutating the set (set.pop() would remove the entry)
        return self.used_methods == {STATIC_INIT}

used_classes: Dict[str, ClassInfo] = {}
free_functions: Set[str] = set()

The ClassInfo stores the used methods of a class. We store the ClassInfo instances of the used classes and the names of the free functions in global variables.
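
The get_class_info helper used below returns the ClassInfo for a class name, creating it on first use; a minimal version could look like this (the version on GitHub might differ slightly):

def get_class_info(class_name: str) -> ClassInfo:
    """ Get or create the ClassInfo instance for the given class name """
    if class_name not in used_classes:
        used_classes[class_name] = ClassInfo(class_name)
    return used_classes[class_name]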

Now to our call handler, which we pass to sys.settrace:

def handler(frame: FrameType, event: str, *args):
    """ Trace handler that prints and tracks called functions """
    # find module name
    module_name: str = mod.__name__ if (
        mod := inspect.getmodule(frame.f_code)) else ""

    # get name of the code object
    func_name = frame.f_code.co_name

    # check that the module matches the defined regexp
    if not re.match(module_matcher, module_name):
        return
    
    # keep indent in sync
    # this is the only reason why we need
    # the return events and use an inner trace handler
    global indent
    if event == 'return':
        indent -= 2
        return
    if event != "call":
        return

    # insert the current function/method
    name = insert_class_or_function(module_name, func_name, frame)

    # print the current location if necessary
    if print_location:
        do_print_location(frame)
    
    # print the current function/method
    log(" " * indent + name)

    # keep the indent in sync
    indent += 2

    # return this as the inner handler to get
    # return events
    return handler


def setup(module_matcher_: str = ".*", print_location_: bool = False):
    # ...
    sys.settrace(handler)

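The elided part of setup presumably just stores the two arguments in the global configuration variables defined above before installing the handler; a minimal sketch:

def setup(module_matcher_: str = ".*", print_location_: bool = False):
    """ Set up the tracer; a sketch of the elided body """
    global module_matcher, print_location
    module_matcher = module_matcher_
    print_location = print_location_
    sys.settrace(handler)
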
Now, we “only” have to get the name for the code object and collect it properly in either a ClassInfo instance or the set of free functions. The base case is easy: When the current frame contains a local variable self, we probably have an instance method, and when it contains a cls variable, we have a class method.

def insert_class_or_function(module_name: str, func_name: str,
                             frame: FrameType) -> str:
    """ Insert the code object and return the name to print """
    if "self" in frame.f_locals or "cls" in frame.f_locals:
        return insert_class_or_instance_function(module_name,
                                                 func_name, frame)
    # ...

def insert_class_or_instance_function(module_name: str,
                                      func_name: str,
                                      frame: FrameType) -> str:
    """
    Insert the code object of an instance or class function and
    return the name to print
    """
    class_name = ""

    if "self" in frame.f_locals:
        # instance method
        class_name = frame.f_locals["self"].__class__.__name__

    elif "cls" in frame.f_locals:
        # class method
        class_name = frame.f_locals["cls"].__name__
        # we prefix the class method name with "<class>"
        func_name = "<class>" + func_name
    
    # add the module name to class name
    class_name = module_name + "." + class_name
    get_class_info(class_name).used_methods.add(func_name)
    
    # return the string to print in the class tree
    return class_name + "." + func_name

But how about the other three cases? We use the header line of a method to distinguish between them:

class StaticFunctionType(Enum):
    INIT = 1
    """ static init """
    STATIC = 2
    """ static function """
    FREE = 3
    """ free function, not related to a class """


def get_static_type(code: CodeType) -> StaticFunctionType:
    file_lines = Path(code.co_filename).read_text().split("\n")
    line = code.co_firstlineno
    header_line = file_lines[line - 1]
    if "class " in header_line:
        # e.g. "class TestClass"
        return StaticFunctionType.INIT
    if "@staticmethod" in header_line:
        return StaticFunctionType.STATIC
    return StaticFunctionType.FREE

These are, of course, just approximations, but they work well enough for a small utility used for exploration.
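
For instance, the heuristic only looks at the line at the code object’s co_firstlineno, so a static method whose @staticmethod decorator is not the first decorator would be misclassified as a free function (a contrived sketch):

def noop(f):
    return f

class Example:
    @noop            # co_firstlineno points here, so the check for
    @staticmethod    # "@staticmethod" in the header line fails
    def helper():
        pass

# get_static_type(Example.helper.__code__) would return
# StaticFunctionType.FREE instead of StaticFunctionType.STATIC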

If you know any other way that doesn’t involve using the Python AST, feel free to post in a comment below.

Using the get_static_type function, we can now finish the insert_class_or_function function:

def insert_class_or_function(module_name: str, func_name: str,
                             frame: FrameType) -> str:
    """ Insert the code object and return the name to print """
    if "self" in frame.f_locals or "cls" in frame.f_locals:
        return insert_class_or_instance_function(module_name,
                                                 func_name, frame)
    # get the type of the current code object
    t = get_static_type(frame.f_code)

    if t == StaticFunctionType.INIT:
        # static initializer, i.e., the top-level class body code;
        # func_name is actually the class name here, because the
        # class body is executed as a code object named after the class
        class_name = module_name + "." + func_name
        get_class_info(class_name).used_methods.add(STATIC_INIT)
        return class_name + "." + STATIC_INIT
    
    elif t == StaticFunctionType.STATIC:
        # @staticmethod
        # the qualname is in our example TestClass.static_method,
        # so we have to drop the last part of the name to get
        # the class name
        class_name = module_name + "." + frame.f_code.co_qualname[
                                         :-len(func_name) - 1]
        # we prefix static class names with "<static>"
        func_name = "<static>" + func_name
        get_class_info(class_name).used_methods.add(func_name)
        return class_name + "." + func_name
 
    free_functions.add(frame.f_code.co_name)
    return module_name + "." + func_name
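
As a quick sanity check of the co_qualname slicing (this requires Python 3.11+, where code objects carry the co_qualname attribute):

code = TestClass.static_method.__code__
assert code.co_name == "static_method"
assert code.co_qualname == "TestClass.static_method"
assert code.co_qualname[:-len(code.co_name) - 1] == "TestClass"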

The final thing left to do is to register a teardown handler to print the collected information on exit:

def teardown():
    """ Teardown the tracer and print the results """
    sys.settrace(None)
    log("********** Trace Results **********")
    print_info()


# trigger teardown on exit
atexit.register(teardown)
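
The print_info function referenced above is not shown in this post; judging from the output below, a simplified sketch could look like this (the version on GitHub may differ):

def print_info():
    """ Print the collected classes and free functions (simplified sketch) """
    log("Used classes:")
    log("  only static init:")
    for name in sorted(used_classes):
        if used_classes[name].has_only_static_init():
            used_classes[name].print("   ")
    log("  not only static init:")
    for name in sorted(used_classes):
        if not used_classes[name].has_only_static_init():
            used_classes[name].print("   ")
    log("Free functions:")
    for name in sorted(free_functions):
        log("  " + name)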

Usage

We now prefix our sample program from the beginning with

import trace

trace.setup(r"__main__")

to collect all information for the __main__ module, i.e., the file passed directly to the Python interpreter.

We append some code to our program to call all methods/functions:

def all_methods():
    log("all methods")
    TestClass().instance_method()
    TestClass.static_method()
    TestClass.class_method()
    free_function()


all_methods()

Our utility library then prints the following upon execution:

standard error:

    __main__.TestClass.<static init>
    __main__.all_methods
      __main__.log
      __main__.TestClass.__init__
        __main__.log
      __main__.TestClass.instance_method
        __main__.log
      __main__.TestClass.<static>static_method
        __main__.log
      __main__.TestClass.<class>class_method
        __main__.log
      __main__.free_function
        __main__.log
    ********** Trace Results **********
    Used classes:
      only static init:
      not only static init:
       __main__.TestClass
         <class>class_method
         <static init>
         <static>static_method
         __init__
         instance_method
    Free functions:
      all_methods
      free_function
      log

standard output:

    all methods
    instance initializer
    instance method
    static method
    class method
    free function

Conclusion

This small utility uses the power of sys.settrace (and some string processing) to find the classes, methods, and functions used in a module, as well as the call tree. It is pretty helpful when trying to grasp the inner structure of a module and to see which of the module’s entities are used transitively by your own application code.

I published this code under the MIT license on GitHub, so feel free to improve, extend, and modify it. Come back in a few weeks to see why I actually developed this utility…

This article is part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone.

Let’s create a Python Debugger together: PyData Talk

A small addendum to the previous five parts of my journey down the Python debugger rabbit hole (part 1, part 2, part 3, part 4, and part 5).

I gave a talk on this topic, based on my blog posts, at PyData Karlsruhe:

You can find all the source code of the demos here. It was a great pleasure giving this talk, and the audience received it well.

This might be the end of my journey into Python debuggers, but I feel some untold topics are out there. So, if you have any ideas, feel free to comment. See you in my next blog post and possibly at the next Python conference that accepts my talk proposal.

The presentation was part of my work in the SapMachine team at SAP, making profiling and debugging easier for everyone.

Let’s create a Python Debugger together: Tiny Addendum (exec and __name__)

A small addendum to the previous four parts of my journey down the Python debugger rabbit hole (part 1, part 2, part 3, and part 4).

I tried the debugger I finished last week on a small sample application for my upcoming talk at the PyData Südwest meet-up, and it failed. The problem is related to running the file passed to the debugger. Consider that we debug the following program:

def main():
    print("Hi")

if __name__ == "__main__":
    main()

We now set a breakpoint in the main function when starting the debugger and then continue with the execution of the program. The problem: it never hits the breakpoint. But why? Because the main function is never called.

The cause of this problem is that the __name__ variable is set to dbg2.py (the file containing the code that compiles and runs the script). But how do we run the script? We use the following (based on a Real Python article):

_globals = globals().copy()

# ...

class Dbg:
  
   # ...   
    
   def run(self, file: Path):
        """ Run a given file with the debugger """
        self._main_file = file
        # see https://realpython.com/python-exec/#using-python-for-configuration-files
        compiled = compile(file.read_text(), filename=str(file), mode='exec')
        sys.argv.pop(0)
        sys.breakpointhook = self._breakpoint
        self._process_compiled_code(compiled)
        exec(compiled, _globals)

This code uses the built-in compile function to compile the code, passing the program file’s name so that the resulting code object is associated with the debugged file.

The mode argument specifies what kind of code must be compiled; it can be 'exec' if source consists of a sequence of statements, 'eval' if it consists of a single expression, or 'single' if it consists of a single interactive statement (in the latter case, expression statements that evaluate to something other than None will be printed).

Python documentation on the mode argument of the compile function

We then remove the first command-line argument, because when running under the debugger it is the path of the debugged file, and run some post-processing on the compiled code object; this is the reason why we can’t just use eval. Finally, we use exec to execute the compiled code with the global variables that we had before creating the Dbg class and the other definitions.

The problem is that exec doesn’t set the import-related module attributes, such as __name__ and __file__, properly. So we have to emulate them by adding global variables:

        exec(compiled, _globals | 
                      {"__name__": "__main__", "__file__": str(file)})

Of course, it makes sense that exec behaves this way, as it is normally used to evaluate code in the current context.
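
To see the effect in isolation, here is a minimal sketch (the file name is made up) of what code run via exec observes:

# code run via exec sees the globals we pass in, not fresh module
# attributes for the executed file
payload = 'print("__name__ is", __name__)'
code = compile(payload, filename="debugged_script.py", mode="exec")

exec(code, globals().copy())   # prints the *caller's* __name__

# emulating what the import machinery would normally set up
exec(code, globals().copy() | {"__name__": "__main__",
                               "__file__": "debugged_script.py"})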

With this fixed, it is possible to debug normal applications like the line counter that I use in my upcoming talk on the 16th of November in Karlsruhe.

I hope you liked this short addendum and see you next time with a blog post on something more Java-related.

Let’s create a Python Debugger together: Part 4 (Python 3.12 edition)

The fourth part of my journey down the Python debugger rabbit hole (part 1, part 2, and part 3).

In this article, we’ll be looking into how changes introduced in Python 3.12 can help us with one of the most significant pain points of our current debugger implementation: the Python interpreter essentially calls our callback at every line of code, regardless of whether we have a breakpoint in the currently running method. But why is this the case?

Continue reading

Let’s create a Python Debugger together: Part 3 (Refactoring)

This is the necessary third part of my journey down the Python debugger rabbit hole; if you’re new to the series, please take a look at part 1 and part 2 first.

I promised in the last part of this series that I’d show you how to use the new Python APIs. However, some code refactoring is necessary before I can finally proceed. The implementation in dbg.py mixes the sys.settrace-related code with code that can be reused for other debugging implementations. So, this is a short blog post covering the result of the refactoring. The code can be found in dbg2.py.

Continue reading

Let’s create a Python Debugger together: Part 2

The second part of my journey down the Python debugger rabbit hole.

In this blog post, we extend and fix the debugger we created in part 1: We add the capability to

  • single-step through code, stepping over lines, into function calls, and out of functions,
  • and add conditions to breakpoints.

You can find the resulting MIT-licensed code on GitHub in the python-dbg repository in the file dbg.py. I added a README so you can glance at how to use it.

Continue reading