# Introduction

allow me to introduce myself

# Whadya mean “interruptible”

## Have you ever had this happen?

* you start some complicated computation in NumPy (or whatever)
* you realize you made a mistake
* you press control-C
* it doesn’t stop running for several seconds

## What went wrong

* Whenever this happens, it’s a bug in a compiled-code extension.
* It didn’t call `PyErr_CheckSignals` often enough.
* This talk is about:
  - Why extensions need to do that
  - Why, today, many extensions *don’t* do that
  - How we can make it easier to do that

## Audience background check

Raise your hand if …

* you’ve written code in a compiled language
  * C specifically. Not Ada, not C++, not D, not Fortran, not Swift, not Rust, not Zig, etc etc etc.
* you have written a compiled-code extension for CPython
  * you did that using the C-API directly — no intermediate libraries or helpers
* you have written a multithreaded program
* you have written a signal handler
* you know the difference between thread-safe code and async-signal-safe code

## What happens when you press control-C

* Starts out like any other keystroke
* Reaches the _terminal line discipline_ as ASCII character `'\x03'`
* (by default) Converted to a _signal_, `SIGINT`
* Python interpreter reacts to `SIGINT` by raising `KeyboardInterrupt`
* Normal exception handling process for `KeyboardInterrupt`
  - It’s not a subclass of `Exception`, to discourage people catching it too early

## `KeyboardInterrupt` is already trouble for pure Python

- writing code that’s exception safe is already hard (`with` statements do help)
- exceptions that can be thrown _at any point_ are worse

  > If a signal handler raises an exception, the exception will be
  > propagated to the main thread and may be raised after any bytecode
  > instruction. Most notably, a KeyboardInterrupt may appear at any
  > point during execution. Most Python code, including the standard
  > library, cannot be made robust against this, and so a
  > KeyboardInterrupt (or any other exception resulting from a signal
  > handler) may on rare occasions put the program in an unexpected
  > state. (`signal` module docs)

- as usual, C makes everything harder

## How CPython deals with signals

* C-level signals can be “delivered” after _any machine instruction_
  - Like how `KeyboardInterrupt` can be thrown after any bytecode instruction, but nastier
* For safety, Python’s C-level signal handler just sets flags in the interpreter state (see the sketch below)
* The interpreter’s main loop checks those flags periodically (right after some — not all — bytecode instructions)
* When those flags are set, it calls the Python-level signal handlers
* The default *Python-level* handler for `SIGINT` throws `KeyboardInterrupt`
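To make the “just sets flags” part concrete, here is a minimal standalone C sketch of that pattern. It is not CPython's actual handler, just the textbook shape: the signal handler only stores to a `volatile sig_atomic_t` (one of the very few things that is async-signal-safe), and the main loop polls the flag at points where reacting to it is safe. The names are mine.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* The handler does exactly one thing: store to a flag whose type,
 * volatile sig_atomic_t, is safe to write from a signal handler.
 * No printf, no malloc, no Python API. */
static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int signum)
{
    (void) signum;
    got_sigint = 1;
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGINT, &sa, NULL) != 0) {
        perror("sigaction");
        return 1;
    }

    /* The "main loop": do a bounded unit of work, then check the flag
     * at a point where reacting to it is safe. */
    for (;;) {
        sleep(1);               /* stand-in for one unit of real work */
        if (got_sigint) {
            got_sigint = 0;
            printf("saw SIGINT at a safe point; cleaning up\n");
            break;
        }
    }
    return 0;
}
```

CPython's real `signal_handler` has more to do (it handles any signal, and it has to wake up a main thread that may be blocked elsewhere), but the shape is the same: the handler only records that something happened, and arbitrary code runs only at the places that poll.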
## “Delivered”? “Any machine instruction”?

It’s hard to appreciate just how nasty C-level signals are if you haven’t seen for yourself what they do at the machine level, so I’m gonna show you.

Here’s a pure Python function that takes a long time to execute.

```python
import random

def f(n=100000000):
    return [random.random() for _ in range(n)]
```

Let’s run this under the _C_ debugger. I start the interpreter, set breakpoints on a few functions I know are interesting, and then I let it run normally…

```
$ gdb /usr/bin/python3.12
[…]
(gdb) break main
Breakpoint 1 at 0x1060: file ./Programs/python.c, line 14.
(gdb) run
Starting program: /usr/bin/python3.12
[...]
Breakpoint 1, main (argc=1, argv=0x7fffffffd8a8) at ./Programs/python.c:14
14      {
(gdb) break signal_handler
Breakpoint 2 at 0x7f948f276a00: file ./Modules/signalmodule.c, line 347.
(gdb) continue
Continuing.
Python 3.12.9 (main, Feb 23 2025, 20:12:24) [GCC 14.2.1 20241221] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Now I can type at the Python interpreter as usual. I define `f()`, evaluate it, and immediately hit control-C…

```
>>> f()
^C
Program received signal SIGINT, Interrupt.
pymalloc_pool_extend (pool=..., size=1) at Objects/obmalloc.c:1361
1361        *(pymem_block **)(pool->freeblock) = NULL;
(gdb) signal SIGINT
Continuing with signal SIGINT.

Breakpoint 2, signal_handler (sig_num=2) at ./Modules/signalmodule.c:347
347     {
(gdb) backtrace
#0  signal_handler (sig_num=2) at ./Modules/signalmodule.c:347
#1  <signal handler called>
#2  pymalloc_pool_extend (pool=..., size=1) at Objects/obmalloc.c:1361
#3  pymalloc_alloc (_unused_ctx=..., state=..., nbytes=32) at Objects/obmalloc.c:1546
#4  _PyObject_Malloc (ctx=..., nbytes=32) at Objects/obmalloc.c:1564
...
```

`pymalloc_alloc` doesn’t contain code to call `signal_handler`, but somehow, `signal_handler` was called anyway. This is the black magic of signals. The operating system has:

- “preempted” normal execution of the CPython interpreter, at a completely arbitrary point (it happens to be inside pymalloc)
- forged two stack frames and adjusted the registers to make the CPU _behave as if_ `pymalloc_pool_extend` had called `signal_handler` from that arbitrary point
- and then resumed execution of the process

I need to make you _feel_ this so I’m going to show you the assembly language. `signal_handler` itself isn’t special, but…

```
(gdb) up
#1  <signal handler called>
(gdb) disassemble
Dump of assembler code for function __restore_rt:
=> <__restore_rt+0>:   mov    $0xf,%rax   # SYS_rt_sigreturn
   <__restore_rt+7>:   syscall
End of assembler dump.
(gdb) up
#2  pymalloc_pool_extend (pool=..., size=1) at Objects/obmalloc.c:1361
(gdb) disassemble ($pc-6),($pc+7)
Dump of assembler code from 0x7ffff79323b6 to 0x7ffff79323c3:
   <_PyObject_Malloc+470>:   mov    %edi,0x8(%rdx)
   <_PyObject_Malloc+473>:   mov    %ecx,0x28(%rdx)
=> <_PyObject_Malloc+476>:   movq   $0x0,(%rdi)
   <_PyObject_Malloc+483>:   jmp    <_PyObject_Malloc+412>
```

There are no call instructions! We’re going to return from the signal handler to a special C library function, which was never actually _called_, and from there to a totally arbitrary point inside pymalloc!

Who thinks it’s safe for the CPython interpreter to run the code that throws `KeyboardInterrupt` right now? Literally right now, inside pymalloc?

It’s not. And that’s why the C-level signal handler just sets flags in the interpreter state.

## Signals arriving inside compiled-code extensions

Now let’s do the same thing with NumPy instead.

```python
import numpy as np

rng = np.random.default_rng()

def f(n=1000000000):
    return rng.random(n)
```

I added another zero to `n` because this is quite a bit faster. Running under the debugger again…

```
>>> f()
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
0x00007ffff00b23a7 in random_standard_uniform_fill ()
   from .../numpy/random/_generator.cpython-312-x86_64-linux-gnu.so
(gdb) signal SIGINT
Continuing with signal SIGINT.
Thread 1 "python3" hit Breakpoint 2, signal_handler (sig_num=2) at ./Modules/signalmodule.c:347
347     {
(gdb) backtrace
#0  signal_handler (sig_num=2) at ./Modules/signalmodule.c:347
#1  <signal handler called>
#3  random_standard_uniform_fill ()
#4  __pyx_f_5numpy_6random_7_common_double_fill ()
#5  __pyx_pw_5numpy_6random_10_generator_9Generator_15random ()
#6  method_vectorcall_FASTCALL_KEYWORDS (...) at Objects/descrobject.c:427
#7  _PyObject_VectorcallTstate (...) at ./Include/internal/pycore_call.h:92
#8  PyObject_Vectorcall (...) at Objects/call.c:325
#9  _PyEval_EvalFrameDefault (...) at Python/bytecodes.c:2715
```

The signal handler’s gonna set those flags, but the next _chance_ the _interpreter_ is gonna get to check them is when we return all the way to `method_vectorcall_FASTCALL_KEYWORDS`. And it isn’t _actually_ going to check them until some time after we return to the main bytecode evaluation loop, in `_PyEval_EvalFrameDefault`.

NumPy _could_ check, by calling `PyErr_CheckSignals`. _Does_ it?

```
(gdb) break PyErr_CheckSignals
(gdb) continue

… 20 seconds later …

Thread 1 "python3" hit Breakpoint 3, PyErr_CheckSignals () at ./Modules/signalmodule.c:1761
1761    {
(gdb) backtrace
#0  PyErr_CheckSignals () at ./Modules/signalmodule.c:1761
#1  PyObject_Repr (...) at Objects/object.c:544
#2  PyFile_WriteObject (...) at ./Python/sysmodule.c:732
#4  cfunction_vectorcall_O (...) at Objects/methodobject.c:509
#5  _PyObject_VectorcallTstate (...) at ./Include/internal/pycore_call.h:92
#6  PyObject_CallOneArg (func=) at Objects/call.c:401
#7  _PyEval_EvalFrameDefault (...) at Python/bytecodes.c:593
```

It does not.

## Could CPython do anything different?

### No.

* There’s no safe way to throw an exception from inside a compiled-code module without that module’s cooperation.
* Just like there’s no safe way to do that from inside pymalloc.
* Even in a compiled language with exceptions, code has to be _designed_ to be exception safe.

## Could NumPy do anything different?

### Yes, but…

* Abstractly simple: Call `PyErr_CheckSignals` periodically in code that may run for a long time. If it returns `-1`, cancel what you were doing and return to the interpreter as quickly as possible.
* Not obvious from C-API docs that you need to do this
* Cancelling what you were doing may require significant code reorganization
* `PyErr_CheckSignals` can run arbitrary Python code
  - Yes, that means you need to hold the GIL (I haven’t looked at the free-threading work at all yet)
  - Can’t be cheating on reference counts, etc.
  - Substantial overhead to call it at all, especially if you dropped the GIL

## Example of code changes required

### Non-interruptible code

```c
static PyObject *fft(PyObject *td, PyObject *fd)
{
    PyObject *rv = 0;
    kiss_fft_state *st = 0;
    Py_buffer tb, fb;

    Py_ssize_t samples = get_buffers(td, &tb, fd, &fb);
    if (samples == (Py_ssize_t) -1) {
        return 0;
    }
    st = kiss_fft_alloc((uint32_t) samples);
    if (st == 0) {
        PyErr_NoMemory();
        goto out;
    }

    Py_BEGIN_ALLOW_THREADS
    kiss_fft(st, (kiss_fft_cpx *)tb.buf, (kiss_fft_cpx *)fb.buf, ssbase);
    Py_END_ALLOW_THREADS

    rv = fd;
    Py_INCREF(rv);
 out:
    free(st);
    PyBuffer_Release(&fb);
    PyBuffer_Release(&tb);
    return rv;
}
```

### Interruptible code

```c
static PyObject *fft(PyObject *td, PyObject *fd)
{
    int interrupted = 0;
    PyObject *rv = 0;
    kiss_fft_state *st = 0;
    Py_buffer tb, fb;

    Py_ssize_t samples = maybe_interruptible_get_buffers(td, &tb, fd, &fb);
    if (samples == (Py_ssize_t) -1) {
        return 0;
    }
    if (!(st = kiss_fft_alloc((uint32_t) samples))) {
        PyErr_NoMemory();
        goto out;
    }
    if (PyErr_CheckSignals())
        goto out;

    Py_BEGIN_ALLOW_THREADS
    interrupted = kiss_fft(st, (kiss_fft_cpx *)tb.buf, (kiss_fft_cpx *)fb.buf, ssbase);
    Py_END_ALLOW_THREADS
    if (interrupted)
        goto out;

    rv = fd;
    Py_INCREF(rv);
 out:
    free(st);
    PyBuffer_Release(&fb);
    PyBuffer_Release(&tb);
    return rv;
}
```

## That didn’t look so bad, right?

But did you notice `kiss_fft` is now returning a value? “Interrupted”?

```c
static void kf_work(kiss_fft_cpx *Fout, const kiss_fft_cpx *f,
                    const size_t fstride,
                    const struct kiss_fft_factor *factors,
                    const kiss_fft_state *st)
{
    kiss_fft_cpx *Fout_beg = Fout;
    const uint32_t p = factors->radix;
    const uint32_t m = factors->stride;
    const kiss_fft_cpx *Fout_end = Fout + p * m;

    if (m == 1) {
        do {
            *Fout = *f;
            f += fstride;
        } while (++Fout != Fout_end);
    } else {
        do {
            kf_work(Fout, f, fstride * p, factors + 1, st);
            f += fstride;
        } while ((Fout += m) != Fout_end);
    }

    Fout = Fout_beg;

    // recombine the p smaller DFTs
    switch (p) {
    case 2: kf_bfly2(Fout, fstride, st, m); break;
    case 4: kf_bfly4(Fout, fstride, st, m); break;
    default: abort();
    }
}
```

This is the main loop of the FFT implementation. It’s recursive. We need to figure out where to put calls to `PyErr_CheckSignals`, and we need to pass its return value back out…

```c
static int kf_work(kiss_fft_cpx *Fout, const kiss_fft_cpx *f,
                   const size_t fstride,
                   const struct kiss_fft_factor *factors,
                   const kiss_fft_state *st)
{
    kiss_fft_cpx *Fout_beg = Fout;
    const uint32_t p = factors->radix;
    const uint32_t m = factors->stride;
    const kiss_fft_cpx *Fout_end = Fout + p * m;

    if (m == 1) {
        do {
            *Fout = *f;
            f += fstride;
        } while (++Fout != Fout_end);
    } else {
        do {
            int rv = kf_work(Fout, f, fstride * p, factors + 1, st);
            if (rv) return rv;
            f += fstride;
        } while ((Fout += m) != Fout_end);
    }

    {
        PyGILState_STATE s = PyGILState_Ensure();
        int rv = PyErr_CheckSignals();
        PyGILState_Release(s);
        if (rv) return rv;
    }

    Fout = Fout_beg;

    // recombine the p smaller DFTs
    switch (p) {
    case 2: kf_bfly2(Fout, fstride, st, m); break;
    case 4: kf_bfly4(Fout, fstride, st, m); break;
    default: abort();
    }

    {
        PyGILState_STATE s = PyGILState_Ensure();
        int rv = PyErr_CheckSignals();
        PyGILState_Release(s);
        if (rv) return rv;
    }

    return 0;
}
```

Those changes are pretty small, but to write them, I had to understand this particular FFT implementation in _detail_. And this is a _simple_ FFT library. “KISS” stands for “Keep It Simple, Simon.” Officially.

## Overhead

…

## How can we make this better?

* Ideas I have:
  - Improve core documentation
  - Make `PyErr_CheckSignals` cheaper
  - Tools like Cython and Numba could help out
  - PyO3 could too with creative use of `async`

## Doc improvements

Here’s the documentation for `PyErr_CheckSignals` that I mentioned earlier.

> `int PyErr_CheckSignals()`
>
> This function interacts with Python’s signal handling.
>
> If the function is called from the main thread and under the main Python interpreter, it checks whether a signal has been sent to the processes and if so, invokes the corresponding signal handler. If the `signal` module is supported, this can invoke a signal handler written in Python.
>
> The function attempts to handle all pending signals, and then returns `0`. However, if a Python signal handler raises an exception, the error indicator is set and the function returns `-1` immediately (such that other pending signals may not have been handled yet: they will be on the next `PyErr_CheckSignals()` invocation).
>
> If the function is called from a non-main thread, or under a non-main Python interpreter, it does nothing and returns `0`.
>
> **This function can be called by long-running C code that wants to be interruptible by user requests (such as by pressing Ctrl-C).**

This is buried deep in the C-API docs. […]
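One way to improve the documentation would be to pair that paragraph with a short, copyable example of the pattern it is describing. Here is a sketch of what such an example might look like; `crunch`, `do_one_chunk`, and `N_CHUNKS` are hypothetical names standing in for an extension’s real entry point and its work:

```c
#include <Python.h>

#define N_CHUNKS 100000              /* placeholder: how much work there is */

static void do_one_chunk(Py_ssize_t i)
{
    (void) i;                        /* placeholder: one bounded slice of work */
}

static PyObject *
crunch(PyObject *self, PyObject *args)
{
    (void) self;
    (void) args;

    for (Py_ssize_t i = 0; i < N_CHUNKS; i++) {
        do_one_chunk(i);

        /* We still hold the GIL here, so this is safe to call.  If a
         * Python-level handler raised (e.g. KeyboardInterrupt), the
         * error indicator is already set: stop and let it propagate. */
        if (PyErr_CheckSignals() < 0) {
            return NULL;
        }
    }
    Py_RETURN_NONE;
}
```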
## Speed up `PyErr_CheckSignals`

Specifically: Make it so you _don’t_ need to hold the GIL to call it.

Right now (3.12) it looks like this (simplified just a little):

```c
int PyErr_CheckSignals(void)
{
    int status = 0;
    PyThreadState *tstate = _PyThreadState_GET();
    if (_Py_atomic_load_relaxed(&tstate->interp->ceval->gc_scheduled)) {
        /* run GC */
    }
    if (_Py_atomic_load(&is_tripped)) {
        status = /* run signal handlers */
    }
    return status;
}
```

Both “run GC” and “run signal handlers” need the GIL. But the atomic loads do _not_. That’s the whole point of an atomic load. So let’s make it be like this instead:

```c
int PyErr_CheckSignals(void)
{
    int status = 0;
    PyThreadState *tstate = _PyThreadState_GET();
    if (_Py_atomic_load_relaxed(&tstate->interp->ceval->gc_scheduled)
        || _Py_atomic_load(&is_tripped)) {
        PyGILState_STATE s = PyGILState_Ensure();
        if (_Py_atomic_load_relaxed(&tstate->interp->ceval->gc_scheduled)) {
            /* run GC */
        }
        if (_Py_atomic_load(&is_tripped)) {
            status = /* run signal handlers */
        }
        PyGILState_Release(s);
    }
    return status;
}
```

With this change, you still need to be _able_ to reclaim the GIL in order to call `PyErr_CheckSignals`, but you don’t have to do it yourself. It will do it for you, only when necessary.

----

on reaction latency:

https://www.tactuallabs.com/papers/howMuchFasterIsFastEnoughCHI15.pdf
https://link.springer.com/chapter/10.1007/978-3-319-58475-1_1
https://dl.gi.de/server/api/core/bitstreams/80cf4901-1392-48a0-a0c8-e653cb0ca021/content

no real consensus. some evidence people can notice delays of 5–10ms, particularly in direct interaction (e.g. drawing on a tablet)

for “stop what you’re doing and give me a command prompt”, 50ms or below is probably a good target
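Tying that number back to the coding pattern: since `PyErr_CheckSignals` has real overhead, checking on every iteration of a tight loop is wasteful, but checking too rarely blows the latency budget. Here is a sketch of the usual compromise, with made-up names (`process_all`, `do_one_item`) and a purely illustrative constant, assuming the loop runs with the GIL held:

```c
#include <Python.h>

/* Placeholder for one cheap unit of work (microseconds, say). */
static void do_one_item(Py_ssize_t i) { (void) i; }

/* Illustrative only: pick a value so that CHECK_EVERY iterations finish
 * in well under ~50 ms on the hardware you care about. */
#define CHECK_EVERY 4096

static int
process_all(Py_ssize_t n_items)
{
    /* Assumes the caller still holds the GIL. */
    for (Py_ssize_t i = 0; i < n_items; i++) {
        do_one_item(i);

        /* Amortize the cost of PyErr_CheckSignals over many cheap
         * iterations, while keeping the worst-case gap between checks
         * below the latency a user will notice. */
        if (i % CHECK_EVERY == 0 && PyErr_CheckSignals() < 0) {
            return -1;   /* caller cleans up and returns NULL to Python */
        }
    }
    return 0;
}
```

When per-iteration cost varies a lot, checking against a monotonic clock instead of an iteration counter bounds the gap more directly; either way, the goal is to keep the time between checks comfortably under the ~50 ms target above.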