2025-10-01, Louis Armand 1 - Est
Beyond embarrassingly parallel processing problems, data must be shared between workers for them to do something useful. This can be done by:
- sharing memory between threads, with the issue of preventing concurrent access to shared data to avoid race conditions.
- copying memory to subprocesses, with the challenge of synchronizing data whenever it is mutated.
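As a rough illustration of the first option, here is a minimal sketch using only the standard library. The names `run` and `increment` are hypothetical and chosen for this example; the point is only that every access to the shared counter has to go through a lock.

```python
import threading


def run(n_threads: int = 4, n_increments: int = 10_000) -> int:
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def increment() -> None:
        nonlocal counter
        for _ in range(n_increments):
            # The lock ensures only one thread at a time performs the
            # read-modify-write on the shared counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=increment) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


total = run()
```

Even in this tiny case the lock has to be created, shared, and taken at every access, which is the kind of complexity that grows quickly with the number of shared objects.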
In Python, threads cannot execute Python bytecode in parallel because of the GIL (global interpreter lock), so they are not an option for CPU-bound parallelism. This might change in the future with the removal of the GIL, but the usual problems of multithreading will then appear, such as having to use locks and manage their complexity. Subprocesses don't suffer from the GIL, but usually need to go through a database to share data, which is often too slow. Data structures such as HAMTs (hash array mapped tries) have been used to share data efficiently and safely in immutable form, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
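To give an idea of what "conflict-free" means, below is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is written in plain Python for illustration and is not pycrdt's API; the class `GCounter` and the worker identifiers are assumptions made for this example. Each worker mutates its own replica without any lock, and replicas converge once they have merged each other's state.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per worker, merged by element-wise max."""

    counts: dict = field(default_factory=dict)

    def increment(self, worker_id: str, amount: int = 1) -> None:
        # A worker only ever writes to its own slot, so no lock is needed.
        self.counts[worker_id] = self.counts.get(worker_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot is commutative, associative and idempotent,
        # so merges can be applied in any order and any number of times.
        for worker_id, count in other.counts.items():
            self.counts[worker_id] = max(self.counts.get(worker_id, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


# Two replicas, each updated independently by its own worker.
replica_a, replica_b = GCounter(), GCounter()
replica_a.increment("worker-a", 3)
replica_b.increment("worker-b", 2)

# Exchange state in both directions; merging again changes nothing.
replica_a.merge(replica_b)
replica_b.merge(replica_a)
replica_a.merge(replica_b)
```

After the exchange both replicas report the same value, whatever the order of the merges. Real CRDT libraries provide richer shared types than a counter, but rely on the same convergence property.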
The talk will consist of:
- an introduction to parallel processing.
- an overview of the pros and cons of multithreading and multiprocessing.
- a presentation of CRDTs, with an emphasis on the pycrdt Python library.
- an example of using CRDTs and Python's subinterpreters to achieve parallelism.
David Brochart is the main author of pycrdt, a Python library providing bindings to Yrs, the Rust port of Yjs, a popular implementation of CRDTs in JavaScript. While pycrdt is extensively used in Jupyter for real-time collaboration, it can also be used to implement distributed data structures, allowing data to be shared without the locks usually associated with multithreading.