PyCon LT 2022

Trojan Source Code - Can we trust open-source anymore?
2022-05-26, Python Room

Recently, a paper titled Trojan Source was published to demonstrate how a seemingly valid contribution can contain malicious code by exploiting Unicode control characters. Some of these attacks have been tested on Python, and they work. How can this happen? Should the Python and open-source communities be concerned?


Background:

Researchers at the University of Cambridge published a paper about an attack named Trojan Source, which exploits the fact that some interpreters, such as CPython, accept Unicode in source code. This has raised concerns in the open-source community about malicious contributions that look entirely legitimate to the human eye but contain invisible attacks. As members of the Python community, we should all be aware of this and understand how we can prevent such attacks from happening.

About this talk:

In this talk, Cheuk will break down the findings of this paper to a level that everyone can understand. She will start with a joke example of how you can trip someone up by using Unicode. She will then explain what Unicode is and why it causes trouble. Afterwards, she will walk through the Python examples in the paper and explain why they can be dangerous. Lastly, she will open up a discussion on how we should defend ourselves against these attacks and what we can do as a community.

Outline (30 mins talk):

5 mins - Introduction, the opening of the talk

In this section, Cheuk will ask the audience to debug a code snippet that looks absolutely fine but does not work as code. She will explain that the same concept is used in Trojan Source.
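The snippet used in the talk is not reproduced in this abstract. As a purely hypothetical illustration of the idea, the sketch below builds a two-line program in which the second `total` contains a Cyrillic letter that most fonts render identically to the Latin `a`. The variable names and the choice of character are assumptions made for this example only.

```python
# Hypothetical "looks fine, does not work" snippet (not the one from the talk).
# The second identifier contains U+0430 CYRILLIC SMALL LETTER A instead of "a".
source = (
    "total = 10\n"
    "print(tot\u0430l)\n"
)

error = None
try:
    # Run the snippet in a fresh namespace.
    exec(source, {})
except NameError as exc:
    # Python treats the two spellings as different identifiers.
    error = str(exc)

print(error)
```

Printed in an editor, both lines appear to use the same name, yet Python raises a `NameError` because the two identifiers consist of different code points.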

10 mins - What is Unicode

In this section, Cheuk will introduce Unicode: what it is, how a computer represents it, and why computers need it. She will also explain how the benefits of Unicode can become a weakness that leaves us vulnerable to the Trojan Source attack.
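To make "what it is to a computer" concrete, here is a minimal sketch using only the standard library. It shows that each character is a numbered code point with an official name, stored as one or more UTF-8 bytes. The helper `describe` and the sample characters are chosen for this illustration; the last one, U+202E, is a control character with no visible glyph.

```python
import unicodedata

def describe(ch):
    """Return the code point, official name and UTF-8 bytes of one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_bytes = ch.encode("utf-8").hex(" ")
    return f"{code_point} {unicodedata.name(ch)} -> {utf8_bytes}"

# A Latin letter, an accented letter, a CJK ideograph, and an invisible
# bidirectional control character.
for ch in ["A", "\u00e9", "\u4e2d", "\u202e"]:
    print(describe(ch))
```

The same mechanism that lets source code contain text in any script also lets it contain characters, such as the last one, that change how text is displayed without being visible themselves.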

10 mins - How Trojan Source works in Python

In this section, Cheuk will show a few examples of Trojan Source attacks embedded in legitimate-looking Python code. She will point out how the attack hides in the source code and in which cases it can be dangerous.
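The examples shown in the talk are not listed here. The sketch below is a hedged illustration of an early-return pattern of the kind the paper describes; the function, the data, and the exact character used are assumptions for this example. The control character is written as an escape sequence so that it stays visible in this text, and how the resulting line is displayed depends on the editor or review tool.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, invisible in most editors

# Build the source as a string. In a bidi-aware viewer, the text after RLI
# can be reordered so that ";return" appears to sit inside the docstring.
source = (
    "def subtract_funds(bank, account, amount):\n"
    f"    ''' Subtract funds from bank account then {RLI}''' ;return\n"
    "    bank[account] -= amount\n"
)

namespace = {}
exec(source, namespace)

bank = {"alice": 100}
namespace["subtract_funds"](bank, "alice", 30)

# The interpreter reads the logical order: docstring, then "return".
# The subtraction on the next line is never reached.
print(bank["alice"])
```

A reviewer may see what looks like a harmless docstring followed by the subtraction, while the interpreter executes a `return` before any funds are subtracted.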

5 mins - How to protect ourselves

In this section, Cheuk will open the discussion and make a few suggestions on how we can protect ourselves as a community. This will lead into the Q&A session, where the audience can weigh in with their own thoughts.
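The specific suggestions are left to the talk. One possible measure, sketched here as an assumption rather than as the speaker's recommendation, is to scan contributions for bidirectional control characters before review. The function name `find_bidi_controls` and the character set below are choices made for this example; a real project might need a broader list.

```python
import unicodedata

# Bidirectional embedding, override and isolate controls (an assumed list).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits

sample = "x = 1\ny = '\u202e'"
for lineno, col, name in find_bidi_controls(sample):
    print(f"line {lineno}, column {col}: {name}")
```

A check like this could run in continuous integration or as a pre-commit hook, so that any such character has to be justified explicitly by the contributor.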

Target audiences

Everyone from the curious to maintainers of open-source libraries. This is knowledge we should all have. Cheuk will explain it in a way that requires no prior knowledge.

What will audiences learn

About Trojan Source attacks and how they work. They may also learn how interpreters, especially Python interpreters, handle Unicode. In addition, they may gain greater awareness of security in the open-source world.


What topics define your talk the best?:

python, open source, security

Before working in Developer Relations, Cheuk was a Data Scientist at various companies, in roles that demanded strong numerical and programming skills, especially in Python. Following her passion for the tech community, Cheuk is now the Developer Relations Lead at TerminusDB, an open-source graph database. Cheuk maintains its Python client and engages with its user community daily.

Besides her work, Cheuk enjoys talking about Python on personal streaming platforms and podcasts. Cheuk has also been a speaker at universities and various conferences, and she organises events for developers. Conferences that Cheuk has organised include EuroPython (of which she is a board member), PyData Global and Pyjamas Conf. Believing in Tech Diversity and Inclusion, Cheuk regularly organises workshops and mentored sprints for minority groups. In 2021, Cheuk became a Python Software Foundation fellow.