PyConDE & PyData Berlin 2024

How Python helped us uncover secrets of protein motion
04-24, 13:10–13:40 (Europe/Berlin), B05-B06

This presentation gives an overview of a scientific project focused on understanding how proteins move and function. Along the way we used a large collection of Python tools and built our own approaches on top of them. To understand living beings, including our health and the origin of diseases in humans, we have to know how proteins do what they do. Hence it is of utmost importance to understand their structure and function. Thanks to an extraordinary technique called X-ray crystallography we are able to see what proteins look like at the atomic scale, but it cannot show us how they move. The next best thing we can do is to simulate the motion of the protein with so-called molecular dynamics (MD) simulations. These simulations generate incredible amounts of data, typically hundreds of GB per microsecond of protein movement! Extracting useful and meaningful information from it is a daunting task.
We are going to show how we used many Python tools to tackle this problem in the project. Using Django to place everything in an interactive web app (https://alokomp.irb.hr/), along with Pandas, NumPy, SciPy, Dask, Jupyter, NetworkX, Bokeh, Datashader and many more under the hood, we have created an innovative way of seeing proteins move and communicate.


Proteins are one of the main building blocks of the living world. They are largely responsible for the amazing diversity that we witness in nature around us. Although proteins are composed of sequences of just 20 amino acids, nature’s clever design has endowed them with an incredibly diverse set of functions. It is not an overstatement to say that this diversity, and the myriad ways in which proteins interact with each other, is at the very heart of life. Therefore it is of utmost importance to understand their structure and function.
Proteins are very large molecules, composed of thousands or even millions of atoms connected in giant hairball-like structures. Still, they are too tiny to be seen with a conventional light microscope, even the most powerful one. That is why, in order to “see” what they look like, we shine X-rays on crystals made entirely of a single protein species, in the fascinating method of X-ray crystallography. It gives us a picture of the protein in unprecedented atomic detail.
In order to perform their function, proteins also move their parts, but unfortunately this motion is too quick to be seen by any device. X-ray crystallography alone, although mighty in the detail it provides, gives us only one static image. It is a bit like trying to tell the story of a movie just by looking at the movie poster. Therefore we have to simulate the motion of the protein with so-called molecular dynamics (MD) simulations. Basically, we give the computer the initial positions of all the atoms that we know from X-ray crystallography, then kick them and watch how the protein moves in time, in very tiny steps. This results in so-called MD trajectories, which contain the positions of all atoms at millions of steps. Needless to say, this produces very heavy data, usually hundreds of GB, that needs to be processed somehow.
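As a rough, back-of-the-envelope illustration of where such data volumes come from, the sketch below estimates the uncompressed size of a trajectory. All numbers in it (100,000 atoms, one frame saved every 10 ps, 4-byte coordinates) are illustrative assumptions, not figures from the project.

```python
def trajectory_size_gb(n_atoms, simulated_ns, save_interval_ps, bytes_per_coord=4):
    """Estimate the uncompressed size of an MD trajectory in GB (10**9 bytes).

    Assumes each saved frame stores only x, y, z for every atom.
    """
    n_frames = int(simulated_ns * 1000 / save_interval_ps)  # ns -> ps
    bytes_per_frame = n_atoms * 3 * bytes_per_coord
    return n_frames * bytes_per_frame / 1e9


# Hypothetical system: 100,000 atoms (protein plus water),
# 1 microsecond (1000 ns) simulated, one frame saved every 10 ps.
size = trajectory_size_gb(n_atoms=100_000, simulated_ns=1000, save_interval_ps=10)
print(f"{size:.0f} GB")  # prints "120 GB"
```

Under these assumptions a single microsecond already takes about 120 GB, and saving frames more often or simulating a larger system scales the size linearly.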
In the project called “Allosteric communication pathways in oligomeric enzymes” (https://alokomp.irb.hr/) we faced that very problem: how to extract information about protein movement from such enormous quantities of data? The answer, of course, was the marvelous suite of Python tools available. Python has established itself as the de facto standard programming language in data science, and with the plethora of options already available for X-ray crystallography and MD analysis it was a logical choice (not to mention its awesomeness and being our favourite anyway). The whole project shows how mature and diverse the Python ecosystem is, able to cover every single aspect of such a specialized problem.
To begin with, we centered the entire project around a website built with Django. It serves both as a front end with general information and as a web app for diving into the data. Behind it is a PostgreSQL relational database containing all the structural and derived data for a family of proteins called PNPs, which serve as a sort of proof of concept (https://alokomp.irb.hr/pdbase/structures/). It also contains data derived from MD simulations and analysed with the MDAnalysis tool (https://www.mdanalysis.org/).
It is hard to mention all the Python tools we used to analyse the data in the database. The backbone consists of the indispensable Pandas, NumPy, SciPy, Dask, Jupyter, NetworkX, Bokeh and HoloViz, to name but a few. More specifically, we developed a special approach (“avocado” plots, example: https://alokomp.irb.hr/md/avocados/1458/A) to visualize the motion of the protein as a whole in time, as a series of snapshots each containing plots of millions of points, using the awesome Datashader library (https://datashader.org). We also used the Ruptures library (https://github.com/deepcharles/ruptures) to detect changes in the positions of the protein and to detect correlations.
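To give a feel for what detecting a change in a position signal means, here is a minimal pure-Python sketch of single change-point detection by least squares. It is a simplified stand-in written for illustration only: it does not use the Ruptures API and is not the project's actual analysis. The function name `best_split` and the toy distance trace are hypothetical.

```python
def best_split(signal):
    """Return the index k (1 <= k < len(signal)) that best splits the signal.

    "Best" means the split that minimises the summed squared deviation of
    signal[:k] and signal[k:] from their own means. This brute-force search
    is O(n^2) and only meant for short, illustrative signals.
    """
    if len(signal) < 2:
        raise ValueError("need at least two samples")

    def cost(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    return min(range(1, len(signal)),
               key=lambda k: cost(signal[:k]) + cost(signal[k:]))


# Hypothetical trace: a distance between two parts of a protein that
# jumps from about 1.0 to about 3.0 (arbitrary units) at frame 5.
trace = [1.0, 1.1, 0.9, 1.0, 1.05, 3.0, 3.1, 2.9, 3.05, 3.0]
print(best_split(trace))  # prints 5
```

Dedicated libraries handle multiple change points, noise and long signals far more efficiently; the sketch only shows the underlying idea of finding where the statistics of a signal shift.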
Everything is wrapped up in the form of an interactive web app that can be used to visually browse vast amounts of data, giving a whole new perspective on highly complex multidimensional data.


Expected audience expertise: Domain

Novice

Expected audience expertise: Python

Novice

Abstract as a tweet (X) or toot (Mastodon)

Uncovering protein motion by leveraging awesome Python tools.

Public link to supporting material, e.g. videos, Github, etc.

https://alokomp.irb.hr/

Dr. Zoran Štefanić is a senior research associate and Head of the Laboratory for Chemical and Biological Crystallography at the Ruđer Bošković Institute. His main areas of expertise include chemical crystallography of small organic molecules and hydrogen-bonded networks, macromolecular crystallography, development of computer algorithms (mainly in Python), database design and web development, supported by a strong background in physics and mathematics. As the principal investigator of the ALOKOMP project, he is responsible for the overall coordination of the research and for managing all activities between team members, organizational and financial matters, as well as for the publication of results and the annual scientific and financial reports to the Croatian Science Foundation. More specifically, his tasks in this project include data collection and 3D structure determination of new enzyme structures, development of the central relational database, programming of algorithms for data extraction, development of the web server, and co-mentorship of a PhD student.

A structural biology researcher enthusiastic about crystallography, molecular dynamics, and programming.