2023-10-27 –, track 1
Variant, a term once known only to researchers in the biological sciences, is now familiar to the general public. The rise of new variants of the SARS-CoV-2 virus carrying new mutations has become a concern during the COVID-19 pandemic. How do researchers identify these variants by analyzing genomics data? How can Python be used in this analysis? This talk will address these questions.
Mutations in any organism are usually identified through variant calling, an analysis step performed on next-generation sequencing data. Variant calling writes its output in a specialized format, the Variant Call Format (VCF). A VCF file carries metadata together with information on thousands of mutations and is generally large. Extracting information and identifying mutations from such a file is therefore challenging, especially when there are hundreds of samples. The Python package scikit-allel provides utilities for exploring the large-scale genetic variation data in a VCF file and helps identify important mutations in downstream analysis. This package depends on scipy, matplotlib, seaborn, pandas, scikit-learn, h5py and zarr. Once the mutations are identified, the next step is to visualize them in a meaningful way. This task may be simple for a small viral genome like that of SARS-CoV-2, but it is complicated for eukaryotic organisms with multiple chromosomes, such as mouse or human. Another Python package, QMplot, is handy for visualizing thousands of mutations on each chromosome, making the extracted mutations easier for biologists to interpret. This package uses numpy, scipy, pandas and matplotlib.
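To make the VCF layout described above concrete, here is a minimal sketch that reads a tiny VCF and keeps only confident calls. It deliberately uses only the Python standard library rather than scikit-allel (which offers its own reader, `allel.read_vcf`, for real files). The VCF content, the helper names `parse_vcf` and `passing_variants`, and the quality threshold of 30 are all assumptions made for illustration, not part of the talk.

```python
import io

# A tiny, made-up VCF used only for illustration (not real data).
VCF_TEXT = """\
##fileformat=VCFv4.2
##source=hypothetical_caller
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2
chr1\t1050\t.\tA\tG\t99.0\tPASS\tDP=40\tGT\t0/1\t1/1
chr1\t2300\t.\tC\tT\t12.5\tLowQual\tDP=6\tGT\t0/0\t0/1
chr2\t870\t.\tG\tA,C\t75.0\tPASS\tDP=31\tGT\t1/2\t0/0
"""


def parse_vcf(handle):
    """Return (meta_lines, sample_names, records) from an open VCF text handle.

    Hypothetical helper: assumes tab-separated columns and a FORMAT column
    containing only GT, which is much simpler than real-world VCF files.
    """
    meta, samples, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Metadata lines describe the file and its fields.
            meta.append(line)
        elif line.startswith("#CHROM"):
            # Header line: sample names start at the tenth column.
            samples = line.split("\t")[9:]
        else:
            fields = line.split("\t")
            records.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4].split(","),  # several ALT alleles are comma-separated
                "qual": float(fields[5]) if fields[5] != "." else None,
                "filter": fields[6],
                "genotypes": dict(zip(samples, fields[9:])),
            })
    return meta, samples, records


def passing_variants(records, min_qual=30.0):
    """Keep records marked PASS whose QUAL is at least min_qual (assumed cutoff)."""
    return [
        r for r in records
        if r["filter"] == "PASS" and r["qual"] is not None and r["qual"] >= min_qual
    ]


meta, samples, records = parse_vcf(io.StringIO(VCF_TEXT))
kept = passing_variants(records)
for r in kept:
    print(r["chrom"], r["pos"], r["ref"], ">", ",".join(r["alt"]), r["genotypes"])
```

With hundreds of samples and thousands of records, a line-by-line approach like this becomes slow and memory-hungry, which is one reason array-based tools are attractive for this kind of data.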
During this talk I will show how these Python packages can be used to analyze high-throughput genetic variation data and discuss avenues for developing new Python packages that make this analysis more efficient. Besides informing the Python community about the application of Python in genomics research, this talk will be informative for developers who want to work at the intersection of computer science and genomics.
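Plotting mutations from several chromosomes on one axis usually means laying the chromosomes end to end. The sketch below shows that layout step only, under stated assumptions: it does not use QMplot, the variant positions are made up, the helper name `genome_wide_x` is hypothetical, and the largest observed position on each chromosome stands in for the true chromosome length.

```python
from collections import defaultdict

# Hypothetical (chromosome, position) pairs; not real data.
variants = [
    ("chr1", 1050),
    ("chr1", 2300),
    ("chr2", 870),
    ("chr2", 1500),
    ("chr3", 400),
]


def genome_wide_x(variants, chrom_order):
    """Map per-chromosome positions onto one shared x-axis.

    Returns the x coordinate of every variant and a tick position
    (the midpoint of each chromosome's span) for labelling the axis.
    """
    # Assumption: the largest observed position approximates chromosome length.
    span = defaultdict(int)
    for chrom, pos in variants:
        span[chrom] = max(span[chrom], pos)

    # Each chromosome starts where the previous one ended.
    offsets, running = {}, 0
    for chrom in chrom_order:
        offsets[chrom] = running
        running += span[chrom]

    xs = [offsets[chrom] + pos for chrom, pos in variants]
    ticks = {chrom: offsets[chrom] + span[chrom] / 2 for chrom in chrom_order}
    return xs, ticks


xs, ticks = genome_wide_x(variants, ["chr1", "chr2", "chr3"])
print(xs)
print(ticks)
```

The resulting `xs` could then be passed to a scatter plot, with a per-variant score on the y-axis and `ticks` used to label each chromosome.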
Outline
Intro (5 min)
Who are we?
Introducing the concept of mutation and variants
Explaining the genetic variation data (VCF file)
Why Python for variant analysis? (10 min)
High-dimensional data
Introducing scikit-allel
Filtering and identifying mutations using scikit-allel
Visualization of mutations (10 min)
Why visualizing mutations matters for gaining meaningful insight
Introducing QMplot for mutation visualization
Explaining example plots created on publicly available data
Q&A (5 min)
Haque Ishfaq is a PhD student at McGill University and Mila - Quebec AI Institute. His research interests span machine learning, statistical learning theory, reinforcement learning and bandits. Before moving to Montreal, he enjoyed several years of sunny weather at Stanford University, where he completed his bachelor's degree in mathematical and computational science as a McCaw Scholar and his master's degree in statistics.
Atia Amin is a PhD student in the Department of Human Genetics at McGill University, where she works on bioinformatics. She completed her master's program at the University of South Dakota, where she carried out research on engineering crop plants resistant to abiotic stress. Born and raised in Bangladesh, Atia is also passionate about popularizing STEM among girls in Bangladesh and other developing countries.