PyCon UK 2019

Managing Big Data in Machine Learning projects
2019-09-16 , Ferrier Hall

My talk will focus on Version Control Systems (VCS) for big-data projects. With the advent of Machine Learning (ML), development teams find it increasingly difficult to manage and collaborate on projects that involve huge amounts of data and ML models in addition to source code.


My talk will help the audience understand the importance of having a Version Control System for Big Data and Machine Learning (ML) models that goes hand in hand with the corresponding source code. This makes it much easier for development teams to scale while keeping the production pipeline agile. For example, the ML team can test new ML models on the infrastructure developed by the software team by training on new data sets uploaded by the data team. Each team needs its own VCS, and these systems need to work well together. Version control of projects at this level of complexity needs to go beyond the traditional VCS for source code.
Then, my talk will introduce an example project involving Big Data and ML algorithms as a case study. Finally, my talk will focus on developing this project with DVC (https://dvc.org/), an open-source VCS for Machine Learning projects that is popular among companies in the Artificial Intelligence (AI) space.
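
To make the idea of versioning data alongside source code more concrete, here is a minimal sketch of one common approach: store each large file in a content-addressed cache and commit only a small pointer file to the source-code VCS. This is an illustration under stated assumptions, not DVC's implementation or file format; the function names (`track`, `restore`), the `.ptr.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    The pointer file (hypothetical '.ptr.json' suffix) is small enough to
    commit to the source-code VCS in place of the data itself.
    """
    digest = file_digest(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(
        json.dumps({"path": data_file.name, "sha256": digest}, indent=2)
    )
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target
```

With this layout, checking out an older commit brings back an older pointer file, and `restore` fetches the matching data version from the cache, so code and data stay in step.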


Is your proposal suitable for beginners? – yes

I am a Data Infrastructure Engineer at a self-driving startup called Oxbotica. I have been a Software/Data Engineer for 4-5 years. I have worked in various software roles using Python, including automated testing, data processing tools, algorithms and web applications. I have a dual master's degree from KTH Royal Institute of Technology, Sweden and University College London, UK. I like traveling to new places and meeting new people. I am passionate about Python projects, and I love to talk about my experience and help other people gain skills.