2019-09-16, 15:00–15:30, Ferrier Hall
My talk will focus on version control systems (VCS) for big-data projects. With the rise of machine learning (ML), development teams find it increasingly difficult to manage and collaborate on projects that involve large amounts of data and ML models in addition to source code.
My talk will help the audience understand why big data and ML models need a version control system that goes hand in hand with the corresponding source code. Versioning data, models, and code together makes it easier to grow a development team while keeping the production pipeline agile. For example, the ML team can test new models on the infrastructure built by the software team, training them on new data sets uploaded by the data team. Each team needs its own version control workflow, and these workflows must work well together. Version control for projects at this level of complexity has to go beyond traditional VCS for source code.
I will start with the challenges of managing big-data projects with traditional VCS, then discuss the current version control solutions for big-data Python projects. This part will take about 10 minutes.
The remaining part of my talk will take about 15 minutes. I will introduce an example project involving big data and ML algorithms as a case study, then walk through developing it with DVC (https://dvc.org/), an open-source VCS for machine learning projects that is popular among companies in the artificial intelligence (AI) space.
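
To give a flavour of the general idea behind versioning data alongside code, here is a minimal Python sketch: the large file goes into a content-addressed cache, and only a small pointer file (the part that would be committed next to the source code) records which version is in use. This is an illustration only and is not DVC's implementation or API; the function names `snapshot` and `restore`, the `.ptr.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file: Path, cache_dir: Path) -> Path:
    """Store data_file in a content-addressed cache and write a small pointer file.

    Hypothetical helper: the pointer file is what would be committed to the
    source-code VCS instead of the large data file itself.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cache_dir / meta["sha256"], target)
    return target


def demo():
    """Snapshot two versions of a data set, then go back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        cache = root / "cache"

        data.write_bytes(b"x,y\n1,2\n")          # version 1 of the data
        pointer = snapshot(data, cache)
        v1_pointer_text = pointer.read_text()    # stands in for an old commit

        data.write_bytes(b"x,y\n1,2\n3,4\n")     # version 2 of the data
        snapshot(data, cache)

        # Simulate checking out the old commit: the old pointer file returns,
        # and restore() brings back the matching data.
        pointer.write_text(v1_pointer_text)
        restored = restore(pointer, cache)
        return restored.read_bytes(), len(list(cache.iterdir()))


if __name__ == "__main__":
    content, cached_versions = demo()
    print(content, cached_versions)
```

The design choice to note is that the source-code repository only ever sees the tiny pointer file, so its history stays small while each commit still identifies exactly one version of the data.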