Architecture for the extraction, automation and massive data processing PyCon Sweden 2021


2021-10-21 10:30–10:55, Data

Live broadcast: https://www.youtube.com/watch?v=OcgLuOs1Hrc

This talk presents a solution whose architecture integrates several components: computational resources, databases, our own Python applications and other open source ones. The idea is to show the problems and challenges posed by traditional scraping and how we have been able to build solutions that reduce them, especially when the goal is to scrape at scale and in parallel. This also means building an automated flow for post-processing and transforming the data using machine learning services such as NLP and classification.

Given the diversity of content on the web and of its formats and technologies, the talk proposes a microservice architecture built in Python that integrates a workflow with advanced scraping techniques and carries the extracted data through transformation, up to service applications for NLP and ML classification. The proposal involves Linux, PostgreSQL, Redis, MongoDB, ClickHouse and Airflow, among others, but above all in-house developments and frameworks that consider not only the extraction process but also RAM consumption, parallel processing and even blocking by websites, as well as the analysis and transformation of the data obtained.
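As a rough illustration of the parallel-processing and blocking concerns mentioned above, the sketch below caps the number of pages fetched at once and backs off when a site refuses a request. It is not the framework presented in the talk, which the abstract does not describe. Every name in it (`Blocked`, `fetch_with_retry`, `crawl`, `make_fake_fetch`) is hypothetical, and the fetcher is a stand-in that never touches the network.

```python
import asyncio
import random


class Blocked(Exception):
    """Raised by the (hypothetical) fetcher when a site refuses a request."""


async def fetch_with_retry(fetch, url, limiter, max_attempts=4, base_delay=0.01):
    """Fetch one URL under a shared concurrency cap, backing off when blocked."""
    for attempt in range(max_attempts):
        async with limiter:  # at most N pages are in flight, bounding RAM use
            try:
                return await fetch(url)
            except Blocked:
                pass
        if attempt + 1 < max_attempts:
            # Exponential backoff with jitter, slept outside the limiter so a
            # waiting task does not hold a slot that another URL could use.
            await asyncio.sleep(base_delay * 2 ** attempt * (1 + random.random()))
    return None  # still blocked after all attempts; the caller decides what to do


async def crawl(fetch, urls, max_in_flight=8):
    """Fetch all URLs concurrently and return a {url: page or None} mapping."""
    limiter = asyncio.Semaphore(max_in_flight)
    pages = await asyncio.gather(
        *(fetch_with_retry(fetch, url, limiter) for url in urls)
    )
    return dict(zip(urls, pages))


def make_fake_fetch(block_first_n=1):
    """Build a stand-in fetcher that blocks the first N requests per URL."""
    seen = {}

    async def fake_fetch(url):
        seen[url] = seen.get(url, 0) + 1
        if seen[url] <= block_first_n:
            raise Blocked(url)
        return f"<html>{url}</html>"

    return fake_fetch


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(5)]
    demo = asyncio.run(crawl(make_fake_fetch(1), demo_urls, max_in_flight=2))
    for url, page in demo.items():
        print(url, "ok" if page else "blocked")
```

In a real deployment the fetcher would be an HTTP client or a headless browser, and the resulting pages would be handed to the post-processing flow rather than kept in memory. Those parts are left out because the abstract gives no detail on them.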

Alfonso de la Guarda

CTO and Technology Architect for Veo365.com, Prix.tips, Scraprix.com, Prixlead and Machinalix.
Old School Hacker.
Background in Computer Science, Anthropology and Social Communication.
Game and low-level programmer since 1983, starting with the CBM 64 and Amiga and moving through the most important computer technologies, operating systems and programming languages.
Community Developer for Be Inc. (BeOS).
Community Developer for the OLPC Project.
Free Software and Open Source guy.
Linux fan since 1997, with implementations ranging from basic network servers to flight simulators for defense.
Technology consultant for many institutions, including the Peruvian Army, EsSalud, SISOL, SALUDPOL and the Health Ministry of Chile.
