PyCon Lithuania 2024

How to Utilize Machine Learning for Better Web Scraping
2024-04-03 , Room 228

Join Tadas Gedgaudas in an enlightening talk on revolutionizing web scraping with machine learning. Uncover how ChatGPT can adapt to website layout changes, making scraping more efficient and reducing maintenance needs. Delve into data structurization with ML, the seamless integration of ChatGPT for parsing, and its practical impact for developers.


In rule-based web scraping, the slightest change in website layout breaks the process, prompting the script overhaul to adapt to a new layout. With machine learning (ML), you don’t have to set up or readjust a dedicated parser for an individual web page. The trained model recognizes prices, descriptions, or anything it was trained to do, even after layout changes.

During his talk, Tadas Gedgaudas, a developer at Oxylabs, will share his knowledge of large language models – ChatGPT in this case – and their integration into the web scraping process.

Tadas will cover the following:
➡️ Nuances of data structurization with and without ML.
➡️ A walkthrough of getting, preparing, and submitting data to ChatGPT.
➡️ A detailed demo of combining ChatGPT with Oxylabs Web Scraper API to scrape and parse web pages without building your own tools.

The talk is an essential stepping stone for developers and decision-makers to understand how ML-enabled parsing saves time, drastically reduces maintenance, and turns any website into structured data.

For your convenience, Tadas has provided code samples of his presentation. You can access an open-source Oxy® Parser library here: https://github.com/oxylabs/OxyParser.

From the very beginning of his software development career, Tadas focused on web data extraction. In fact, his very first project was a web scraper. As a web scraping engineer, Tadas is product-minded and, one could say, obsessed with making software as performant as possible. In turn, he practices productivity tracking and even dedicates his pastime to crafting an open-source ML-powered data parsing library.