What we learned from scraping 1 billion webpages every month

The web is broken. We learned this the hard way. Developers tend to hack, and hacks tend to break the web. In this talk, I share what we learned about how websites fail to obey the protocols, and how developers have turned the web into a chaotic medium.


At Prisync, we have crawled a large portion of the web every day for six years. At first we approached the problem naively, but we learned our lesson through experience. Developers create workarounds and hacks all the time, but doing so has consequences that are most probably unexpected. Some of the glitches we have experienced so far:


  • websites not responding properly
  • websites returning different output for identical requests
  • websites not responding at all
  • websites not obeying HTTP at all
  • websites with broken firewall rules
  • websites served by archaic web servers that are unaware of the current state of the transfer protocol
  • websites taking advantage of vulnerabilities (a.k.a. "clever hacks")
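Glitches like the ones above mean a crawler can never trust a single request. A minimal sketch of the defensive fetching this forces on us, with a pluggable fetcher and hypothetical retry parameters (not Prisync's actual crawler code):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, backoff=0.1):
    """Call fetch(url) up to `attempts` times, backing off between tries.

    `fetch` is any callable that returns a response body or raises.
    Returns the body, or None if every attempt failed -- because sites
    that time out, reset connections, or violate HTTP outright all
    surface as exceptions somewhere in the stack.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Covers timeouts, malformed responses, broken firewalls, etc.
            time.sleep(backoff * (2 ** attempt))
    return None  # Give up: the site never responded usably.

# A fake fetcher that fails twice, then succeeds -- mimicking a site
# that responds differently to identical requests.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server reset connection")
    return "<html>ok</html>"

print(fetch_with_retries(flaky_fetch, "https://example.com", backoff=0))
```

Injecting the fetcher keeps the retry policy testable without touching the network; in production it would wrap a real HTTP client call.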

In this talk, I share examples of those "hacks" and propose some methods to keep the web healthy.


Domains:

Business & Start-Ups, Big Data, Infrastructure, Web, Data Engineering

Domain Expertise:

some

Python Skill Level:

none

Link to talk slides:

https://docs.google.com/presentation/d/1nSA0KtV1nVK7v6HKw4-EUrCGc9TCcsP06R7gm9_N8dg/edit?usp=sharing

Abstract as a tweet:

We broke the web via simple hacks. Instead of order, we caused chaos. How to fix that?

Public link to supporting material:

https://docs.google.com/presentation/d/1nSA0KtV1nVK7v6HKw4-EUrCGc9TCcsP06R7gm9_N8dg/edit?usp=sharing