What we learned from scraping 1 billion webpages every month

The web is broken. We learned this the hard way. Developers tend to hack, and hacks tend to break the web. In this talk, I share what we learned about how websites fail to obey the protocols, and how developers have turned the web into a chaotic medium.


At Prisync, we have crawled a large portion of the web every day for six years. At first we approached the problem naively, but we learned our lesson through experience. Developers create workarounds and hacks all the time, but doing so has consequences that are most probably unexpected. Some of the glitches we have experienced so far:


  • websites not responding properly
  • websites returning different output for identical requests
  • websites not responding at all
  • websites not obeying HTTP at all
  • websites with broken firewall rules
  • websites served by archaic web servers that are unaware of the current state of the transfer protocol
  • websites taking advantage of vulnerabilities (a.k.a. "clever hacks")
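Glitches like the ones above mean a crawler can never trust a single request. A minimal sketch of the defensive fetching this forces on us, with a pluggable fetcher and hypothetical retry parameters (not Prisync's actual crawler code):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, backoff=0.1):
    """Call fetch(url) up to `attempts` times, backing off between tries.

    `fetch` is any callable that returns a response body or raises.
    Returns the body, or None if every attempt failed -- because sites
    that time out, reset connections, or violate HTTP outright all
    surface as exceptions somewhere in the stack.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Covers timeouts, malformed responses, broken firewalls, etc.
            time.sleep(backoff * (2 ** attempt))
    return None  # Give up: the site never responded usably.

# A fake fetcher that fails twice, then succeeds -- mimicking a site
# that responds differently to identical requests.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server reset connection")
    return "<html>ok</html>"

print(fetch_with_retries(flaky_fetch, "https://example.com", backoff=0))
```

Injecting the fetcher keeps the retry policy testable without touching the network; in production it would wrap a real HTTP client call.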

In this talk, I share examples of those "hacks" and propose some methods to keep the web healthy.


Domains:

Business & Start-Ups, Big Data, Infrastructure, Web, Data Engineering

Domain Expertise:

some

Python Skill Level:

none

Link to talk slides:

https://docs.google.com/presentation/d/1nSA0KtV1nVK7v6HKw4-EUrCGc9TCcsP06R7gm9_N8dg/edit?usp=sharing

Abstract as a tweet:

We broke the web via simple hacks. Instead of order, we caused chaos. How to fix that?

Public link to supporting material:

https://docs.google.com/presentation/d/1nSA0KtV1nVK7v6HKw4-EUrCGc9TCcsP06R7gm9_N8dg/edit?usp=sharing