Our devices send and receive a complex set of instructions and information every time they interact over HTTP. This mostly invisible interaction is primarily made up of the same standard set of attributes, but what oddities would we discover if we spidered 10,000,000 domains?
HTTP headers are the mostly hidden backbone of our online infrastructure. Yet for a standard designed to be consumed entirely by code and rarely seen by people, HTTP headers contain a surprising amount of geeky humour and many oddities.
Ever since reading the convoluted history of the browser user-agent and finding out that MySpace’s servers were powered by “Nerd Rage”, I’ve been curious about what other interesting histories headers have, and what easter eggs mischievous developers have hidden for others like them to find.
Join me on this deep dive into HTTP headers as I go through how I spidered 10,000,000 domains. We’ll look at the challenges of writing an efficient, concurrent HTTP spider in Python as well as some of my findings from the harvested headers.
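To give a flavour of the kind of spider described above, here is a minimal sketch that uses only the Python standard library. It is not the code from the talk: the function names `fetch_headers` and `count_header_names`, the use of `HEAD` requests, and the thread pool size are all assumptions made for illustration, and a crawl of 10,000,000 domains would likely need a more scalable design than this.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import Request, urlopen


def fetch_headers(domain, timeout=5.0):
    """Send a HEAD request to a domain and return its response headers.

    Hypothetical helper: returns a list of (name, value) pairs, or an
    empty list if the request fails for any reason.
    """
    request = Request(f"http://{domain}/", method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.getheaders()
    except (OSError, HTTPException, ValueError):
        # Timeouts, DNS failures, HTTP error statuses, and malformed
        # responses are simply skipped in this sketch.
        return []


def count_header_names(domains, fetch=fetch_headers, max_workers=32):
    """Count how many domains sent each header name (case-insensitive).

    `fetch` is injectable so the counting logic can be exercised
    without touching the network.
    """
    counts = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the fetches concurrently and yields results
        # in the same order as the input domains.
        for headers in pool.map(fetch, domains):
            # A set ensures a header repeated in one response is
            # counted once per domain.
            counts.update({name.lower() for name, _ in headers})
    return counts
```

Calling `count_header_names` with a list of domain names returns a `Counter`, so `most_common()` lists the headers sent by the most domains, and the least common entries are where the oddities are likely to hide. Threads are used here because the work is dominated by waiting on the network; an `asyncio`-based client would be another reasonable choice.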
Aaron Bassett has lived in Ireland, Scotland, Hungary, and the Netherlands. He is a recovering senior software engineer turned award-winning Developer Advocate at MongoDB.
Aaron has been working online since 2005 and has always enjoyed sharing what he has learned by organising and speaking at local meetups. He spoke at his first conference in 2013, and since then he has spoken at conferences all over the world on a range of topics. He has a passion for mentoring and has been involved with Social Innovation Camp UK, Social Innovation Camp Kosovo, Startup Weekend, Open Glasgow, DjangoGirls and Global Diversity CFP Day.