IIPC TSS Webinar: Browser-based crawling of news websites behind paywalls
The IIPC Technical Speaker Series (TSS) facilitates knowledge sharing and fosters conversations and collaborations among IIPC members around web archiving technical work. At this webinar, László Tóth and Yves Maurer of the National Library of Luxembourg will discuss their work crawling news websites behind paywalls using browser-based crawling technologies.
The National Library of Luxembourg (BnL) is developing a set of in-house workflows to archive national news websites that are behind paywalls. The goal is to capture the contents of these websites as frequently as possible, faithfully and efficiently. Since access to the articles on these websites is restricted to subscribers, cloud-based archiving tools cannot achieve the desired results. However, the publishers of these websites have given IP-address-based access to the BnL’s web crawler, which is based on Webrecorder’s Browsertrix technology and is thus able to fetch the restricted content. The BnL’s workflows also take care of indexing the resulting WARC files and playing back the contents for review and quality assurance, using a hybrid system composed of OutbackCDX and Pywb. The entire workflow is controlled using Camunda, an industrial-strength workflow engine.
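As a rough illustration of the indexing and review step, the sketch below parses CDX index lines and picks the most recent successful capture of a page. It assumes the common 11-field CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, record length, offset, WARC filename). The sample lines, file names and helper names are hypothetical and are not taken from the BnL’s actual workflow.

```python
from typing import NamedTuple, Optional


class Capture(NamedTuple):
    urlkey: str
    timestamp: str  # 14-digit YYYYMMDDhhmmss, so string order equals time order
    original: str
    mimetype: str
    status: str
    digest: str
    length: int     # size of the compressed record in the WARC file
    offset: int     # byte offset of the record in the WARC file
    filename: str


def parse_cdx_line(line: str) -> Capture:
    """Parse one 11-field CDX line (assumed layout, see the text above)."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}: {line!r}")
    (urlkey, timestamp, original, mimetype, status, digest,
     _redirect, _meta, length, offset, filename) = fields
    return Capture(urlkey, timestamp, original, mimetype, status, digest,
                   int(length), int(offset), filename)


def latest_ok_capture(cdx_text: str) -> Optional[Capture]:
    """Return the newest capture with HTTP status 200, or None if there is none."""
    captures = [parse_cdx_line(l) for l in cdx_text.splitlines() if l.strip()]
    ok = [c for c in captures if c.status == "200"]
    return max(ok, key=lambda c: c.timestamp, default=None)


# Hypothetical index lines for a single article captured on three days.
SAMPLE_CDX = """\
lu,example)/news/article-1 20240101060000 https://example.lu/news/article-1 text/html 200 AAAA - - 1234 0 crawl-20240101.warc.gz
lu,example)/news/article-1 20240102060000 https://example.lu/news/article-1 text/html 200 BBBB - - 1300 5678 crawl-20240102.warc.gz
lu,example)/news/article-1 20240103060000 https://example.lu/news/article-1 text/html 503 CCCC - - 400 91011 crawl-20240103.warc.gz
"""

if __name__ == "__main__":
    best = latest_ok_capture(SAMPLE_CDX)
    print(best.filename, best.offset, best.timestamp)
```

In a real setup the lines would come from the index server rather than from a constant, and the selected filename and offset would point a playback or quality-assurance check at the right WARC record.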
Yves Maurer is the deputy head of the IT and digital innovation division at the National Library of Luxembourg and has been the technical lead of the Luxembourg Web Archive since 2016. He has an active role in all things digital happening at the library, including digital preservation, digital legal deposit, AI methods for enhancing the usability of digitised materials, open data and the transition to a new ILS. From 2007 onwards he was responsible for the BnL’s digitisation program and for setting up the portal of Luxembourg newspapers at eluxemburgensia.lu. In that period, he was a member of professional boards relating to digitisation at IFLA and IGeLU. Before that, he was Vice-President of Development at Atril Language Engineering in Madrid and responsible for the flagship DéjàVu Computer-Assisted Translation software. He holds an MSci in Mathematics and Computer Science from Imperial College London.
László Tóth is a software engineer involved in the development of tools related to web archiving at the National Library of Luxembourg, including web crawling, ingest workflows and playback. Previously, he worked as a developer for a European media company, specializing in software for broadcasting, media and post-production. László holds an MSc in Advanced Computing Science from the University of East Anglia (United Kingdom). Outside of software development, he is mainly interested in mathematics and classical music.