IIPC-IFLA News Media Section Workshop: Browser-based Crawling of News Websites Behind Paywalls
The IFLA News Media Section and the International Internet Preservation Consortium (IIPC) are teaming up again to host a series of workshops focusing on archiving news media. As most news is now published online, there is growing interest in better understanding current best practices in web archiving. The main goal of our workshops is to examine and compare how organizations of varying sizes tackle this topic and to learn from their collective experiences. Through presentations and informal discussions, we will showcase diverse organizational approaches to archiving news media, including audiovisual content and social media, highlight key challenges, and explore innovative solutions.
Our initial event on December 4 featured a short introduction to the work of the IIPC and the IFLA News Media Section, as well as three use cases from the National Library of France, the National and University Library of Iceland, and the Library of Congress. We invite you to stay tuned for information about subsequent events that will take place in the first quarter of 2025.
The workshop will be moderated by Ana Krahmer (Director, Digital Newspaper Unit) of the University of North Texas Libraries. [TBC]
AGENDA
10:00-10:05: Introduction to IFLA News Media Section & IIPC
10:05-11:25: Presentations
11:25-11:55: Q&A with all speakers
11:55-12:00: Wrap-up
In a digital age characterized by information abundance, access to online content remains a significant challenge. Many valuable resources are hidden behind paywalls and login screens, making it difficult for researchers, archivists, and data enthusiasts to retrieve, preserve, and analyze this content. In this workshop, experts from web archives across Europe will share the knowledge and tools participants need to navigate paywalls, conduct web crawls effectively, ensure data quality, and foster ongoing communication with site owners. The following key components will be covered:
- Accessing Paywalled Content:
  - Techniques to bypass paywalls and access restricted websites
  - Negotiating with newspapers and publishers to obtain login credentials
  - Strategies for requesting IP Authentication from site administrators
  - Browser plugins and user agent customization to enhance access
- Actually Crawling Content:
  - Exploration of web crawling tools, including Heritrix and Browsertrix
  - Utilizing Browsertrix Cloud and Browsertrix Crawler for efficient and scalable crawling
  - Using Browsertrix Behaviors for harvesting special content, such as videos, podcasts and flipbooks
  - Introduction to other essential tools for web harvesting
- Quality Assurance of Content:
  - Deduplication techniques and best practices
  - Implementing dashboards for IP-validation to ensure data integrity
  - Workshop segment on setting up the initial infrastructure and performing proxy at home
- Communication with Site Owners:
  - Emphasizing the importance of communication with site owners
  - Highlighting the direct correlation between effective communication and access privileges
  - Strategies for maintaining ongoing relationships with content providers
SPEAKERS
Anders Klindt Myrvoll, Royal Danish Library
Anders Klindt Myrvoll has been the Programme Manager at the national Danish web archive, Netarkivet, at the Royal Danish Library since 2018. Together with colleagues, he collects, preserves and provides access to the Danish web. Prior to web archiving, Anders worked for more than 13 years in the broadcast, film and media industry, collaborating globally on high-end localization, original content for children, the preservation of digital cultural heritage, strategy, optimization, leadership and much more. You can find him on LinkedIn or @andersklindt on X/Twitter.
Antares Reich, Austrian National Library
Antares is a crawl engineer responsible for the set-up and quality assurance of all crawls at the Austrian National Library. Previously he worked as a software developer for cashier systems and as a local assistant to a member of the European Parliament. He loves books and playing music.
Joel Nieminen, National Library of Finland
László Tóth, National Library of Luxembourg
László is a software engineer involved in the development of tools related to web archiving at the National Library of Luxembourg. This includes web crawling, ingest workflows and playback. Previously, he worked as a developer for a European media company, specializing in software for broadcasting, media and post-production. László holds an MSc in Advanced Computing Science from the University of East Anglia (United Kingdom). Outside of software development, he is mainly interested in mathematics and classical music.
Samuli Sairanen, National Library of Finland
RESOURCES
- Jari Heikkinen, Topi Chamchoon, Samuli Sairanen, Joel Nieminen, Sanna Haukkala: Collecting Online Newspapers and Bypassing Paywalls. IFLA International News Media Conference 2024.
- Proceedings of the 16th IFLA ILDS conference: Beyond the paywall – Resource sharing in a disruptive ecosystem : held at the National Library of Technology in Prague, Czech Republic, October 9-11, 2019.