IIPC-IFLA News Media Section Workshop: Browser-based Crawling of News Websites Behind Paywalls

Name: IIPC-IFLA News Media Section Workshop: Browser-based Crawling of News Websites Behind Paywalls
Start: 2025-02-13
End: 2025-02-13

The IFLA News Media Section and the International Internet Preservation Consortium (IIPC) are teaming up again to host a series of workshops focusing on archiving news media. As most news is now published online, there is growing interest in better understanding current best practices in web archiving. The main goal of our workshops is to examine and compare how organizations of varying sizes tackle this topic and to learn from their collective experiences. Through presentations and informal discussions, we will showcase diverse organizational approaches to archiving news media, including audiovisual content and social media, highlight key challenges, and explore innovative solutions.

Our initial event on December 4 featured a short introduction to the work of the IIPC and the IFLA News Media Section as well as three use cases from the National Library of France, the National and University Library of Iceland and the Library of Congress. We invite you to stay tuned for information about subsequent events that will take place in the first quarter of 2025.

The workshop will be moderated by Kopana Terry (Historic Newspapers Curator) of the University of Kentucky.

AGENDA

10:00-10:05: Introduction to IFLA News Media Section & IIPC

10:05-11:25: Presentations

11:25-11:55: Q&A with all speakers

11:55-12:00: Wrap-up

In a digital age characterized by information abundance, access to online content remains a significant challenge. Many valuable resources are hidden behind paywalls and login screens, making it difficult for researchers, archivists, and data enthusiasts to retrieve, preserve, and analyze this content. In this workshop, experts from web archives across Europe will equip participants with the skills and knowledge required to navigate paywalls, conduct web crawls effectively, ensure data quality, and foster ongoing communication with site owners. The following key components will be covered:

Accessing Paywalled Content:
- Techniques to bypass paywalls and access restricted websites
- Negotiating with newspapers and publishers to obtain login credentials
- Strategies for requesting IP Authentication from site administrators
- Browser plugins and user agent customization to enhance access
Actually Crawling Content:
- Exploration of web crawling tools, including Heritrix and Browsertrix
- Utilizing Browsertrix Cloud and Browsertrix Crawler for efficient and scalable crawling
- Using Browsertrix Behaviors for harvesting special content, such as videos, podcasts and flipbooks
- Introduction to other essential tools for web harvesting
Quality Assurance of Content:
- Deduplication techniques and best practices
- Implementing dashboards for IP-validation to ensure data integrity
- Workshop segment on setting up the initial infrastructure and running a proxy at home
Communication with Site Owners:
- Emphasizing the importance of communication with site owners
- Highlighting the direct correlation between effective communication and access privileges
- Strategies for maintaining ongoing relationships with content providers

SPEAKERS

Anders Klindt Myrvoll, Royal Danish Library

Anders has been the Programme Manager at the national Danish web archive, Netarkivet, at the Royal Danish Library since 2018. Together with colleagues, he collects, preserves and provides access to the Danish web. Prior to web archiving, Anders worked for more than 13 years in the broadcast, film and media industry, collaborating globally on high-end localization, making original content for children, saving digital cultural heritage, strategy, optimization, leadership and much more. You can find him on LinkedIn or @andersklindt on X/Twitter.

Antares Reich, Austrian National Library

Antares is a crawl engineer responsible for the set-up and quality assurance of all crawls at the Austrian National Library. Previously he worked as a software developer for cashier systems and as a local assistant to a member of the European Parliament. He loves books and playing music.

Joel Nieminen, National Library of Finland

Joel is an Information Systems Specialist who has been working at the Legal Deposit Office of the National Library of Finland since 2022. With a degree in Computer Science, he specializes in web crawling and data extraction, adeptly navigating paywalls to access valuable information while adhering to ethical standards. He combines technical expertise with a passion for open data, advocating for information accessibility. In his free time, you can find him enjoying Finnish nature.

László Tóth, National Library of Luxembourg

László is a software engineer involved in the development of tools related to web archiving at the National Library of Luxembourg. This includes web crawling, ingest workflows and playback. Previously, he worked as a developer for a European media company, specializing in software concerned with broadcasting, media and post-production. László holds an MSc in Advanced Computing Science from the University of East Anglia (United Kingdom). Outside of software development, he is mainly interested in mathematics and classical music.

Samuli Sairanen, National Library of Finland

Samuli has been an Information Systems Specialist at the Legal Deposit Office of the National Library of Finland since 2019, working on the processes and challenges of electronic material management, though he has lately been more involved in web archiving and its infrastructure.

The event is finished.

Tags: IFLA

Date: 13 Feb 2025
Time: 3:00 PM - 5:00 PM
Local Time (America/New_York): 10:00 AM - 12:00 PM



Related Events

Research, Services, and Tools for Accessing Web Archives Series: Full-Text Indexing of Very Large Collections

Research, Services, and Tools for Accessing Web Archives Series: Accessing and Using Web Archives Data and Metadata

Research, Services, and Tools for Accessing Web Archives Series: Giving Access to Collections: Platforms and Functionalities