IIPC TSS Webinar: Scaling Browsertrix Crawls: Use Cases

The IIPC Technical Speaker Series (TSS) facilitates knowledge sharing and fosters conversations and collaborations among IIPC members around web archiving technical work. This webinar will feature Browsertrix use cases from the UK Web Archive, the National Library of Norway, and the National Library of Australia.

This event is for IIPC members. Please contact events@netpreserve.org for a registration link.

NATIONAL LIBRARY OF AUSTRALIA

The National Library of Australia has been experimenting with browser-based crawling for broad crawls such as .gov.au domain crawls. We will talk about the challenges we encountered when trying Browsertrix for broad crawls and some potential solutions we’ve been exploring in our own experimental browser-based crawler. Topics include inspectable per-site queues, running browsers on server clusters via SSH, caching and deduplication, why robots.txt still matters, and how resource requirements for browsers compare to Heritrix.

SPEAKER: Alex Osborne, Web Archive Technical Lead, Co-Lead of the IIPC Tools Development Portfolio

UK WEB ARCHIVE

Archiving social media poses significant challenges, particularly during key events like the 2024 General Election. To enhance our efforts, the UK Web Archive piloted Browsertrix Cloud, complementing our existing tools to better capture content around the event. Our workflows allowed us to archive complex web pages and multiple social media accounts effectively. In this webinar, we aim to share insights with the web archiving community, offering feedback on our processes and the role Browsertrix Cloud played in preserving critical public discourse during the election. This tool proved invaluable in capturing dynamic, fast-changing online content.

SPEAKERS:

  • Helena Byrne, Web Archives Curator
  • Carlos Lelkes-Rarugal, Assistant Web Archivist
  • Nicola Bingham, Web Archives Lead Curator

NATIONAL LIBRARY OF NORWAY

The National Library of Norway has begun scaling the curation of dynamic websites to increase capacity for subject-specific collections and other specialized content. Many seeds require regular scoping, maintenance (e.g., passwords), and quality assurance. Librarians and curators, who have deep subject, selection, and metadata description experience in their fields, are well-positioned to begin supporting this work to build and maintain our collections. But evolving the way we work takes time, and internal development is needed to introduce new concepts, tools, and workflows among librarians from a variety of backgrounds.

This talk will discuss our approach to building partnership and community with stakeholders, establishing best practices, and developing an internal training program. We will give an overview of our current practices and how we are using Browsertrix for curation and quality assurance, discuss the learning objects we created to establish and document workflows, and share the pedagogical approaches that are guiding our training sessions.  

SPEAKER: Katherine Boss, Web Archives Curator

Date

18 Dec 2024

Time

UTC
2:00 PM - 3:00 PM

Local Time

  • Timezone: America/New_York
  • Date: 18 Dec 2024
  • Time: 9:00 AM - 10:00 AM

Labels

Members only
