WORKSHOPS

  • WS-01: TRAINING THE TRAINERS - HELPING WEB ARCHIVING PROFESSIONALS BECOME CONFIDENT TRAINERS
  • WS-02: LEVERAGING PARQUET FILES FOR EFFICIENT WEB ARCHIVE COLLECTION ANALYTICS
  • WS-03: CRAFTING APPRAISAL STRATEGIES FOR THE CURATION OF WEB ARCHIVES
  • WS-04: RUN YOUR OWN FULL STACK SOLRWAYBACK (2024)
  • WS-05: UNLOCKING ACCESS: NAVIGATING PAYWALLS AND ENSURING QUALITY IN WEB CRAWLING (BEHIND PAYWALL WEBSITES - CRAWL, QA & MORE)
  • WS-06: BROWSER-BASED CRAWLING FOR ALL: INTRODUCTION TO QUALITY ASSURANCE WITH BROWSERTRIX CLOUD

    WORKSHOP#01: Training the Trainers - Helping Web Archiving Professionals Become Confident Trainers

    Claire Newing1, Ricardo Basilio2, Lauren Baker3, Kody Willis4
    1The National Archives, United Kingdom; 2Arquivo.pt, Portugal; 3Library of Congress, United States of America; 4Internet Archive, United States of America

    The 'Training the Trainers' workshop aims to provide participants with concepts, methodologies, and materials to help them create and deliver training courses on web archiving in the context of their organizations and adapted to their communities.

    Knowing how to archive the web is becoming an increasingly necessary skill for those who manage information in organizations (those responsible for institutional memory).

    More and more people are realizing how important it is for organizations to keep track of the content they publish online. Whether for large organizations like a country's government or a small association, preserving their memory is increasingly valued. An organization's web presence through its website and social media channels is part of its digital heritage. The news about an organization published by newspapers, radio stations, podcasts and other sites is important for institutional history and so this content should be considered as part of web archiving.

    The problem is that, in practice, web archiving remains a little-known activity within the reach of few. It is often thought of as something for IT professionals. In addition, there are concerns about the legal implications: countries' legal frameworks are sometimes unsuited to content published on the web. Access is also limited, since archived content can often only be presented through catalogues and digital libraries built around standards and protocols such as Dublin Core or OAI-PMH. As a result, there is little investment in web archiving and an insufficient response to the need for preservation.

    In this session, we invite participants to become trainers. We challenge those with knowledge of web archiving to translate it into training activities. Basic web archiving is not that difficult. With a little training, anyone can be a web archivist and a promoter of the use of web archives by researchers.

    The workshop is promoted by the IIPC Training Working Group, which has created training modules for initial training on web archiving, available for use by the community at https://netpreserve.org/web-archiving/training-materials/. The aim now is to go one step further, offering intermediate training content, exemplary training cases, and tried-and-tested strategies, so that there are more confident trainers and more training offerings on web archiving.

    The first part will introduce the concepts and terminology of web archiving from a training perspective. Which are the most important: digital heritage, cultural heritage, the WARC format, timestamps, collection creation? If you had to choose one, which would you include in a training course?

    We will then present two cases of training programs and share our experience: the IIPC TWG initial training program and the ResPaDon project for researchers at the University of Lille.

    In the second part, we'll carry out a practical web page recording exercise, focusing on aspects related to training and learning, as well as the tools used. What does it take for anyone to be able to record a page?

    For this "hands on" part of the session, we will focus on browser-based recording, using the ArchiveWeb.page tool (available for all) to explore the questions that arise whenever a trainer wants to impart knowledge about web preservation.

    In conclusion, we will give participants a set of recommendations and guidelines for the web archiving trainer.

    Participants may use their own computers, provided they have an Internet connection, or share the exercise with other participants.

    At the end of the session, as expected learning outcomes, participants should be able to:

    • choose the most important concepts and terminology to include in a training session
    • be familiar with available training materials and various training experiences on web archiving
    • design a training program
    • include practical web archiving exercises, using available tools
    • demonstrate how small-scale web archiving can be integrated into larger projects, such as national archives or institutional archives.

    WORKSHOP#02: Leveraging Parquet Files for Efficient Web Archive Collection Analytics

    Sawood Alam1, Mark Phillips2
    1Internet Archive, United States of America; 2University of North Texas, United States of America

    In this workshop/tutorial we intend to provide participants with hands-on experience of analyzing a substantial web archive collection. The tutorial will include an introduction to existing archival collection summarization tools such as CDX Summary and the Archives Unleashed Toolkit, the process of converting CDX(J) files to Parquet files, numerous SQL queries to analyze those Parquet files for practical and common use cases, and visualization of the generated reports.
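
    To make the CDX-to-Parquet step concrete, here is a minimal sketch (not part of the official workshop materials) that converts a plain CDX file into a Parquet file with the pyarrow library. It assumes the common 11-column CDX layout; derived columns such as the TLD could be computed and appended to each row before writing, as in the EOT derivatives.

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Assumed 11-column CDX layout; adjust to match your own index files.
      COLUMNS = ["urlkey", "timestamp", "url", "mimetype", "status",
                 "digest", "redirect", "robotflags", "length", "offset", "filename"]

      def cdx_to_parquet(cdx_path, parquet_path, batch_size=100_000):
          rows, writer = [], None
          with open(cdx_path, "rt", encoding="utf-8", errors="replace") as fh:
              for line in fh:
                  if line.startswith(" CDX"):          # skip the format header, if present
                      continue
                  fields = line.rstrip("\n").split(" ")
                  if len(fields) != len(COLUMNS):      # skip malformed lines
                      continue
                  rows.append(dict(zip(COLUMNS, fields)))
                  if len(rows) >= batch_size:          # write out in batches
                      table = pa.Table.from_pylist(rows)
                      writer = writer or pq.ParquetWriter(parquet_path, table.schema)
                      writer.write_table(table)
                      rows = []
          if rows:                                     # flush the final partial batch
              table = pa.Table.from_pylist(rows)
              writer = writer or pq.ParquetWriter(parquet_path, table.schema)
              writer.write_table(table)
          if writer:
              writer.close()

      cdx_to_parquet("collection.cdx", "collection.parquet")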

    Since 2008, the End of Term (EOT) Web Archive has been gathering snapshots of the federal web, consisting of the publicly accessible ".gov" and ".mil" websites. In 2022, the End of Term team began to package these crawls into a public dataset, which they released as part of the Amazon Open Data Partnership program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. From the original WARC content, derivative datasets were created that address common use cases for web archives. These derivatives include WAT, WET, CDX, WARC Metadata Sidecar, and Parquet files. The Parquet files were generated primarily from the CDX files and their ZipNum index, and include many derived columns (such as the domain name or the TLD) that were otherwise not available as separate columns in the CDX files. Furthermore, the Parquet files can be extended to include additional columns (such as soft-404, language, and detected content-type) from the WARC Metadata Sidecar files. These files are publicly accessible from an Amazon S3 bucket for research and analysis.

    The toolchain used to generate the derivative files in the EOT datasets was reused from the Common Crawl project. Moreover, the EOT datasets are organized in a structure similar to that of the Common Crawl dataset, making them a drop-in replacement for researchers who have used the Common Crawl datasets before. We plan to leverage the EOT datasets in the workshop/tutorial for hands-on experience. Furthermore, tutorials created for the workshop will be added to the EOT dataset documentation for future reuse and to serve as a guide for researchers.

    The Parquet format is a column-oriented data file format designed for efficient data storage and retrieval. It is used to provide a different way of accessing the data held in the CDX derivatives. The Parquet format is used in many big-data applications and is supported by a wide range of tools. This derivative allows for arbitrary querying of the dataset using standard query formats like SQL and can be helpful for users who want to better understand what content is in their web archive collections using tools and query languages they are familiar with.

    CDX files are in a text-based columnar data format (similar to CSV files) that is sorted lexicographically. They are optimized for archival playback, but not necessarily for data analysis. For example, counting the number of mementos (captures) for a given TLD in a web archive collection requires processing only the rows that start with that TLD, because CDX files are primarily sorted by SURTs (also known as URL keys), which place the TLD at the very beginning of each line and allow a binary search to locate the desired lines in a large file. However, counting mementos with a certain HTTP status code (say, "200 OK") would require processing the entire CDX dataset. Similarly, counting the number of captures for a given year would require traversing the whole CDX index.
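
    As a small illustration of why such queries are costly, the following sketch (assuming the status code is the fifth space-separated field, as in the common CDX layout) counts captures per HTTP status code by scanning every line of a plain-text CDX file:

      from collections import Counter

      status_counts = Counter()
      with open("collection.cdx", "rt", encoding="utf-8", errors="replace") as fh:
          for line in fh:                      # the whole index must be read
              fields = line.split(" ")
              if len(fields) > 4 and fields[4].isdigit():
                  status_counts[fields[4]] += 1

      print(status_counts.most_common(10))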

    In contrast, Parquet files partition and store the data column-wise, so a query on one column does not need to read any bits of the other columns. Moreover, Parquet files store data in a binary format (as opposed to the text-based format used in CDX files) and apply run-length encoding and other compression techniques to optimize storage space and I/O operations. This reduces the processing time for most data analytics tasks compared to the corresponding CDX data. Parquet files support SQL-like queries and have an ecosystem of tools, while CDX files are usually analyzed using traditional Unix text-processing CLI tools or scripts that operate on similar principles.
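
    The equivalent queries against a Parquet version of the index can be expressed in SQL, for example with DuckDB's Python API, as in the hedged sketch below. The column names ("status", "timestamp") are assumptions and should be matched to the actual Parquet schema; only the referenced columns are read from disk.

      import duckdb

      con = duckdb.connect()  # in-memory database

      # Captures per HTTP status code, reading only the "status" column.
      print(con.execute("""
          SELECT status, COUNT(*) AS captures
          FROM read_parquet('collection.parquet')
          GROUP BY status
          ORDER BY captures DESC
          LIMIT 10
      """).fetchall())

      # Captures per year, reading only the "timestamp" column.
      print(con.execute("""
          SELECT substr(timestamp, 1, 4) AS year, COUNT(*) AS captures
          FROM read_parquet('collection.parquet')
          GROUP BY year
          ORDER BY year
      """).fetchall())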

    In addition to enabling analysis of individual web archive collections using Parquet files, we hope that this workshop will encourage web archives to convert their entire CDX data into Parquet files and expose API endpoints to perform queries against their holdings. Moreover, we anticipate a future CDX API implementation that operates on Parquet files, replacing CDX files completely if it proves more efficient in both storage and lookups suitable for archival playback.

    WORKSHOP#03: Crafting Appraisal Strategies for the Curation of Web Archives

    Melissa Wertheimer
    Library of Congress, United States of America

    Web archives preserve web-based evidence of events, stories, and the people and communities who create them. Web archiving is also a vital tool to build a diverse and authentic historical record through intentional digital curation. Information professionals determine the “what” and “when” of web archives: collection topics, seed URL lists, crawl durations, resource allocation, metadata, and more. Appraisal documentation - the “why” and “how” of web archives - reveals the intentions and processes behind the digital curation to ensure accountability in the preservation of born-digital cultural heritage.

    Melissa Wertheimer will present an expanded 80-minute version of a 2022 National Digital Stewardship Alliance Digital Preservation Conference (“DigiPres”) workshop. The co-authors and co-presenters of the original 2022 version were Meghan Lyon (Library of Congress, United States) and Tori Maches (University of California at San Diego, United States). All three professionals come from traditional archives backgrounds where appraisal documentation, archival values, and appraisal methods are standard practice for repositories. The workshop will facilitate an experimental environment in which participants consider how such archival values and appraisal methods used for analog and hybrid special collections also apply to web archives curation.

    The intended audience includes attendees who make curatorial and collection development decisions for their organization’s web archiving initiatives. These web archiving practitioners will roll up their sleeves and craft targeted appraisal strategies in writing for thematic and event-based web archive collections.

    Attendees will explore the use of prose, decision trees, and rubrics as forms of appraisal documentation for web archive collections. They will practice the application of archival values such as intrinsic value, evidential value, and interrelatedness as well as appraisal methods such as sampling and technical appraisal to evaluate whether websites are both in scope for collecting and feasible to capture.

    Workshop participants are encouraged to bring working materials for hypothetical or realized web archive collections, including seed lists and collection scopes, although workshop leaders will provide sample materials via Google Drive: a sample seed list for a thematic collection, a sample seed list for an event-based collection, sample appraisal documentation in the form of a narrative, and sample appraisal documentation in the form of a rubric.

    Attendees will gain a comprehensive overview of American and Canadian archival theory and a list of supporting resources for reference. Participants will also develop an understanding of the differences between collection development policies, collection scopes, and appraisal strategies for web archives. They will also learn to apply existing appraisal theories and archival values to web archives selection and curation, and to evaluate and apply different types of appraisal documentation to meet their needs. The workshop will be web archiving tool agnostic; these concepts are relevant regardless of which tools attendees might use to capture and preserve web content.

    Participants will get the most out of the workshop if they bring their own laptop or tablet, open minds, a passion for mapping theory to practice, and a willingness to discuss and debate selection criteria, appraisal strategies, and documentation with colleagues. The workshop includes a brief overview presentation followed by time for both individual work and group discussion.

    WORKSHOP#04: Run Your Own Full Stack SolrWayback (2024)

    Thomas Egense, Victor Harbo Johnston, Anders Klindt Myrvoll
    Royal Danish Library

    An in-person, updated version of the '21 and '23 WAC workshop Run Your Own Full Stack SolrWayback:

    • https://netpreserve.org/event/wac2021-solrwayback-1/
    • https://netpreserve.org/ga2023/programme/abstracts/#workshop_06

    This workshop will:

    • Explain the ecosystem for SolrWayback
      (https://github.com/netarchivesuite/solrwayback)
    • Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to follow the installation guide and will be helped whenever stuck.
    • Leave participants with a fully working stack for index, discovery and playback of WARC files
    • End with an open discussion of SolrWayback configuration and features

    Prerequisites

    • Participants should have a Linux, Mac or Windows computer with Java installed. To check that Java is installed, type this in a terminal: java -version
    • On Windows computers, an administrator account may be required.
    • Downloading the latest release of the SolrWayback Bundle beforehand from https://github.com/netarchivesuite/solrwayback/releases is recommended.
    • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles
    • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities best.

    Target audience

    Web archivists and researchers with intermediate knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

    Background

    SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction & visualisation, data export, and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source.
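
    Because the index is a standard Solr collection, it can also be queried outside the SolrWayback UI through Solr's /select API. The sketch below is illustrative only: the host, port, collection name, and field names are assumptions based on the defaults in the SolrWayback bundle and the webarchive-discovery schema, and should be checked against your own installation.

      import requests

      # Assumed defaults; adjust host, port and collection name to your install.
      SOLR_SELECT = "http://localhost:8983/solr/netarchivebuilder/select"

      params = {
          "q": "climate change",        # free-text query over the indexed content
          "fq": "domain:example.org",   # filter on a schema field (assumed name)
          "rows": 10,
          "wt": "json",
      }
      resp = requests.get(SOLR_SELECT, params=params, timeout=30)
      resp.raise_for_status()
      for doc in resp.json()["response"]["docs"]:
          print(doc.get("crawl_date"), doc.get("url"))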

    WORKSHOP#05: Unlocking Access: Navigating Paywalls and Ensuring Quality in Web Crawling (Behind Paywall Websites - Crawl, QA & More)

    Anders Klindt Myrvoll1, Thomas Martin Elkjær Smedebøl1, Samuli Sairanen2, Joel Nieminen2, Antares Reich3, László Tóth4
    1Royal Danish Library; 2National Library of Finland; 3Austrian National Library; 4National Library of Luxembourg

    In a digital age characterized by information abundance, access to online content remains a significant challenge. Many valuable resources are hidden behind paywalls and login screens, making it difficult for researchers, archivists, and data enthusiasts to retrieve, preserve, and analyze this content. This tutorial, led by experts from web archives across Europe, aims to empower participants with the knowledge and tools necessary to tackle these obstacles effectively. This tutorial provides a comprehensive guide to web crawling, quality assurance, and essential techniques for accessing content - focusing on content behind paywalls or log-in.

    In recent years, institutions like the Austrian National Library, National Library of Luxembourg, Royal Danish Library, and National Library of Finland have been addressing the challenges posed by paywalls and restricted access to online content. Each institution has developed unique strategies and expertise in acquiring and preserving valuable online information. This workshop serves as an opportunity to pool this collective knowledge and provide hands-on training to those eager to venture into the world of web crawling.

    Content of the Workshop

    This tutorial will equip participants with the skills and knowledge required to navigate paywalls, conduct web crawls effectively, ensure data quality, and foster ongoing communication with site owners. The following key components will be covered:

    1. Accessing Paywalled Content (see the first sketch after this list):
      • Techniques to bypass paywalls and access restricted websites
      • Negotiating with newspapers and publishers to obtain login credentials
      • Strategies for requesting IP authentication from site administrators
      • Browser plugins and user agent customization to enhance access
    2. Actually Crawling Content:
      • Exploration of web crawling tools, including Heritrix and Browsertrix
      • Utilizing Browsertrix Cloud and Browsertrix Crawler for efficient and scalable crawling
      • Using Browsertrix Behaviors for harvesting special content, such as videos, podcasts and flipbooks
      • Introduction to other essential tools for web harvesting
    3. Quality Assurance of Content (see the deduplication sketch below):
      • Deduplication techniques and best practices
      • Implementing dashboards for IP validation to ensure data integrity
      • Workshop segment on setting up the initial infrastructure and running a proxy at home
    4. Communication with Site Owners:
      • Emphasizing the importance of communication with site owners
      • Highlighting the direct correlation between effective communication and access privileges
      • Strategies for maintaining ongoing relationships with content providers
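
    To give a flavour of the first topic, the sketch below shows the general idea of session-based access to a login-protected site using credentials agreed with the publisher and a custom User-Agent that identifies the harvester. It is purely illustrative: the login URL, form field names, and article URL are hypothetical placeholders, and production harvesting of such sites is normally done with Heritrix or Browsertrix using stored login profiles rather than ad-hoc scripts.

      import requests

      session = requests.Session()
      # A custom User-Agent can identify the crawler to the publisher, as agreed.
      session.headers.update({
          "User-Agent": "ExampleLibrary-Harvester/1.0 (contact: webarchive@example.org)"
      })

      # Hypothetical form-based login using credentials supplied by the publisher.
      session.post("https://news.example.com/login",
                   data={"username": "archive-user", "password": "agreed-password"},
                   timeout=30)

      # Subsequent requests reuse the authenticated session cookies.
      resp = session.get("https://news.example.com/2024/01/some-article", timeout=30)
      resp.raise_for_status()
      print(resp.status_code, len(resp.text))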

    There will be a short tutorial from each institution looking at different subjects from the list above.
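
    As a concrete starting point for the deduplication topic, the following sketch (an assumption-laden illustration, not an institutional workflow) uses the warcio library to group response records by their WARC-Payload-Digest header, which is the signal most deduplication setups rely on to detect identical payloads captured more than once:

      from collections import defaultdict
      from warcio.archiveiterator import ArchiveIterator

      digests = defaultdict(list)
      with open("example.warc.gz", "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type != "response":
                  continue
              digest = record.rec_headers.get_header("WARC-Payload-Digest")
              url = record.rec_headers.get_header("WARC-Target-URI")
              if digest:
                  digests[digest].append(url)

      duplicates = {d: urls for d, urls in digests.items() if len(urls) > 1}
      print(f"{len(duplicates)} payloads captured more than once")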

    Expected Learning Outcomes

    Upon completing this tutorial, participants will have gained a robust skill set and deep understanding of the challenges and opportunities presented by paywall-protected websites. Specific learning outcomes include:

    • Proficiency in accessing paywalled content using various techniques
    • Better knowledge of how and when to use web crawling tools such as Heritrix and Browsertrix
    • Skills to ensure data quality through deduplication and visualization of IP-validation
    • Strategies for initiating and sustaining productive communication with site owners
    • The ability to apply these skills to unlock valuable content for research, archiving, and analysis

    Target Audience

    This tutorial is designed for anyone seeking to access and work with content behind paywalls or login screens. Whether you are a researcher, archivist, librarian, or data enthusiast, this tutorial will provide valuable insights and practical skills to overcome the challenges of restricted online access.

    Technical Requirements

    Participants are only required to bring a laptop equipped with an internet connection. This laptop will serve as their control interface for NAS, Heritrix, Browsertrix, and other relevant tools during the workshop.

    WORKSHOP#06: Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud

    Andrew Jackson1, Anders Klindt Myrvoll2, Ilya Kreymer3
    1Digital Preservation Coalition, United Kingdom; 2Royal Danish Library; 3Webrecorder, United States of America

    Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

    This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you, and how the latest QA features might help. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

    The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results.
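
    For orientation, the crawls that Browsertrix Cloud manages are run by Browsertrix Crawler, which can also be launched directly in Docker. The sketch below is a hedged illustration of a single crawl, not part of the workshop exercises; the flag names follow the public browsertrix-crawler documentation and should be checked against the version you run, and Browsertrix Cloud exposes the same kinds of options through its crawl-configuration UI instead.

      import os
      import subprocess

      # Run one Browsertrix Crawler crawl in Docker, writing output to ./crawls.
      subprocess.run([
          "docker", "run", "--rm",
          "-v", f"{os.getcwd()}/crawls:/crawls/",
          "webrecorder/browsertrix-crawler", "crawl",
          "--url", "https://example.org/",
          "--scopeType", "prefix",        # stay within the seed's URL prefix
          "--workers", "2",               # parallel browser workers
          "--generateWACZ",               # package the result as a WACZ file
          "--collection", "workshop-test",
      ], check=True)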

    After a quick break, we will then explore the latest Quality Assurance features of Browsertrix Cloud. This includes ‘patch crawling’: using the ArchiveWeb.page browser extension to archive difficult pages and then integrating those results into a Browsertrix Cloud collection.

    In the final wrap-up of the workshop, we will discuss challenges: what worked, what didn’t, and what still needs improvement. We will also outline how participants can provide access to the web archives they created, either using standalone tools or by integrating them into their existing web archive collections. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

    The format of the workshop will be as follows:

    • Introduction to Browsertrix Cloud
    • Use Cases and Examples by IIPC project partners
    • Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running)
    • Hands-On: Quality Assurance with Browsertrix Cloud
    • Wrap-Up: Final Q&A / Discussion of Access & Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners

    Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

    Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high-fidelity’ approach (examples include social media sites, sites that are behind a paywall, etc.). Ideally, the sites can be crawled in the course of 30 minutes (though crawls can be interrupted if they run for too long).

    This workshop is intended for curators and anyone wishing to create and use web archives who is ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise beyond basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but also the long-term future of this new crawling toolset.