WAC 2025 Workshops


WORKSHOP LIST

Tuesday, 8 April | IIPC MEMBERS ONLY

TRAINING WORKING GROUP WORKSHOP: Case Studies ‘Write-a-thon’ - Documenting Best Practices

Wednesday, 9 April

WORKSHOP #01: Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning
WORKSHOP #02: Web Archive Collections As Data

Thursday, 10 April

WORKSHOP #03: Introduction to Web Graphs
WORKSHOP #04: How to Develop a New Browsertrix Behavior

 

WORKSHOP ABSTRACTS

TRAINING WORKING GROUP WORKSHOP: Case Studies ‘Write-a-thon’ - Documenting Best Practices (IIPC MEMBERS ONLY)

Claire Newing1, Lauren Baker2, Kody Willis3
Organization(s): 1: The National Archives (UK), United Kingdom; 2: Library of Congress; 3: Internet Archive

The vision of the IIPC Training Working Group (TWG) is to make IIPC the world leader in training on web archiving. When we survey IIPC members about their training needs, they often tell us that a collection of case studies would be useful. We believe that creating such a collection would support the conference theme 'Towards Best Practices in Web Archiving'. The proposed session would be open to everyone, with the goal of growing the resources available to web archiving practitioners.

The TWG has discussed the creation of case studies during regular meetings and established that, although members wish to contribute and have many good ideas, they struggle to find time to write them. The proposed ‘Write-a-thon’ will give participants a dedicated eighty-minute session to focus on writing up their case studies. It will comprise three twenty-minute writing blocks separated by five-minute breaks. During the breaks, participants will be encouraged to discuss their work and, if they wish, get feedback from others. This format has been used successfully by a writing group at the author’s institution. Participants will be asked to submit their completed case studies at the end of the session, and the TWG Co-chairs will make them available to all on Google Drive. Participants will also have the ongoing support of the TWG, which will continue regular work on case studies throughout the year.

The session will be aimed at any conference attendee who wishes to submit a case study on any topic relevant to web archiving. Suggested topics include: a process or workflow that works well; a decision rubric for selection; a method developed for capturing a specific type of content; building specialist search queries; a tool that has worked well for training. To support the creation of case studies during the session and beyond, participants will be provided with a case study template and a sample completed case study for reference.

The overall outcome will be a case study collection hosted on Google Docs which will be open to all members. We plan to launch it shortly after the conference and will actively recruit others to add case studies on an ongoing basis.

Participants will:

  • Become familiar with a range of web archiving case study topics of interest to participants
  • Identify what elements are important to include in an IIPC case study
  • Contribute to building the resources available to web archiving practitioners for conducting their work
  • Network with participants to share best practice about creating case studies and web archiving topics more generally

WORKSHOP #01: Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning

Daniel Steinmeier, Sophie Ham
National Library of the Netherlands

Since 2023, the National Library of the Netherlands (KBNL) has been proud to curate a digital collection that has become UNESCO world heritage: the Digital City (De Digitale Stad, henceforth: DDS). The collection consists of an original freeze from 1996, two student projects, and miscellaneous material contributed by users and founders over the course of multiple events. The two student projects were the first attempt to revive the DDS portal and store it as a disk image. The two groups of students used different methods for this revival: one based on emulation, the other on migration. But what choices were made during restoration, and which version is more authentic?

Furthermore, KBNL holds several websites, scientific articles and newspaper clippings in its collections that might serve as contextual information. Do we consider this context crucial for understanding DDS, or do we leave users to find these resources themselves if they are interested?

Even without considering the plethora of archival material that currently makes up DDS, the original portal was already a mixed bag of different protocols. Most of these, such as IRC and Usenet newsgroups, are no longer mainstream and were never part of DDS itself but only linked to from it. The portal also contained links to offsite websites that were not archived, such as some of the users’ homepages or ‘houses’.

The original hardware, which is not part of the collection, ran on proprietary software that is now thoroughly obsolete. There was a multi-user dungeon where users could program their own objects, but this depended on real-time user interaction. Some of the functionality relied on live data that is no longer available, such as who was logged in. The original software was command-line based and built on Free-Net software. Shortly after the initial launch an HTML interface was introduced, but the command-line interface remained available for less-privileged users. Navigation in the HTML version relied heavily on image maps that required a binary executable to function correctly.

From newspaper evidence we can gather that functionality was sometimes unavailable or stopped working. There was both a general part of the portal and a personalized, login-based part, the latter also containing email. There were also cases of harmful or polarizing content being published in newsgroups. At the time the norm was laissez-faire self-regulation by the community, but times have changed, and our users may now expect a more active approach to regulation, or at least some form of acknowledgement, from us as heritage organizations.

As this description shows, archiving DDS and making it accessible to our users involves considerable complexity, and many difficult dilemmas arise when deciding what to archive and how to present it. Do we want users to experience what it is like to create a homepage in DDS, or do we want to present a historically accurate picture of the homepages that existed at the time? What should be considered part of the object and what part of the context? Is the migrated or the emulated version more authentic? Which is more important: the privacy of the original users or full access for researchers? What do we consider to belong to DDS and what not? Only the HTML? Or also any newsgroup material that might still be online but is not part of the archival material? Do users want a truly authentic experience or rather a convenient way of viewing the content?

Even though DDS was a Dutch portal, it was based on the software of the American Free-Nets and inspired other cities in Europe and Asia. We therefore think this case has many recognizable features that also apply to the archiving of other legacy portals. Arguably, there are no right or wrong answers: these are typically dilemmas in which each option has both benefits and drawbacks.

In our workshop we want to present a few of these real-world dilemmas to participants to stimulate discussion based on principles of reflective questioning and open dialogue. We will present several cases related to DDS that participants can discuss in groups. Each group then chooses a preferred solution and presents its reasoning to the room. Participants are encouraged to explore the reasons for choosing one option or the other, for instance by reflecting on their own organizational context or personal assumptions regarding digital preservation. We deliberately stay away from clear-cut answers or guidance and instead give participants the opportunity to explore these questions together.

Participants will learn how to ask the right questions to delve deeper into their own reasoning during decision making, based on our method of reflective questioning. They should be able to use this method and the cases presented to inform curatorial decisions about legacy webportals in their own collections. For KBNL, the group discussions may provide valuable community input and food for thought on some of the decisions we will be making about DDS in the near future.

WORKSHOP #02: Web Archive Collections As Data

Gustavo Candela1, Chase Dooley2, Abbie Grotke2, Olga Holownia3, Jon Carlstedt Tønnessen4
Organization(s): 1: University of Alicante, Spain; 2: Library of Congress, United States of America; 3: IIPC, United States of America; 4: National Library of Norway, Norway

GLAM institutions (Galleries, Libraries, Archives and Museums) have started to make their digital collections available for computational use, following the Collections as Data principles[1]. The International GLAM Labs Community[2] has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of this work, a collaboratively led effort defined a checklist[3] focused on the publication of collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections in various forms, ranging from seedlists to derivatives to “cleaned” WARC files, there is currently no standardised checklist for preparing those collections for researchers.

This workshop aims to involve web archiving practitioners and researchers in evaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM checklist, followed by two use cases that show how web archiving teams have been working with their institutions’ Labs to prepare large data packages and corpora for researchers. In the second part of the workshop, we want to involve the audience in identifying the main challenges to implementing the GLAM checklist and determining which steps require modification so that it can be applied successfully to web archive collections.

First use case
The Library of Congress has been working to refine and improve workflows that enable the creation and publication of web archive data packages for computational research use. With a recently hired Senior Digital Collections Data Librarian, and working with our institution’s Labs, web archiving staff have prepared new data packages for web archive data in response to recent research requests. We will provide some background on this work and on the developments that led to the creation of the data librarian role, and will share details about how we create our data packages and share derivative datasets with researchers. Using a recent data package release, we will compare our local practices for providing data to researchers with the GLAM checklist and talk through the ways in which our institution does or does not comply.

Second use case
The National Library of Norway recently launched its first Web News Corpus, making more than 1.5 million texts from 268 news websites available for computational analysis through an API. The aim is to facilitate text analysis at scale.[4] This presentation will give a brief description of “warc2corpus”, our workflow for turning WARCs into text corpora, which aims to satisfy the FAIR principles while also taking intellectual property rights into account.[5]
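To give a flavour of the kind of transformation involved, the Python sketch below shows a minimal WARC-to-text step using the warcio and BeautifulSoup libraries. It is not the warc2corpus pipeline itself, only an illustration of the general idea; the input file name and the JSONL output format are assumptions made for the example.

```python
# Minimal sketch of a WARC-to-text step using warcio and BeautifulSoup.
# This is NOT the warc2corpus pipeline; it only illustrates the general idea.
# The input file name and JSONL output format are assumptions for the example.
import json

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator


def warc_to_texts(warc_path):
    """Yield (url, text) pairs for HTML response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
            yield url, text


if __name__ == "__main__":
    # One JSON document per line, ready for downstream corpus-building tools.
    with open("corpus.jsonl", "w", encoding="utf-8") as out:
        for url, text in warc_to_texts("example.warc.gz"):
            out.write(json.dumps({"url": url, "text": text}, ensure_ascii=False) + "\n")
```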

In this presentation, we will showcase how users can:

  • tailor research corpora based on keywords and various metadata,
  • visualise general insights,
  • carry out different types of ‘distant reading’, both with the Library Labs package for Python and with user-friendly web applications.[6]

REFERENCES:
[1] Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8.
[2] https://glamlabs.io/
[3] Candela, G. et al. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195
[4] Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/
[5] Tønnessen, J., Birkenes, M., Bremnes, T. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes, M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings.
[6]: “dhlab documentation”. National Library of Norway. https://dhlab.readthedocs.io/en/latest/

WORKSHOP #03: Introduction to Web Graphs

Sebastian Nagel, Pedro Ortiz Suarez, Thom Vaughan
Organization(s): Common Crawl Foundation

The workshop will begin with a brief introduction to the concept of the webgraph or hyperlink graph - a directed graph whose nodes correspond to web pages and whose edges correspond to hyperlinks from one web page to another. We will also look at aggregations of the page-level webgraph at the level of Internet hosts or pay-level domains. The host-level and domain-level graphs are at least an order of magnitude smaller than the original page-level graph, which makes them easier to study.

To represent and process webgraphs, we use the WebGraph framework, developed at the Laboratory of Web Algorithms (LAW) of the University of Milan. As a "framework for graph compression aimed at studying web graphs," it allows very large webgraphs to be stored and accessed efficiently: even on a laptop computer, it is possible to store and explore a graph with 100 million nodes and more than 1 billion edges. The WebGraph framework is also used to compress other types of graphs, such as social network graphs or software dependency graphs. In addition, the framework and related software projects include tools for analysing web graphs and computing their statistical and topological properties. The WebGraph framework implements a number of graph algorithms, including PageRank and other centrality measures. It is an open-source Java project, and a re-implementation in the Rust language has recently been released. Over the past two decades, the WebGraph format has been widely used by researchers, for example at LAW or Web Data Commons, to distribute graph dumps. It has also been used by open data initiatives, including the Common Crawl Foundation and the Software Heritage project.

The workshop focuses on interactive exploration of one of the precompiled, publicly available webgraphs. We look at graph properties and metrics, learn how to map between node identifiers (plain numbers) and node labels (URLs), and compute the shortest path between two nodes. We also show how to detect "cliques", i.e. densely connected subgraphs, and how to run PageRank and related centrality algorithms to rank the nodes of our graph. We share our experience of how these techniques are used for collection curation: how cliques can be used to discover sites with content in a regional language, how link spam is detected, and how global domain ranks are used to select a representative sample of websites. Finally, we will build a small webgraph from scratch using crawl data.
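For those who want to experiment with these concepts before the workshop, the Python sketch below illustrates page-level versus host-level graphs, PageRank and shortest paths on a tiny toy graph using networkx. It is not the WebGraph framework used in the workshop (which is Java, with a Rust re-implementation), and all URLs in it are invented.

```python
# Toy illustration of webgraph concepts with networkx; the workshop itself uses
# the far more scalable WebGraph framework. All URLs below are invented.
from urllib.parse import urlparse

import networkx as nx

# Page-level graph: nodes are pages, edges are hyperlinks.
page_graph = nx.DiGraph()
page_graph.add_edges_from([
    ("https://a.example/", "https://a.example/about"),
    ("https://a.example/about", "https://b.example/"),
    ("https://b.example/", "https://c.example/news"),
    ("https://c.example/news", "https://a.example/"),
])

# Host-level aggregation: collapse pages onto their host names.
host_graph = nx.DiGraph()
for src, dst in page_graph.edges():
    src_host, dst_host = urlparse(src).hostname, urlparse(dst).hostname
    if src_host != dst_host:  # ignore intra-host links
        host_graph.add_edge(src_host, dst_host)

# Centrality and shortest path, as explored interactively in the workshop.
ranks = nx.pagerank(host_graph)
print("PageRank:", {host: round(score, 3) for host, score in ranks.items()})
print("Shortest path:", nx.shortest_path(host_graph, "a.example", "c.example"))
```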

Participants will learn how to explore webgraphs (even large ones) in an interactive way and learn how graphs can be used to curate collections. Basic programming skills and basic knowledge of the Java programming language are a plus but not required. Since this is an interactive workshop, attendees should bring their own laptops, preferably with the Java 11 (or higher) JDK and Maven installed. Nevertheless, it will be possible to follow the steps and explanations without having to type them into a laptop. We will provide download and installation instructions, as well as all teaching materials, prior to the workshop.

WORKSHOP #04: How to Develop a New Browsertrix Behavior

Ilya Kreymer, Tessa Walsh
Organization(s): Webrecorder

Behaviors are a key part of Browsertrix and Browsertrix Crawler: they allow the crawler’s browsers to automatically take certain actions on web pages to help capture important content. This tutorial will walk attendees through the process of creating a new behavior and using it with Browsertrix Crawler.

Browsertrix Crawler includes a suite of standard behaviors, including auto-scrolling pages, auto-playing videos, and capturing posts and comments on particular social media sites. By default, all of the standard behaviors are enabled for each crawl; users can instead disable behaviors entirely or select only a subset of them to use on a crawl.

At times, users may need additional custom behaviors that automatically navigate and interact with a site in specific ways during crawling, so that the resulting web archive and replay reflect the full experience of the live site. For instance, a new behavior could click interactive buttons in a particular order, “drive” interactive components on a page, or sequentially open posts and load comments on a new social media site.
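Browsertrix behaviors themselves are written in JavaScript against the Browsertrix behaviors API described in the tutorial below. Purely to illustrate the kind of interaction logic such a behavior automates, the following Python sketch uses Playwright to scroll a page and click a hypothetical “load more” button; the URL and the button text are made-up examples, not part of Browsertrix.

```python
# Illustration only: the kind of page interaction a custom behavior automates
# (scroll to load lazy content, expand comments). Real Browsertrix behaviors
# are JavaScript; the URL and button text here are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://forum.example.org/thread/123")  # hypothetical page

    for _ in range(10):
        # Scroll to trigger lazy-loaded content, then give the page time to respond.
        page.mouse.wheel(0, 2000)
        page.wait_for_timeout(500)

        # Click a "load more" control if one is present (selector is an assumption).
        load_more = page.locator("text=Load more comments")
        if load_more.count() > 0:
            load_more.first.click()
            page.wait_for_timeout(500)

    browser.close()
```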

This tutorial will walk through the process of creating a new behavior step by step, using the existing written tutorial for creating new behaviors on GitHub as a model. In addition to demonstrating how to write a behavior’s code (using JavaScript), the tutorial will also discuss how to know when a behavior is the appropriate solution for a given crawling problem, how to test behaviors during development, how to use custom behaviors with Browsertrix Crawler running locally in Docker, and finally how to use custom behaviors from the Browsertrix web interface (a feature that is currently planned and will be completed by the conference date).

Participants will not be expected to write any code or follow along on their own laptops in real time during the tutorial. The purpose is instead to demonstrate how one would approach developing a new behavior, to lower the barrier to entry for developers and practitioners who may be interested in doing so, and to give attendees the opportunity to ask questions of Webrecorder developers in real time. We would additionally love to foster a conversation about how to develop a community library of available behaviors, making it easier than ever for users to find and use behaviors that meet their needs.

The tutorial will be led by Ilya Kreymer and Tessa Walsh, developers at Webrecorder with intimate knowledge of the Browsertrix ecosystem. The target audience is technically minded web archiving practitioners and developers: people who could either write new custom behaviors themselves or communicate the salient points to developers at their institutions. Because this is not a hackathon-style workshop, the tutorial can accommodate as many participants as the venue allows. By the conclusion of the tutorial, attendees should understand how Browsertrix Behaviors work, when developing a new behavior is a good solution to their problems, the steps involved in developing and testing a new behavior, and where to find additional resources to help them along the way. Our hope is to foster a decentralized community of practice around behaviors, to the benefit of the entire IIPC community.