Workshops - IIPC

There are a number of workshops and tutorials in the conference programme this year. As spaces are limited for these sessions, please sign up during your registration process if you wish to attend.

OVERVIEW

11 May, 13:30-15:30 (WAITLIST ONLY) WKSHP-01: DESCRIBING COLLECTIONS WITH DATASHEETS FOR DATASETS
11 May, 16:20-17:20 (WAITLIST ONLY) WKSHP-02: A PROPOSED FRAMEWORK FOR USING AI WITH WEB ARCHIVES IN LAMS
12 May, 8:30-10:00 (WAITLIST ONLY) WKSHP-03: FAKE IT TILL YOU MAKE IT: SOCIAL MEDIA ARCHIVING AT DIFFERENT ORGANIZATIONS FOR DIFFERENT PURPOSES
12 May, 8:30-10:00 (WAITLIST ONLY) WKSHP-04: BROWSER-BASED CRAWLING FOR ALL: GETTING STARTED WITH BROWSERTRIX CLOUD
12 May, 10:30-12:00 (WAITLIST ONLY) WKSHP-05: SUPPORTING COMPUTATIONAL RESEARCH ON WEB ARCHIVES WITH THE ARCHIVE RESEARCH COMPUTE HUB (ARCH)
12 May, 13:00-15:00 (WAITLIST ONLY) WKSHP-06: RUN YOUR OWN FULL STACK SOLRWAYBACK

WKSHP-01: Describing Collections with Datasheets for Datasets

Emily Maemura¹, Helena Byrne²

¹University of Illinois; ²British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley & Bowers (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis is described extensively in Brügger (2021), Brügger, Laursen & Nielsen (2019) and Brügger, Nielsen & Laursen (2020). However, looking beyond libraries, archives, or cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning, where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle.

This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al. that is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adopted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must, Should, Could, and Won’t have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.
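As an illustration only, the outcome of such a card-sorting exercise could be tallied along the following lines. This is a minimal sketch using the standard MoSCoW labels (Must, Should, Could, Won't); the question identifiers, the `tally` function, and the tie-breaking rule are all hypothetical and are not part of the workshop materials.

```python
from collections import Counter

# Standard MoSCoW categories, ordered from highest to lowest priority.
MOSCOW = ("Must", "Should", "Could", "Won't")


def tally(votes):
    """Return the most common category per question.

    votes maps a question id (hypothetical, e.g. "Q1") to the list of
    categories assigned by each group. Ties go to the higher priority,
    which is an arbitrary choice made for this sketch.
    """
    result = {}
    for question, labels in votes.items():
        unknown = [label for label in labels if label not in MOSCOW]
        if unknown:
            raise ValueError(f"{question}: unknown categories {unknown}")
        counts = Counter(labels)
        # Highest count wins; on a tie, the earlier entry in MOSCOW wins.
        result[question] = max(
            MOSCOW, key=lambda c: (counts[c], -MOSCOW.index(c))
        )
    return result


if __name__ == "__main__":
    example = {
        "Q1": ["Must", "Must", "Should"],
        "Q2": ["Could", "Won't"],
    }
    for question, category in tally(example).items():
        print(question, category)
```

A per-question result like this could then seed the group discussion, with split votes flagged for closer attention.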

Format: 120-minute workshop in which participants will do a card-sorting activity in small groups to review the practicalities of the Datasheets for Datasets framework when applied to web archives. Ideally, participants can prepare by reading through the questions prior to the workshop.

We anticipate the following schedule:

5 min: Introduction
15 min: Overview of Datasheets for Datasets
5 min: Overview of UKWA Datasets
60 min: Card-sorting Exercise in small groups
5 min: Comfort Break
20 min: Discussion of small group findings
5 min: Conclusion and Wrap-up

Target Audience: Web Archivists, Researchers

Anticipated number of participants: 12-16

Technical requirements: overhead projector with computer and large tables for a big card sorting activity.

Learning outcomes:

Raise awareness of the Datasheets for Datasets Framework in the web archiving community.
Understand what type of descriptive metadata web archive experts think should accompany web archive collections published as data.
Generate discussion and promote communication between web archivists and research users on priorities for documentation.

Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library)

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher who focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history.

References

Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00038-z

Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group.

Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday. https://doi.org/10.5210/fm.v25i3.10384

Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. OCLC Research. https://doi.org/10.25333/C3005C

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. https://doi.org/10.1002/asi.24048

WKSHP-02: A proposed framework for using AI with web archives in LAMs

Abigail Potter

Library of Congress, United States of America

There is tremendous promise in using artificial intelligence, and specifically machine learning techniques, to help curators, collections managers and users understand, use, steward and preserve web archives. Libraries, archives, museums and other public cultural heritage organizations that manage web archives have shared challenges in operationalizing AI technologies and unique requirements for managing digital heritage collections at a very large scale. Through research, experimentation and collaboration, the LC Labs team has developed a set of tools to document, analyze, prioritize and assess AI technologies in a LAM context. This framework is in draft form and in need of additional use cases and perspectives, especially web archives use cases. The facilitators will introduce the framework and ask participants to use it to evaluate their own proposed or in-process ML or AI use case that increases understanding of and access to web archives.

Sharing the framework elements, gathering feedback, and documenting web archives use cases are the goals of the workshop.
Sample Elements and Prompts from the framework:
Organizational Profile: How will or does your organization want to use AI or Machine learning?
Define the Problem you are trying to solve.
Write a user story about the AI/ML task or system you are planning/doing.
Risks and Benefits: What are the benefits and risks to users, staff and the organization when an AI/ML technology is/will be used?
What systems or policies will/do the AI/ML task or system impact or touch?
What are the limitations of future use of any training, target, validation or derived data?
Data Processing Plan: What documentation do/will you require when using AI or ML technologies? What existing open source or commercial platforms offer pathways into the use of AI?
What are the success metrics and measures for the AI/ML task?
What are the quality benchmarks for the AI/ML output?
What could come next?

WKSHP-03: Fake it Till You Make it: Social Media Archiving at Different Organizations for Different Purposes

Susanne van den Eijkel¹, Zefi Kavvadia², Lotte Wijsman³

¹KB, National Library of the Netherlands; ²International Institute of Social History; ³National Archives of the Netherlands

Abstract

Different organizations, different business rules, different choices. That seems obvious. However, different perspectives can alter the choices that you make and therefore the results you get when you’re archiving Social Media. In this tutorial, we would like to zoom in on the different perspectives an organization can have. A perspective can be formed over a mandate or type of organization, the designated community of an institution, or a specific tool that you use. Therefore, we would like to highlight these influences and how they can affect the results that you get.

When you start with Social Media archiving, you won’t get the best results right away. It is really a process of trial and error, where you aim for good practice and not necessarily best practice (and is there such a thing as best practice?). With a practical assignment we want to showcase the importance of collaboration between different organizations. What are the worst practices that we have seen so far? What’s best to avoid, and why? What could be a solution? And why is it a good idea to involve other institutions at an early stage?

This tutorial relates to the conference topics of community, research and tools. It builds on previous work from the Dutch Digital Heritage Network and the BeSocial project from the National Library of Belgium. Furthermore, different tools will be highlighted and it will be made clear why different tooling can lead to different results.

Format

In-person tutorial, 90 minutes.

Introduction: who are the speakers, where do they work, introduction on practices related to different organizations.
Assignment: participants will do a practical assignment related to social media archiving. They’ll receive personas for different institutions (library, government, archive) and ask themselves the question: how does your own organization's perspective influence the choices you make? We will gather the results on Post-its and end with a discussion.
Wrap-up: conclusions of discussion.

Target audience

This tutorial is aimed at those who want to learn more about doing social media archiving at their organizations. It is mainly meant for starters in social media archiving, but not necessarily complete beginners (even though they are definitely welcome too!). Potential participants could be archivists, librarians, repository managers, curators, metadata specialists, (research) data specialists, and generally anyone who is or could be involved in the collection and preservation of social media content for their organization.

Expected number of participants: 20-25.

Expected learning outcome(s)

Participants will understand:

Why Social Media archiving is different than Web Archiving;
Why different perspectives lead to different choices and results;
How tools can affect the potential perspectives you can work with.

In addition, participants will get insight into:

The different perspectives from which you can do social media archiving;
How different organizations (could) work on social media archiving.

Coordinators

Susanne van den Eijkel is a metadata specialist for digital preservation at the National Library of the Netherlands. She is responsible for all the preservation metadata, writing policies and implementing them. Her main focus is born-digital collections, especially the web archives. She focuses on web material after it has been harvested rather than on selection and tools, and is therefore more involved with which metadata and context information is available and relevant for preservation. In addition, she works on the communication strategy of her department, is actively involved in the Dutch Digital Heritage Network, and provides guest lectures on digital preservation and web archiving.

Zefi Kavvadia is a digital archivist at the International Institute of Social History in Amsterdam, the Netherlands. She is part of the institute’s Collections Department, where she is responsible for processing of digital archival collections. She is also actively contributing to research, planning, and improving of the IISH digital collections workflows. While her work covers potentially any type of digital material, she is especially interested in the preservation of born-digital content and is currently the person responsible for web archiving at IISH. Her research interests range from digital preservation and archives, to web and social media archiving, and research data management, with a special focus on how these different but overlapping domains can learn and work together. She is active in the web archiving expert group of the Dutch Digital Heritage Network and the digital preservation interest group of the International Association of Labour History Institutions.

Lotte Wijsman is the Preservation Researcher at the National Archives in The Hague. In her role she researches how we can further develop preservation at the National Archives of the Netherlands and how we can innovate the archival field in general. This includes considering our current practices and evaluating how we can improve these with, for example, new practices and tools. Currently, Lotte is active in research projects concerning subjects such as social media archiving, AI, a supra-organizational Preservation Watch function, and environmentally sustainable digital preservation. Furthermore, she is a guest teacher at the Archiefschool and Reinwardt Academy (Amsterdam University of the Arts).

WKSHP-04: Browser-Based Crawling For All: Getting Started with Browsertrix Cloud

Andrew N. Jackson¹, Anders Klindt Myrvoll², Ilya Kreymer³

¹The British Library, United Kingdom; ²Royal Danish Library; ³Webrecorder

Through the IIPC-funded “Browser-based crawling system for all” project, members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results.

After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories.
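Because a WACZ file is a ZIP-based package, a downloaded crawl can be inspected with ordinary tools before it is loaded into a viewer or added to a repository. The following is a minimal sketch, assuming only that the file is a valid ZIP archive; the function name `summarize_wacz` and the file name `my-crawl.wacz` are hypothetical, and the entries actually present depend on how the file was produced.

```python
import zipfile


def summarize_wacz(path):
    """Group the entries of a WACZ (ZIP) file by top-level directory.

    path may be a file name or a binary file-like object. Entries
    stored at the root of the package are grouped under "(root)".
    """
    summary = {}
    with zipfile.ZipFile(path) as wacz:
        for info in wacz.infolist():
            if info.is_dir():
                continue
            if "/" in info.filename:
                top = info.filename.split("/", 1)[0]
            else:
                top = "(root)"
            summary.setdefault(top, []).append(info.filename)
    return summary


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_wacz.py my-crawl.wacz
    for top, names in summarize_wacz(sys.argv[1]).items():
        print(f"{top}: {len(names)} file(s)")
```

A listing like this is one quick way to confirm that a download completed and that the package contains the expected WARC data before sharing it.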

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, and so on. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

Introduction to Browsertrix Cloud - 10 min
Use Cases and Examples by IIPC project partners - 10 min
Break - 5 min
Hands-On - Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 30 min
Break - 5 min
Hands-On - Replaying and Sharing Web Archives - 10 min
Wrap-Up - Final Q&A / Discuss Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’ (examples include social media sites, sites that are behind a paywall, etc.). Ideally, the sites can be crawled during the course of 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone wishing to create and use web archives who is ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but also the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 50 participants.

WKSHP-05: Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH)

Jefferson Bailey, Alex Dempsey, Kody Willis, Helge Holzmann

Internet Archive, United States of America

Format: 90 or 120-minute workshop and tutorial

Target Audience: The target audience is professionals working in digital library services that are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff.

Anticipated Number of Participants: 25

Technical Requirements: A meeting room with wireless internet access and a projector or video display. Participants must bring laptop computers and there should be power outlets. The coordinators will handle preliminary activities over email and provide some technical support beforehand as far as building or accessing web archives for use in the workshop.

Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, large volumes of metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogeneous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users.

In 2020, Internet Archive Research Services and Archives Unleashed received funding for joint technology development and community building to combine their respective tools that enable computational analysis of web and digital archives in order to build an end-to-end platform supporting data mining of web archives. The program is simultaneously building out a community of computational researchers doing scholarly projects via a program supporting cohort teams of scholars that receive direct technical support for their projects. The beta platform, Archive Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities and the social and computer sciences, and by dozens of libraries and archives that are interested in supporting local researchers and sharing datasets derived from their web collections in support of large-scale digital research methods.

ARCH lowers the barriers to conducting research on web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives range in use from graph analysis to text mining and file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections (over 3 PB of data) collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events. ARCH also includes various portions of the overall Wayback Machine global web archive, totalling 50+ PB and going back to 1996.
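To give a flavour of what working with a downloaded derivative might look like outside the platform, here is a minimal sketch that sums inbound links per domain from a graph-style CSV. The column names (`source`, `target`, `count`) and the sample rows are assumptions made for illustration; the actual layout of an ARCH derivative should be checked against the downloaded file.

```python
import csv
import io
from collections import Counter


def inbound_link_counts(csv_text, target_col="target", count_col="count"):
    """Sum link counts per target domain from a domain-graph CSV.

    The column names are hypothetical defaults; pass the names used in
    the file at hand.
    """
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[target_col]] += int(row[count_col])
    return totals


if __name__ == "__main__":
    # Invented sample data, not taken from any real collection.
    sample = (
        "source,target,count\n"
        "example.org,example.com,3\n"
        "example.net,example.com,2\n"
        "example.com,example.org,1\n"
    )
    for domain, total in inbound_link_counts(sample).most_common():
        print(domain, total)
```

The same aggregated counts could then be passed to a graph or visualization tool of the researcher's choice.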

This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives, the procedural challenges that librarians face in making web archives available for computational use, and most importantly, will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, from the perspectives of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.

Anticipated Learning Outcomes:

Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:

Understand the full lifecycle of making web and digital archives available for computational use by researchers, scholars, and others. This includes gaining knowledge of outreach and promotion strategies to engage research communities, how to handle computational research requests, how to work with researchers to scope and refine their requests, how to make collections available as data, how to work with internal technical teams facilitating requests, dataset formats and delivery methods, and how to support researchers in ongoing data analysis and publishing.
Gain knowledge of the specific types of data analysis and datasets that are possible with web archive collections, including data formats, digital methods, tools, infrastructure requirements, and the related methodological affordances and limitations for scholarship related to working with web archives as data.
Receive hands-on training on using the ARCH platform to explore and analyze web archive collections, from both the perspective of a collection manager and that of a researcher.
Be able to use the ARCH platform to generate derivative datasets, create corresponding data visualizations, publish these datasets to open-access repositories, and conduct further analysis with additional data mining tools.
Have tangible experience with datasets and related technologies in order to perform specific analytic tasks on web archives such as exploring graph networks of domains and hyperlinks, extracting and visualizing images and other specific formats, and performing textual analysis and other interpretive functions.
Have insights into digital methods through their exposure to a variety of different active, real-life use cases from scholars and research teams currently using the ARCH platform for digital humanities and similar work.

WKSHP-06: Run your own full stack SolrWayback

Thomas Egense, Toke Eskildsen, Jørn Thøgersen, Anders Klindt Myrvoll

Royal Danish Library, Denmark

An in-person, updated version of the ‘21 WAC workshop, Run your own full stack SolrWayback.

This workshop will

Explain the ecosystem for SolrWayback 4 (https://github.com/netarchivesuite/solrwayback)
Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to mirror the process on their own computer and there will be time for solving installation problems
Leave participants with a fully working stack for index, discovery and playback of WARC files
End with open discussion of SolrWayback configuration and features.

Prerequisites:

Participants should have a Linux, Mac or Windows computer with Java 8 or Java 11 installed. To check whether Java is installed, type this in a terminal: java -version
Downloading the latest release of the SolrWayback Bundle from https://github.com/netarchivesuite/solrwayback/releases beforehand is recommended.
Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles
A mix of WARC files from different harvests/years will best showcase SolrWayback's capabilities.
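The Java prerequisite can also be checked from a script. The sketch below runs `java -version` and extracts the major version from its output; the helper names are hypothetical, and the parsing assumes the usual `version "..."` banner (including the legacy `1.8` numbering for Java 8), which may not cover every vendor's format.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_292" denotes Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Return the installed Java major version, or None if java is missing."""
    try:
        proc = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    # java -version writes its banner to stderr rather than stdout.
    return parse_java_major(proc.stderr + proc.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major in (8, 11):
        print(f"Java {major} found.")
    else:
        print(f"Java 8 or Java 11 is expected; detected: {major}")
```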

Target audience:

Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

Maximum number of participants
30

Background

SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real time full text search, discovery, statistics extraction & visualisation, data export and playback of webarchive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at https://webadmin.oszk.hu/solrwayback/
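Since the index lives in Solr, it can also be queried directly over HTTP through Solr's standard select handler. The following sketch only builds such a query URL; the host, port, core name (`netarchivebuilder`) and field name (`crawl_year`) are assumptions for illustration and should be checked against the local installation and its schema.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, query, filters=(), rows=10):
    """Build a URL for Solr's /select handler with a JSON response.

    filters is an iterable of filter queries, each sent as an fq parameter.
    """
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    params += [("fq", f) for f in filters]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # All values below are assumed, not taken from the SolrWayback docs.
    url = build_solr_query_url(
        "http://localhost:8983/solr",
        "netarchivebuilder",
        "football",
        filters=["crawl_year:2016"],
        rows=5,
    )
    print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client while the local Solr instance is running.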

During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen.