‘Research-ready’ collections: challenges and opportunities in making web archive material accessible

Leontien Talboom1, Mark Simon Haydn2

1Cambridge University Libraries, United Kingdom; 2National Library of Scotland, United Kingdom

The Archive of Tomorrow is a collaborative, multi-institutional project, led by the National Library of Scotland and funded by the Wellcome Trust, that collects information and misinformation around health in the online public space. One of the aims of this project is to create a ‘research-ready’ collection that would make it possible for researchers to access and reuse the themed collections of materials for further research. However, many challenges stand in the way of making this a reality, especially the legislative framework governing collection of and access to web archives in the UK, and the technical difficulties stemming from the emerging platforms and schemas used to catalogue websites.

This talk would primarily address IIPC 2023's Access and Research themes, while also touching on the Collections and Operations strands in its discussion of a short-term project promising to deliver technical improvements and expanded access to web archive collections by 2023. The presentation would explore the difficulties the project encountered and offer different ways into the material: exposing insights that can be generated from working with metadata exports outside of collecting platforms; detailing the project’s work in surfacing web archives in traditional library discovery settings through metadata crosswalks; and exploring further possibilities around the use of Jupyter Notebooks for data exploration and for the documentation and dissemination of datasets.
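To illustrate the general shape of a metadata crosswalk for surfacing web archive records in library discovery settings, here is a minimal sketch. The source field names and Dublin Core targets are hypothetical examples, not the project's actual schema.

```python
# Illustrative crosswalk: translate fields from a web-archive collection
# export into Dublin Core terms for a library discovery layer.
# All field names below are assumptions for demonstration purposes.

CROSSWALK = {
    "title": "dc:title",
    "seed_url": "dc:identifier",
    "capture_date": "dc:date",
    "curator_note": "dc:description",
    "subject_tags": "dc:subject",
}

def crosswalk_record(record: dict) -> dict:
    """Map a source record onto Dublin Core keys, dropping unmapped fields."""
    out = {}
    for field, value in record.items():
        target = CROSSWALK.get(field)
        if target is not None:
            out[target] = value
    return out
```

In practice a crosswalk also handles repeatable fields, controlled vocabularies, and MARC serialization, but the core of the exercise is a mapping table like the one above.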

The intended deliverables of this session are to present the tools developed within the project to make web archive material suitable and useful for research; to share frameworks used by the project’s web archivists when navigating the challenges of archiving personal and political health information online; and to discuss the barriers to access around collecting web archive and social media material in a UK context.

Developing new academic uses of web archives collections: challenges and lessons learned from the experimental service deployed at the University of Lille during the ResPaDon Project

Jennifer Morival1, Sara Aubry2, Dorothée Benhamou-Suesser2

1Université de Lille, France; 2Bibliothèque nationale de France, France

2022 marks the second year of the ResPaDon project, undertaken by the BnF (National Library of France) and the University of Lille, in partnership with Sciences Po and Campus Condorcet. The project brings together researchers and librarians to promote and facilitate a broader academic use of web archives by demonstrating the value of web archives and by reducing the technical and methodological barriers researchers may encounter when discovering this source for the first time or when working with such complex materials.

One way to meet these challenges and support new ways of doing research is the implementation of an experimental remote access point to the web archives at the University of Lille. The project team has developed a renewed set of tools and conducted outreach to new groups of potential web archive users.

The remote access point to the web archives has been deployed in two university libraries in Lille. The service allows both consultation of the web archives in their entirety (44 billion documents, 1.7 PB of data) and exploration of a specific collection, "The 2002 presidential and local elections", the first collection constituted in-house by the BnF 20 years ago. This collection is now accessible through various tools for data mining, analysis, and data visualization. The use of those tools is supported by guides, reports, examples, and use cases: multiple types of documentation that will also be evaluated for their usefulness as part of the experiment.

The presentation will focus on the implementation of this access point from both technical and practical aspects. It will address the training of the team of 6 mediators responsible for accompanying the researchers in Lille, as well as the collaboration between the teams in Lille and at the BnF. It will also tackle the challenges of outreach and the path we have taken to communicate within the academic community to find researcher-testers.

We will share the results and lessons learned from this experimentation: the first tests conducted with the researchers have allowed us to obtain feedback on the tools deployed and the improvements to be made to this experimental service.

Through the ARCHway: Opportunities to Support Access, Exploration, and Engagement with Web Archives

Samantha Fritz

Archives Unleashed Project, University of Waterloo, Canada

For nearly three decades, memory institutions have consciously archived the web to preserve born-digital heritage. Now, web archive collections range into the petabytes, significantly expanding the scope and scale of data for scholars. Yet research communities face many acute challenges, from the limited availability of analytical tools and community infrastructure to inaccessible research interfaces. The core objective of the Archives Unleashed Project is to lower these barriers and burdens for conducting scalable research with web archives.

Following a successful series of datathon events (2017-2020), Archives Unleashed launched the cohort program (2021-2023) to facilitate opportunities to improve access, exploration and research engagement with web archives.

Borrowing from the hacking genre of events often found within the tech industry, Archives Unleashed datathons were designed to provide an immersive and uninterrupted period of time for participants to work collaboratively on projects and gain hands-on experience working with web archive data. The datathon series cultivated community formation and empowered scholars to build confidence and the skills needed to work with web archives. However, the short-term nature of datathons meant that the focused energy and time devoted to research projects diminished once events concluded.

Launched in 2021, the Archives Unleashed cohort program was developed as a more mature evolution of the datathon model to support research projects. The program ran two iterative cycles and hosted 46 international researchers from 21 institutions. Programmatically, researchers engaged in a year-long collaboration project, with web archives featured as a primary data source. The mentorship model has been a defining feature, including direct one-on-one consultation from Archives Unleashed, connections to field experts, and opportunities for peer-to-peer support.

This presentation will reflect on the experiences of engaging with scholars to build scalable analytical tools and deliver a mentorship program to facilitate research with web archives. The cohort program asked researchers to step into an unfamiliar environment with complex data, and they did so with curiosity while embracing opportunities to access, explore, and engage with web archive collections. While the program highlights a broad range of use cases, we seek to inspire the adoption of web archives for scholarly inquiry more widely across disciplines.


Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives

Mark Phillips1, Cornelia Caragea2, Praneeth Rikka1

1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas Libraries, partnering with the University of Illinois Chicago (UIC) Computer Science Department, has been awarded a research and development grant (LG-252349-OLS-22) from the Institute of Museum and Library Services in the United States to continue work from previously awarded projects (LG-71-17-0202-17) related to the identification and extraction of high-value publications from large web archives. This work will investigate the potential of using existing bibliographic metadata from library catalogs and digital library collections to better train machine learning models that can assist librarians and information professionals in identifying and classifying high-value publications in large web archives. The project will focus on extracting publications related to state government document collections from the states of Texas and Michigan, with the hope that this approach will enable other institutions to leverage their existing web archives when building traditional digital collections from these publications. This presentation will provide an overview of the project and describe the approaches the research team is exploring to leverage existing bibliographic metadata in building machine learning models for publication identification from web archives. It will also share early findings from the first year of research, next steps, and ways institutions can apply this research to their own web archives.
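To make the general idea concrete, the sketch below scores a document title against unigram models trained on positive and negative example titles. This is only a toy illustration of using existing bibliographic records as training signal; it is not the project's actual model, which would use a proper classifier and much richer features.

```python
import math
from collections import Counter

# Toy sketch: use titles from existing bibliographic records as positive
# training examples, and score whether a crawled document's title looks
# like a government publication. Illustrative only.

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def train(positive: list[str], negative: list[str]):
    """Build unigram counts from positive and negative example titles."""
    pos, neg = Counter(), Counter()
    for t in positive:
        pos.update(tokenize(t))
    for t in negative:
        neg.update(tokenize(t))
    return pos, neg

def score(text: str, pos: Counter, neg: Counter) -> float:
    """Log-likelihood ratio under naive unigram models with add-one smoothing.
    Higher scores mean the text looks more like the positive examples."""
    p_total = sum(pos.values()) + len(pos) + 1
    n_total = sum(neg.values()) + len(neg) + 1
    s = 0.0
    for tok in tokenize(text):
        s += math.log((pos[tok] + 1) / p_total)
        s -= math.log((neg[tok] + 1) / n_total)
    return s
```

A real pipeline would train on full catalog records rather than titles alone and would need to handle the extreme class imbalance of a web-scale crawl.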

Conceptual Modeling of the Web Archiving Domain

Illyria Brejchová

Masaryk University, Czech Republic

Web archives collect and preserve complex digital objects. This complexity, along with the large scope of archived websites and the dynamic nature of web content, makes sustainable and detailed metadata description challenging. Different institutions have taken various approaches to metadata description within the web archiving community, yet this diversity complicates interoperability. The OCLC Research Library Partnership Web Archiving Metadata Working Group took a significant step forward in publishing user-centered descriptive metadata recommendations applicable across common metadata formats. However, there is no shared conceptual model for understanding web archive collections. In my research, I examine three conceptual models from within the GLAM domain: IFLA-LRM, created by the library community; CIDOC-CRM, originating from the museum community; and RiC-CM, stemming from the archive community. I will discuss what insight they bring to understanding the content within web archives and their potential for supporting metadata practices that are flexible, scalable, meet the requirements of end users, and are interoperable between web archives as well as the broader cultural heritage domain.

This approach sheds light on common problems encountered in metadata description practice in a bibliographic context by modeling archived web resources according to IFLA-LRM and showing how constraints within RDA introduce complexity without providing tools for feasibly representing this complexity in MARC 21. On the other hand, object-oriented models, such as CIDOC-CRM, can represent at least the same complexity of concepts as IFLA-LRM but without many of the aforementioned limitations. By mapping our current descriptive metadata and automatically generated administrative metadata to a single comprehensive model and publishing it as open linked data, we can not only more easily exchange metadata but also provide a powerful tool for researchers to make inferences about the past live web by reconstructing the web harvesting process using log files and available metadata.

While the work presented is theoretical, it provides a clearer understanding of the web archiving domain. It can be used to develop even better tools for managing and exploring web archive collections.

Web Archives & Machine Learning: Practices, Procedures, Ethics

Jefferson Bailey

Internet Archive, United States of America

Given their size, complexity, and heterogeneity, web archives are uniquely suited to leverage and enable machine learning techniques for a variety of purposes. On the one hand, web collections increasingly represent a larger portion of the recent historical record and are characterized by longitudinality, format diversity, and large data volumes; this makes them highly valuable in computational research by scholars, scientists, and industry professionals using machine learning for scholarship, analysis, and tool development. Few institutions, however, are yet facilitating this type of access or pursuing these types of partnerships and projects, given the specialized practices, skills, and resources required. At the same time, machine learning tools also have the potential to improve internal procedures and workflows related to web collections management by custodial institutions, from description to discovery to quality assurance. Projects applying machine learning to web archive workflows, however, also remain a nascent, if promising, area of work for libraries. There is also a “virtuous loop” possible between these two functional areas of access support and collections management, wherein researchers utilizing machine learning tools on web archive collections can create technologies that then have internal benefits for the custodial institutions that granted access to their collections. Finally, spanning both external researcher uses and internal workflow applications is an intricate set of ethical questions posed by machine learning techniques. Internet Archive has been partnering with both academic and industry research projects to support the use of web archives in machine learning projects by these communities. Simultaneously, IA has also explored prototype work applying machine learning to internal workflows for improving the curation and stewardship of web archives.
This presentation will cover the role of machine learning in supporting data-driven research, the successes and failures of applying these tools to various internal processes, and the ethical dimensions of deploying this emerging technology in digital library and archival services.

From Small to Scale: Lessons Learned on the Requirements of Coordinated Selective Web Archiving and Its Applications

Balázs Indig1,2, Zsófia Sárközi-Lindner1,2, Mihály Nagy1,2

1Eötvös Loránd University, Department of Digital Humanities, Budapest, Hungary; 2National laboratory for Digital Humanities, Budapest, Hungary

Today, web archiving operates at an increasingly large scale, pressuring newcomers and independent researchers to keep up with the pace of development and maintain an expensive ecosystem of expertise and machinery. These dynamics involve a fast and broad collection phase, resulting in a large pool of data, followed by a slower enrichment phase consisting of cleaning, deduplication and annotation.

Our streamlined methodology for specific web archiving use cases combines mainstream practices with new open-source tools. Our custom crawler conducts selective web archiving for portals (e.g. blogs, forums, currently applied to Hungarian news providers), using the taxonomy of the given portal to systematically extract all articles exclusively into portal-specific WARC files. As articles have uniform portal-dependent structure, they can be transformed into a portal-independent TEI XML format individually. This methodology enables assets (e.g. video) to be archived separately on demand.

We focus on textual content, which, in traditional web archives, would require resource-intensive filtering to isolate. Alternatives such as trafilatura are limited to automatic content extraction, often yielding invalid TEI or incomplete metadata, unlike our semi-automatic method. The resulting data are deposited by grouping portals under specific DOIs, enabling fine-grained access and version control.
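As a schematic illustration of producing a portal-independent TEI document from an extracted article, the sketch below serializes a title, date, and body paragraphs into a bare-bones TEI structure. The project's actual TEI schema and metadata are considerably richer; this only shows the shape of the transformation.

```python
import xml.etree.ElementTree as ET

# Minimal sketch: serialize an extracted article into a bare-bones TEI
# document (teiHeader plus text/body). Illustrative only; a production
# pipeline would emit a fuller, schema-valid TEI header.

def article_to_tei(title: str, date: str, paragraphs: list[str]) -> str:
    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "date").text = date
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    return ET.tostring(tei, encoding="unicode")
```

Because every portal's extractor emits the same intermediate fields (title, date, paragraphs), one serializer like this suffices for all portals.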

With almost 3 million articles from more than 20 portals, we developed a library for executing common tasks on these files, including NLP and format conversion, to overcome the difficulties of interacting with the TEI standard. To provide access to our archive and gain insights through faceted search, we created a lightweight trend-viewer application to visualize text and descriptive metadata.

Our collaborations with researchers have shown that our approach makes it easy to merge separate, coordinated crawls, promoting small archives created by different researchers, who may have less technical expertise, into a comprehensive collection that can in some respects serve as an alternative to mainstream archives.

Balázs Indig, Zsófia Sárközi-Lindner, and Mihály Nagy. 2022. Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 47–52, Taipei, Taiwan. Association for Computational Linguistics.


The UK Government Web Archive (UKGWA): Measuring the impact of our response to the COVID-19 pandemic

Tom Storrar

The National Archives, United Kingdom

The COVID-19 pandemic, the first pandemic of the digital age, has presented an enormous challenge to our web archiving practice. As the official archive of the UK government, we were tasked with building a comprehensive archive of the UK government's online response to the emergency. To meet this challenge we have devised new archiving strategies, ranging from supplementary broad, keyword-driven crawling to focused, data-driven, daily captures of the UK’s official “Coronavirus (COVID-19) in the UK” data dashboard. We have also massively increased our rates of capture. The challenge has demanded creativity, adaptation and a great deal of effort.

All of this work prompted us to think of a number of questions that we’d like to answer: How complete is the record we captured in our web archive and how much is this a result of the extra effort we made? How could we perform meaningful analysis on the enormous numbers of HTML and non-HTML resources? What contributions have these innovations made to this outcome and how can these inform our practice going forward?

To tackle these questions we needed to analyse millions of captured resources in our web archive. It soon became clear that we would only be able to achieve the level of insight needed by developing an entire end-to-end analysis system. The resulting pipeline we designed and built uses a combination of familiar and novel concepts and approaches: we used the WARC file content along with CDX APIs, but we also developed a set of heuristics and custom algorithms, all ultimately populating a database that allowed us to run queries to give us the answers we sought. Running an entirely cloud-based system enabled this work, as we were at that time unable to reliably access our office.
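As a small illustration of the CDX side of such a pipeline, the sketch below parses a CDX API response and aggregates captures per day. The response format here follows the Internet Archive CDX server's JSON output style (a header row followed by one row per capture); the UKGWA's own API and the real pipeline's heuristics are, of course, assumptions beyond this abstract.

```python
import json

# Sketch: parse a CDX API JSON response (Internet-Archive-style
# output=json: header row, then one row per capture) and count
# captures per day from the 14-digit timestamps.

def parse_cdx_json(payload: str) -> list[dict]:
    """Turn a CDX JSON payload into a list of field-name -> value dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, rec)) for rec in records]

def captures_per_day(records: list[dict]) -> dict:
    """Count captures per YYYYMMDD day."""
    counts: dict = {}
    for rec in records:
        day = rec["timestamp"][:8]
        counts[day] = counts.get(day, 0) + 1
    return counts
```

Aggregates like these, computed across millions of CDX rows and joined with WARC-derived fields in a database, are the kind of building block a completeness analysis rests on.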

This presentation will provide an overview of the approaches used, the results we found and the areas for further development. We believe that these tools can be applied to our overall web archive collections and hope that other institutions will find our experience useful when thinking about analysing their own collection and quantifying the impact of their efforts.

Women and COVID through Web Archives. How to explore the pandemic through a collaborative, interdisciplinary research approach

Susan Aasman1, Karin de Wild2, Joshgun Sirajzade3, Fréderic Clavert3, Valerie Schafer3, Sophie Gebeil4, Niels Brügger5

1University of Groningen, Netherlands, The; 2Leiden University, The Netherlands; 3University of Luxembourg, Luxembourg; 4Aix-Marseille University, France; 5Aarhus University, Denmark

The COVID crisis has been a shared worldwide and collective experience since March 2020, and many voices have echoed each other, whether related to grief, lockdowns, masks and vaccines, homeschooling, etc. However, this unprecedented crisis has also deepened asymmetries and failures within societies, in terms of occupational fields, economic inequalities, and access to health care, and we could extend the inventory of these hidden and more visible gaps that were reinforced during the crisis. Women and gender were also at stake in this health crisis, whether in discussions of the better management of the crisis by female politicians, domestic violence during lockdowns, the decreasing production of papers by female researchers, or homeschooling and the mental load borne by women.

As a cohort team within the Archives Unleashed Team (AUT) program, the European AWAC2 research team benefited from privileged access to this collection, thanks to Archive-It and through ARCH, and from regular mentorship by the AUT team. This allowed us to investigate and analyse a huge collection: 5.3 TB of data, 161,757 lines in the domain-frequency CSV, and 8,738,751 lines in the CSV containing the plain text of web pages. In December 2021, our AWAC2 team submitted several topics to the IIPC (International Internet Preservation Consortium) community and invited the international organization to select one that the team would investigate in depth, based on the unique IIPC COVID collection of web archives. Women, gender, and COVID was the winning topic.
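A first pass over derivative datasets like the domain-frequency CSV typically starts with simple aggregation, sketched below. The column names ("domain", "count") are assumptions for illustration; the actual ARCH export header should be checked.

```python
import csv
import io

# Sketch: rank domains in an ARCH-style domain-frequency CSV export.
# Column names "domain" and "count" are assumed for this example.

def top_domains(csv_text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent domains as (domain, count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["domain"], int(row["count"])) for row in reader]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]
```

Distant-reading work on the plain-text derivative then builds on the same pattern, scaled up to millions of rows.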

Accepting the challenge, the AWAC2 team organized a datathon in March 2022 in Luxembourg to investigate and retrieve the many traces of women, gender and COVID in web archives, while mixing close and distant reading. Since then, the team has been working on the dataset to further explore the opportunities for computational methods for reading at scale. In this presentation, we will reflect on technical, epistemological, and methodological challenges and present some results as well.

Surveying the landscape of COVID-19 web collections in European GLAM institutions

Nicola Bingham1, Friedel Geeraert2, Caroline Nyvang3, Karin de Wild4

1British Library, United Kingdom; 2KBR (Royal Library of Belgium); 3Royal Danish Library; 4Leiden University

The aim of the WARCnet network [https://cc.au.dk/en/warcnet/about] is to promote high-quality national and transnational research that will help us understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. Within the context of this network, a survey was conducted to see how cultural heritage institutions are capturing the COVID-19 crisis for future generations. The aim of the survey was to map the scope and collection strategies of COVID-19 web collections, with a main focus on Europe. The survey was managed by the British Library and conducted by means of the Snap survey platform. It circulated between June and September 2022, mainly among European GLAM institutions, and received 61 responses.

The purpose of this presentation is to provide an overview of the different collection development practices used when curating COVID-19 collections. On the one hand, the results may help GLAM institutions gain further insights into how to curate COVID-19 web collections or identify potential partners. On the other hand, revealing the scope of these web collections may also encourage humanists and data scientists to unlock the potential of these archived web sources to further understand international developments on the web during the COVID-19 pandemic.

More concretely, the presentation will provide further insight into the local, regional, national or global scopes of the different COVID-19 collections, the type of content that is included in the collections, the available metadata, the selection criteria that were used when curating the collections and the efforts that were made to create inclusive collections. The temporality of the collections will also be discussed by highlighting the start, and, if applicable, end dates of the collections and the capture frequency. Quality control and long-term preservation are two further elements that will be discussed during the presentation.


Searching for a Little Help From My Friends: Reporting on the Efforts to Create an (Inter)national Distributed Collaborative Social Media Archiving Structure

Zefi Kavvadia1, Katrien Weyns2, Mirjam Schaap3, Sophie Ham4

1International Institute of Social History; 2KADOC Documentation and Research Centre on Religion, Culture, and Society; 3Amsterdam City Archives; 4KB, National Library of the Netherlands

Social media archiving in cultural heritage and government is still at an experimental stage with regard to organizational readiness for and sustainability of initiatives. The many different tools, the variety of platforms, and the intricate legal and ethical issues surrounding social media do not readily allow for immediate progress and uptake by organizations interested or mandated to preserve social media content for the long term.

In Belgium and the Netherlands, the last three years have seen a series of promising projects on building social media archiving capacity, mostly focusing on heritage and research. One of their most important findings is that the multiple needs and requirements of successful social media archiving are difficult for any one organization to tackle; efforts to propose good practices or establish guidelines often run up against the reality of the many, and sometimes clashing, priorities of different domains, e.g. archives, libraries, local and national government, and research. Faced with little time and increasing costs, managers and funders are generally reluctant to support social media archiving as an integral part of collecting activity, as it is seen as a nice-to-have rather than a crucial part of their already demanding core business.

Against this background, we set out to bring together representatives of different organizations from different sectors in Belgium and the Netherlands to research the possibilities for what a distributed collaborative approach to social media archiving could look like, including requirements for sharing knowledge and experiences systematically and efficiently, sharing infrastructure and human and technical resources, prioritization, and future-proofing the initiative. In order to do this, we look into:

  • Wishes, demands, and obstacles regarding social media archiving at different types of organizations in Belgium and the Netherlands
  • Aligning the heritage, research, and governmental perspectives
  • Learning from existing collective organizational structures
  • First steps for the allocation of roles and responsibilities

Through interviews with staff and managers of interested organizations, we want to find out if there is potential in thinking about social media archiving as a truly collaborative venture. We would like to discuss the progress of this research and the ideas and challenges we have come up against.

Archiving social media in Flemish cultural or private archives: (how) is it possible?

Katrien Weyns1, Ellen Van Keer2

1KADOC-KU Leuven, Belgium; 2meemoo, Belgium

Social media are increasingly replacing other forms of communication. In doing so, they are also becoming an important source to archive in order to preserve the diverse voices in society for the long term. However, few Flemish archival institutions currently archive this type of content. To remedy this situation, a number of private archival institutions in Flanders started research on sustainable approaches and methods to capture and preserve social media archives. Confronted with the complex reality of this new landscape however, this turned out to be a rather challenging undertaking.

Through the lens of our project 'Best practices for social media archiving in Flanders and Brussels', we’ll look at the lessons learned and the central challenges that remain for social media archiving in private archival institutions in Flanders. Many of these lessons and challenges transcend this project and concern the broader web archiving community and cultural heritage sector.

Unsurprisingly, for a lot of (often smaller) private archival institutions in Belgium, archiving social media remains a major challenge, whether because of a lack of (new) digital archiving competencies or the limited availability of (often expensive and quickly outdated) technical solutions in heritage institutions. On top of that, there are major legal challenges. For one, these archives cannot fall back on archival law or legal deposit law as a legal basis. In addition, the quickly evolving European and national privacy and copyright regulations form a maze of rules and exceptions that they have to navigate and keep up with.

One last stumbling block is proving particularly hard to overcome: the legal and technical restrictions the social media platforms themselves impose on users. These make it practically impossible for heritage institutions to capture and preserve the integrity of social media content in a sustainable way. We believe this problem is best addressed by the international web archiving, research and heritage community as a whole.

This is only one of the recommendations we’re proposing to improve the situation as part of the set of ‘best practices’ we developed and which we would like to present here in more detail.

Collaborating On The Cutting Edge: Client Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

Perma.cc is a project of the Library Innovation Lab, which is based within the Harvard Law School Library and exists as a unit of a large academic institution. Our work has in the past focused mainly on the application of web archiving technology as it relates to citation in legal and scholarly writing. However, we have also spent time exploring expansive topics in the web archiving world, oftentimes via close collaboration with the Webrecorder project, and most recently have built tools leveraging new client-side playback technology made available by replayweb.page.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology, along with its potential new applications. It consists of: a simple web server configuration that provides web archive playback; a preconfigured “embed” page that can be easily implemented to interact with replayweb.page; and a two-way communication layer that allows the replay to reliably and safely communicate with the archive. These features are replicable for a relatively non-technical audience, and thus we sought to explore small-scale applications of the tool outside of our group.
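For readers unfamiliar with client-side playback, the sketch below renders the kind of "embed" page such a setup serves: it loads replayweb.page's web component and points it at a hosted WARC file. The `source` and `url` attributes and the `ui.js` script follow replayweb.page's published embedding pattern; the actual warc-embed boilerplate, its server configuration, and its communication layer go well beyond this.

```python
# Sketch: render a minimal replayweb.page embed page for a hosted WARC.
# The <replay-web-page> component and ui.js script follow replayweb.page's
# embedding pattern; warc-embed's real boilerplate is more elaborate.

EMBED_TEMPLATE = """<!doctype html>
<html>
  <body>
    <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
    <replay-web-page source="{source}" url="{url}"></replay-web-page>
  </body>
</html>"""

def embed_page(warc_source: str, start_url: str) -> str:
    """Return an HTML page that replays start_url from the given WARC,
    entirely in the visitor's browser."""
    return EMBED_TEMPLATE.format(source=warc_source, url=start_url)
```

Because the replay happens in the browser, the hosting side reduces to serving static HTML and WARC files with appropriate headers, which is what makes the approach attractive for non-technical adopters.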

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. They explore separate topics relating to the core technology. This session will look into user applications of the tool and institutional user feedback from the Harvard Library community.

Our colleagues at Harvard use the Internet Archive’s Archive-It across the board for the majority of their web archiving collections and access. As an experiment, we have worked with some of them to host and serve their WARC files via warc-embed. We scoped work based on their needs and made adjustments based on their ability to apply the technology. One example is a refresh of the software so that it could mesh with WordPress, which was more easily managed directly by the team. This session will explore a breakdown of roadblocks, design strategies, and wins from this collaboration. It will focus on the end-user results and applications of the technology.


Linking web archiving with arts and humanities: the collaboration ROSSIO and Arquivo.pt

Ricardo Basilio

Arquivo.pt - Fundação para a Ciência e Tecnologia, I.P., Portugal

Between 2018 and 2022, ROSSIO and Arquivo.pt developed collaborative activities with the goal of connecting web archiving, the arts, and the digital humanities. How can web archives be made useful and accessible to digital humanities researchers, and by extension to citizens? This challenge was answered in three ways: training, dissemination, and collaborative curation of websites. This presentation describes those collaborative activities and shares what we have learned from them.

ROSSIO is a Portuguese infrastructure for the Social Sciences, Arts and Humanities (https://rossio.fcsh.unl.pt/). Its mission is to aggregate, contextualize, enrich and disseminate digital content. It is based at the Faculty of Social and Human Sciences of the NOVA University of Lisbon (FCSH-NOVA) and involves several institutions that provide content. Arquivo.pt's mission (https://arquivo.pt) is to preserve the Portuguese web and make web content available from 1996 onwards to everyone, from ordinary citizens to researchers.

ROSSIO contributed human resources: a web curator, a community manager, a web developer, and researchers who used Arquivo.pt in their work. Arquivo.pt, in turn, contributed its know-how, created new services (e.g., SavePageNow), and made open datasets available.

Below, we describe the activities carried out in collaboration and their results.

First, regarding training, we refer to face-to-face and online sessions held with ROSSIO partners and their communities. We highlight the initiative "Café with Arquivo.pt" (https://arquivo.pt/cafe) and the webinars held during the pandemic, because they strengthened the connection between Arquivo.pt and distant communities (e.g., in 2021 they drew 538 participants with an 84% satisfaction rate).

Second, continuous dissemination through the social networks and groups of the ROSSIO partners helped make Arquivo.pt better known (e.g., 7,300 new users accessed the service between 2018 and 2021).

Third, researchers from ROSSIO collaborated in curating websites, which resulted in documentation for studies and online exhibitions (e.g., “Times of illness, times of healing” at FCSH-NOVA and "art festivals memory" at the Gulbenkian Art Library).

We conclude the presentation by sharing what we learned from participating in ROSSIO, and the challenges that lie ahead in creating a community of practice among arts and humanities researchers.

Building collaborative collections: experience of the Croatian Web Archive

Inge Rudomino, Dolores Mumelaš

National and University Library in Zagreb, Croatia

In Croatia, the only institution that archives the web is the National and University Library in Zagreb. The library established the Croatian Web Archive (HAW) and began archiving Croatian web sources in 2004. Since then, we have developed several approaches to web archiving: selective crawls, .hr domain crawls, thematic crawls, local history collections, and social media archiving. In order to broaden our collections and raise public awareness as much as possible, the Croatian Web Archive is opening up to collaboration with other libraries, as well as with all interested citizens.

One example is the Building Local History Web project. In 2020, the Croatian Web Archive began collaborating with public libraries to archive web resources related to a specific area or homeland. The contents relate to a specific locality, with the aim of presenting, and ensuring long-term access to, local materials that are available only on the web, complementing and popularizing the local history collections of the public libraries.

In addition to collaboration with public libraries, the Croatian Web Archive has connected with the User Service Department of the National and University Library in Zagreb, in order to involve citizens in the creation of thematic collections through citizen science. In that way the thematic collection “Bees, life, people” was created, using the crowdsourcing method, in collaboration with the public library, citizens (high school students) and other library departments.

This presentation will discuss developing a collection policy, collaboration and working process in building local history and citizen science collections.

The lessons learned throughout collaboration with citizens and public libraries are a great encouragement to expand the existing scope of archiving, as well as the involvement of other libraries and citizens, raising awareness of information literacy and the importance of archiving web content.

Your Software Development Internship in Web Archiving

Youssef Eldakar

Bibliotheca Alexandrina, Egypt

A summer internship project is an opportunity for the intern to practice in the real world as well as for the host institution to make extra progress on program objectives, while also engaging with the community. Since 2019, Bibliotheca Alexandrina's IT team has been running a summer internship series for undergraduate students of computing, with several of the internship projects having a connection to web archiving.

Throughout this experience, our mentors have found the young interns highly intrigued by the technology involved in archiving the web. From a computing perspective, aside from preserving a highly significant information medium, web archiving is an activity where a number of sub-domains of computing come together. A software project in web archiving may involve, for instance, big data management to keep pace with how the web, and consequently an archive thereof, continues to expand in volume; parallel computing to achieve the capacity for data harvesting and processing at that scale; machine learning to answer questions about the datasets that can be extracted from a web archive; or network theory and graph analytics to arrive at more understandable representations of the heavily interlinked data.

In this presentation, we invite you to join us on a virtual visit to the home of the IT team at Bibliotheca Alexandrina for a look into our archive of past internship projects in web archiving. These projects include the investigation of alternative graph analytics backends for the implementation of new features in web archive graph visualization, repurposing of the WARC format for use in the library's digital book portal, and crawling the web for text for language model training. For each project, we will review the specific objective, how the problem was addressed, and the outcome. Finally, to reflect on the overall experience, we will share lessons learned as well as discuss how the interaction with the community through internships is additionally an opportunity to raise awareness about web archiving, the technology involved, and the work of the International Internet Preservation Consortium (IIPC).


The Auto QA process at UK Government Web Archive

Kourosh Feissali, Jake Bickford

The National Archives, United Kingdom

The UK Government Web Archive’s (UKGWA) Auto QA process allows us to carry out enhanced, data-driven QA almost completely automatically. This is particularly useful for high-profile websites or sites that are about to close. Auto QA has several advantages over purely visual QA, enabling us to:

1) Identify problems that are not obvious at the visual QA stage.

2) Identify Heritrix errors during the crawl, such as the -2 and -6 fetch status codes. Once identified, we re-run Heritrix on the affected URIs.

3) Identify and patch URIs that Heritrix could not discover.

4) Identify, test, and patch hyperlinks inside PDFs. Many PDFs contain hyperlinks to pages on the parent website or to other websites, and sometimes the only way to reach those pages is through a link in a PDF, which most crawlers cannot normally follow.

Auto QA consists of three separate processes:

1) ‘Crawl Log Analysis’ (CLA), which runs automatically on every crawl. CLA examines Heritrix crawl logs for errors and then tests those errors against the live web.

2) ‘Diffex’, which compares what Heritrix discovered with the output of another crawler, such as Screaming Frog, to identify what Heritrix missed. Diffex then tests those URIs against the live web and, if they are valid, adds them to a patch list.

3) ‘PDFflash’, which extracts PDF URIs from Heritrix crawl logs, parses the PDFs for hyperlinks, and tests those hyperlinks against the live web, our web archives, and our in-scope domains. If a hyperlink’s target returns a 404, it is added to our patch list, provided it meets certain conditions such as scoping criteria.
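As an illustration of the Diffex idea, the comparison step can be sketched in a few lines of Python. All names here are hypothetical, and the liveness check is stubbed out; UKGWA's actual implementation is not yet public and will differ:

```python
from typing import Callable, Iterable, Set

def diffex_patchlist(
    heritrix_uris: Iterable[str],
    other_crawler_uris: Iterable[str],
    is_live: Callable[[str], bool],
) -> Set[str]:
    """URIs a second crawler found but Heritrix missed, filtered to those
    still valid on the live web; these become candidates for patch crawling."""
    missed = set(other_crawler_uris) - set(heritrix_uris)
    return {uri for uri in missed if is_live(uri)}

# Example with a stubbed liveness check (a real one would issue HTTP requests):
heritrix = {"https://example.gov.uk/", "https://example.gov.uk/about"}
screaming_frog = heritrix | {"https://example.gov.uk/report.pdf",
                             "https://example.gov.uk/gone"}
live = lambda uri: uri != "https://example.gov.uk/gone"
print(sorted(diffex_patchlist(heritrix, screaming_frog, live)))
# → ['https://example.gov.uk/report.pdf']
```

The key design point is that the set difference alone is not enough: a URI the second crawler found may itself be dead, so each candidate is validated against the live web before entering the patch list.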

UKGWA’s Auto QA is a highly efficient and scalable system that complements visual QA, and we are in the process of making it open source.

The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress

Grace Bicho, Meghan Lyon, Amanda Lehman

Library of Congress, United States of America

This talk will build upon information shared during the IIPC WAC 2022 session Building a Sustainable Quality Assurance Lifecycle at the Library of Congress (Thomas and Lyon).

The work to develop a sustainable and effective quality assurance (QA) ecosystem is ongoing and the Library of Congress Web Archiving Team (WAT) is constantly working to improve and streamline workflows. The Library’s web archiving QA goals are structured around Dr. Reyes Ayala’s framework for quality measurements of web archives based in Grounded Theory (Reyes Ayala). During last year’s session, we described how the WAT satisfies the two dimensions of Relevance and Archivability, with some automated processes built in to help the team do its work. We also introduced our idea for Capture Assessment to satisfy the Correspondence dimension of Dr. Reyes Ayala’s framework.

In July 2022, the WAT launched the Capture Assessment workflow internally and invited curators of web archives content at the Library to review captures of their selected content. To best communicate issues of Correspondence quality between the curatorial librarians and the WAT, we instituted a rubric where curatorial librarians can ascribe a numeric value to convey quality information from various angles about a particular web capture, alongside a checklist of common issues to easily note.

The WAT held an optional training alongside the launch, and since then, there have been over 90 responses from a handful of curatorial librarians, including one power user. The WAT has found responses to be mostly actionable for correction in future crawls. We’ve also seen that Capture Assessments are performed on captures that wouldn’t necessarily be flagged via other QA workflows, which gives us confidence that a wider swath of the archive is being reviewed for quality.

The session will share more details about the Capture Assessment workflow and, in time for the 2023 WAC session, we intend to complete a small, early analysis of the Capture Assessment responses to share with the wider web archiving community.

Reyes Ayala, B. Correspondence as the primary measure of information quality for web archives: a human-centered grounded theory study. Int J Digit Libr 23, 19–31 (2022). https://doi.org/10.1007/s00799-021-00314-x


20 years of archiving the French electoral web

Dorothée Benhamou-Suesser, Anaïs Crinière-Boizet

Bibliothèque nationale de France, France

In 2022, BnF is celebrating the 20th anniversary of its electoral crawls. On this occasion, we would like to trace the history of 20 years of electoral crawls, which cover 20 elections of all types (presidential, parliamentary, local, departmental, European) and represent more than 30 TiB of data. The 2002 presidential election crawl was the first in-house crawl conducted by the BnF, a founding moment for experimenting with a legal, technical, and library policy framework. As a heritage institution, we are accountable for the first electoral collections, which are emblematic of and representative of our workflows in several respects: harvesting, selection, and outreach.

First, from a technical point of view, electoral crawls were an opportunity to set up crawling tools and to develop adaptive techniques to keep pace with the evolution of the web and meet the challenge of archiving it. We have experimented and improved our archiving processes for each new election, with specific attention to the means of communication (e.g., forums, Twitter accounts, YouTube channels, and more recently Instagram accounts and TikTok content).

Secondly, electoral crawls have led the BnF to set up and organise a network of contributors and the means of selection. In 2002, contributions came from BnF librarians. In 2004, partner libraries in different regions and overseas territories contributed to selecting content for the regional elections. In 2012, we initiated the development of a collaborative curation tool. Throughout the years, we have also built a document typology that has remained stable, guaranteeing the coherence of the collections.

Thirdly, electoral crawls led us to develop ways to promote web archives to the public and the research community. To promote the use of a collection of such historical consistency, of high interest for the study of political life, we designed guided tours (thematic, edited selections of archived pages made by librarians). The BnF has also engaged in organizing scientific events and in several collaborative outreach initiatives.

Archiving the Web for FIFA World Cup Qatar 2022™

Arif Shaon, Carol Ann Daul Elhindi, Marcin Werla

Qatar National Library, Qatar

The core mission of Qatar National Library is to “spread knowledge, nurture imagination, cultivate creativity, and preserve the nation’s heritage for the future.” To fulfil this mission, the Library commits to collecting, preserving and providing access to both local and global knowledge, including heritage-related content relevant to Qatar and the region. Web resources of cultural importance could assist future generations in the interpretation of events that may not be extant anywhere else. Archiving such websites is an important initiative within the wider mission of the Library to support Qatar on its journey towards a knowledge-based economy.

The 2022 FIFA World Cup will be the first World Cup ever to be held in the Arab world, and hence is considered a landmark event in Qatar’s history. Qatar’s journey towards hosting the 2022 World Cup has been covered by all types of local and international websites and news portals, and the coverage is expected to increase significantly in the weeks leading to, during and post-World Cup. The information published by these websites will truly reflect the journey towards, and experience of, the event from a variety of perspectives, including the fans, the organizers, the players, and members of the public. Capturing and preserving such information for the long-term enables future generations to also share the experience and appreciate the astounding effort required to host a massive, culturally important global event in Qatar.

In this talk, we describe the Library’s approach to capturing and preserving websites related to the World Cup 2022, to guarantee access to the content for the future generations. We also highlight the challenges associated with developing archived websites as collections for researchers in the context of the Qatari copyright law.

Museums on the Web: Exploring the past for the future

Karin de Wild

Leiden University, The Netherlands

This presentation will celebrate the launch of the special collection ‘Museums on the Web’ at the KB, National Library of the Netherlands. This evolving collection unlocks an essential sub-collection, the largest within the KB web archive. It contains more than 800 museum websites and offers the potential to research the history of museums on the web in the Netherlands.

Accessing web archives requires special tools, so this presentation will demonstrate a variety of entry points. It features a selection of curated archived websites that can be viewed page by page. This will also be the first KB special collection accessible through a SOLR Wayback search engine, which enables requesting derived datasets and exploring the collection through a series of dashboards. This offers the opportunity to study the history of museums on the web in the Netherlands, combining methods from history and data science and drawing on computational analysis of web archive data.

The presentation will conclude with highlighting some significant case studies to showcase the diversity of museum websites and the research potential to uncover a Dutch history of museums on the Web. The advent of online technologies has changed the way museums manage collections and access them, shape exhibitions, and build communities. By engaging with the past, we can enhance our understanding of how museums are functioning today and offer new perspectives for future developments.

This paper coincides with the release of a Double Special Issue “Museums on the Web: Exploring the past for the future” in the journal Internet Histories: Digital Technology, Culture and Society (Routledge/Taylor & Francis).

Unsustainability and Retrenchment in American University Web Archives Programs

Gregory Wiedeman1, Amanda Greenwood2

1University at Albany, United States of America; 2Union College

This presentation will overview the expansion and later retrenchment of UAlbany’s web archives program due to a lack of permanently funded staff. UAlbany began its web archives program in 2013 in response to state records laws requiring it to preserve university records on the web. The department that housed the program had strong existing collecting programs in New York State politics and capital punishment. Since much of current politics and activism now happens online, it was natural and necessary to expand the web archives program to ensure we were effectively documenting these important spaces for the long-term future. However, we will show how the increasing complexity of the web and of collecting techniques means that the scoping needs of ongoing collecting seem to require significantly more testing and labor over time. Thus, despite the need to expand the web archives program to meet our department’s mission, we will describe the painful process of reducing our web archives collecting scope. With the NDSA Web Archiving in the United States surveys reporting 71-83% of respondents devoting 0.5 or less FTE to web archiving, maintenance inflation like this is catastrophic to many web archives programs. Most alarmingly, we will overview how the web archives labor situation at American universities is likely to get worse. The UAlbany Libraries, which houses the web archives program, has permanently lost over 30% of FTE since 2020 and almost 50% of FTE since 2000. Peer assessment studies, ARL staffing surveys, and the University of California, Berkeley’s recent announcement of library closures show that UAlbany’s example is more typical than exceptional. Finally, we will show how these cuts are not the result of a misunderstanding or a lack of value for web archives or libraries by university administrators, but because our web archives program conflicts with UAlbany’s overall organizational mission and the business model of American higher education.


Discovering and Archiving the Frisian Web. Preparing for a National Domain Crawl.

Susanne van den Eijkel, Iris Geldermans

KB, National Library of the Netherlands

In recent years KB, National Library of the Netherlands (KBNL) conducted a pilot for a national domain crawl. KBNL has been harvesting websites with the Web Curator Tool (a web interface for the Heritrix crawler) since 2007, on a selective basis focused on Dutch history, culture, and language. Information on the web can be short-lived, yet of vital importance to researchers now and in the future. Furthermore, KBNL outlined in its content strategy the ambition to collect everything published in and about the Netherlands, websites included. As more libraries around the world began collecting their national domains, KBNL also expressed the wish to execute a national domain crawl. Before we were able to do so, we had to form a multidisciplinary web archiving team, decide on a new tool for domain harvests, and start an intensive testing phase. For this pilot a regional domain, the Frisian, was selected. Since we were new to domain harvesting, we used a selective approach. Curators of digital collections at KBNL were in close contact with Frisian researchers to help define which websites needed to be included in the regional domain. During the pilot we also gathered more knowledge about Heritrix, as we had been using NetarchiveSuite (also a web interface for the Heritrix crawler) for crawls.

Now that the results are in, we can share our lessons learned, such as the technical and legal challenges and the related policies needed for web collections. We will also go into detail about the crawler software settings that were tested and how such information can be used as context information.

This presentation is related to the conference topics collections, community and program operations, as we want to share the best practices for executing a (regional) domain crawl and lessons learned in preparation for a national domain crawl. Furthermore, we will focus on the next steps after completion of the pilot. Other institutions that are harvesting websites can learn from it and those that want to start with web archiving can be more prepared.

Back to Class: Capturing the University of Cambridge Domain

Caylin Smith, Leontien Talboom

Cambridge University Libraries, United Kingdom

The University Archives of Cambridge University, based at the University Library (UL), is responsible for the selection, transfer, and preservation of the internal administrative records of the University, dating from 1266 to the present. These records are increasingly created in digital formats, including common ‘office’ formats (Word, Excel, PDF), and are increasingly published on the web.

The question “How do you preserve an entire online ecosystem in which scholars collaborate, discover and share new knowledge?” about the digital scholarly record posed by Cramer et al. (2022) equally applies to online learning and teaching materials as well as the day-to-day business records of a university.

Capturing this online ecosystem comprehensively, rather than selectively, is an undertaking that involves many stakeholders and moving parts.

As a UK Legal Deposit Library, the UL is a partner in the UK Web Archive and Cambridge University websites are captured annually; however, some online content needs to be captured more frequently, does not have an identifiable UK address, or is behind a log-in screen.

To improve this capturing, the UL is working on the following:

  • Engaging with content creators and/or University Information Services, which supports the University’s Drupal platform.
  • Working directly with the University Archivist as well as creating a web archiving working group with additional Library staff to identify what University websites need to be captured manually or were captured only in an annual domain crawl but need to be captured more frequently.
  • Becoming a stakeholder in web transformation initiatives to communicate requirements for creating preservable websites and quality checking new web templates from an archival perspective.
  • Identifying potential tools for capturing online content behind login screens. So far WebRecorder.io has been a successful tool to capture this material; however, this is a time-consuming and manual process that would be improved if automated. The automation of this process is currently being explored.

Our presentation will walk WAC2023 attendees through our current workflow as well as highlight ongoing challenges we are working to resolve so that attendees based at universities can take these into account for archiving content on their university’s domains.

Laboratory not Found? Analyzing LANL’s Web Domain Crawl

Martin Klein, Lyudmila Balakireva

Los Alamos National Laboratory, United States of America

Institutions, regardless of whether they identify as for-profit, nonprofit, academic, or government, are invested in maintaining and curating their representation on the web. The organizational website is often the top-ranked result on search engine result pages and commonly used as a platform to communicate organizational news, highlights, and policy changes. Individual web pages from this site are often distributed via organization-wide email channels, included in news articles, and shared via social media. Institutions are therefore motivated to ensure the long-term accessibility of their content. However, resources on the web frequently disappear, leading to the known detriment of link rot. Beyond the inconvenience of encountering a “404 - Page not Found” error, there may be legal implications when published government resources are missing, trust issues when academic institutions fail to provide content, and even national security concerns when taxpayer-funded federal research organizations such as Los Alamos National Laboratory show deficient stewardship of their digital content.

We therefore conducted a web crawl of the lanl.gov domain with the motivation to investigate the scale of missing resources within the canonical website representing the institution. We found a noticeable number of broken links, including a significant number of the special case of link rot commonly known as “soft 404s”, as well as potential transient errors. We further evaluated the recovery rate of missing resources from more than twenty public web archives via the Memento TimeTravel federated search service. Somewhat surprisingly, our results show little success in recovering missing web pages.
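A common heuristic for detecting soft 404s can be sketched as follows (this is our own illustrative sketch under assumed names, not necessarily the detection method the authors used): fetch a deliberately nonexistent path on the same site, then flag any page that returns a 200 status but whose body closely resembles that known error page.

```python
import difflib

def looks_like_soft404(status: int, body: str, known_error_body: str,
                       threshold: float = 0.9) -> bool:
    """Flag a 200 response whose body closely matches the page the server
    returns for a deliberately nonexistent URL -- a likely 'soft 404'."""
    if status != 200:
        return False  # a real HTTP error is ordinary link rot, not a soft 404
    similarity = difflib.SequenceMatcher(None, body, known_error_body).ratio()
    return similarity >= threshold

# The "known error body" would come from requesting a random garbage path:
error_page = "<h1>Sorry, we couldn't find that page.</h1><p>Try our search.</p>"
print(looks_like_soft404(200, error_page, error_page))  # → True
print(looks_like_soft404(200, "<h1>LANL physics research highlights</h1>", error_page))
```

In practice the similarity threshold needs tuning per site, since templated pages share boilerplate (headers, navigation) even when their main content differs.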

These observations lead us to argue that, as an institution, we could be a better steward of our web content, and that establishing an institutional web archive would be a significant step toward this goal. We have therefore implemented a pilot LANL web archive to highlight the availability and authenticity of web resources.

In this presentation, I will motivate the project, outline our workflow, highlight our findings, and demonstrate the implemented pilot LANL web archive. The goal is to showcase an example of an institutional web crawl that, in conjunction with the evaluation, can serve as a blueprint for other interested parties.

Public policies for governmental web archiving in Brazil

Jonas Ferrigolo Melo1, Moisés Rockembach2

1University of Porto, Portugal; 2Federal University of Rio Grande do Sul, Brazil

The scientific, cultural, and intellectual relevance of web archiving has been widely recognized since the 1990s. The preservation of the web has been examined in studies ranging from its specific theories and practices, such as its methodological approaches and the ethical aspects of preserving web pages, to subjects that permeate the Digital Humanities and the use of web archives as a primary source.

This study aims to identify the documents and actions that are related to the development of the web archive policy in Brazil. The methodology used was bibliographic and documental research, using literature on government web archiving, and legislation regarding public policies.

Brazil has a variety of technical resources and legislation addressing the need to preserve government documents; however, websites have not yet been included in the records management practices of Brazilian institutions. Until recently, the country did not have a website preservation policy. Currently, however, two government actions are under development.

The first is a bill, under consideration in the National Congress since July 2015, that provides for institutional digital public heritage on the web. The bill has been before the Constitution and Justice and Citizenship Commission (CCJC) of the Brazilian National Congress since December 2022.

Another action comes from the National Council of Archives – Brazil (CONARQ), which established a technical chamber to define guidelines for the elaboration of studies, proposals, and solutions for the preservation of websites and social media. Based on its general goals, the technical chamber has produced two documents: (i) the Website and Social Media Preservation Policy; and (ii) a recommendation of basic elements for the digital preservation of websites and social media. The documents were approved in December 2022 and will be published as a federal resolution.

These actions show that efforts for the state to take a proactive role in promoting and leading this technological innovation are under way in Brazil. The definition of a web archiving policy, as well as of the requirements for selecting the preservation and archiving methods, technologies, and contents to be archived, can already be considered a reality in Brazil.


Developer Update for Browsertrix Crawler and Browsertrix Cloud

Ilya Kreymer, Tessa Walsh

Webrecorder, United States of America

This presentation will provide a technical and feature update on the latest features implemented in Browsertrix Cloud and Browsertrix Crawler, Webrecorder's open source automated web archiving tools. The presentation will provide a brief intro to Browsertrix Cloud and the ongoing collaboration between Webrecorder and IIPC partners testing the tool.

We will present an outline for the next phase of development of these tools, discuss current and ongoing challenges in high-fidelity web archiving and how we may mitigate them in the future, and cover lessons learned thus far.

We will end with a brief Q&A to answer any questions about the Browsertrix Crawler and Cloud systems, including how others may contribute to testing and development of these open source tools.

Opportunities and Challenges of Client-Side Playback

Clare Stanton, Matteo Cargnelutti

Library Innovation Lab, United States of America

The team working on Perma.cc at the Library Innovation Lab has been using the open-source technologies developed by Webrecorder in production for many years, and has subsequently built custom software around those core services. Recently, in exploring applications for client-side playback of web archives via replayweb.page, we have learned lessons about the security, performance and reliability profile of this technology. This has deepened our understanding of the opportunities it presents and challenges it poses. Subsequently, we have developed an experimental boilerplate for testing out variations of this technology and have sought partners within the Harvard Library community to iterate with, test our learnings, and explore some of the interactive experiences that client-side playback makes possible.

warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore this new technology. It consists of: a cookie-cutter web server configuration for storing, proxying, caching and serving web archive files; a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file; as well as a two-way communication layer allowing the embedding website to safely communicate with the embedded archive. These unique features allow for a thorough exploration of this new technology from a technical and security standpoint.
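To make the server-configuration component concrete, here is a minimal sketch in Python. This is our own illustrative stand-in, not warc-embed's actual configuration (which is a fuller web-server boilerplate): it serves archive files with the cross-origin and caching headers that client-side playback generally requires. A production setup would also need a server that honors byte-range requests, which the standard-library handler does not.

```python
from http.server import SimpleHTTPRequestHandler

def playback_headers() -> dict:
    """Headers that let an embedding page on another origin fetch the archive
    file, and that let intermediaries cache the archive content."""
    return {
        "Access-Control-Allow-Origin": "*",
        "Cache-Control": "public, max-age=3600",
    }

class ArchiveHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds playback-friendly headers to every response."""
    def end_headers(self):
        for name, value in playback_headers().items():
            self.send_header(name, value)
        super().end_headers()

# To serve ./archives (e.g. ./archives/example.wacz) on localhost, uncomment:
# from functools import partial
# from http.server import ThreadingHTTPServer
# ThreadingHTTPServer(("127.0.0.1", 8000),
#                     partial(ArchiveHandler, directory="archives")).serve_forever()
```

The embed page itself would then point a replayweb.page instance at the served archive file; the two-way communication layer sits on top of this, between the embedding page and the embedded player.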

This is one of two proposals relating to this work. We believe there is an IIPC audience who could attend either or both sessions based on their interests. This session will dive into the technical research conducted at the lab and present those findings.

Combined with the emergence of the WACZ packaging format, client-side playback is a radically different take on web archive replay that allows for the implementation of previously unachievable embedding scenarios. This session will explore the technical opportunities and challenges client-side playback presents from a performance, security, ease-of-access, and programmability perspective by going over concrete implementations of this technology in Perma.cc and warc-embed.

Sustaining pywb through community engagement and renewal: recent roadmapping and development as a case study in open source web archiving tool sustainability

Tessa Walsh


IIPC’s adoption of pywb as the “go-to” open source web archive replay system for its members, along with Webrecorder’s support for transitioning to pywb from other “wayback machine” replay systems, brings a large new user base to pywb. In the interests of ensuring pywb continues to sustainably meet the needs of IIPC members and the greater web archiving community, Webrecorder has been investing in maintenance and new releases for the current 2.x release series of pywb, as well as engaging in the early stages of a significant 3.0 rewrite. These changes are being driven by a community roadmapping exercise with members of the IIPC oh-sos (Online Hours: Supporting Open Source) group and other pywb community stakeholders.

This talk will outline some of the recent feature and maintenance work done in pywb 2.7, including a new interactive timeline banner which aims to promote easier navigation and discovery within web archive collections. It will go on to discuss the community roadmapping process for pywb 3.0 and give an overview of the proposed new architecture, perhaps including an early demo if development has progressed far enough by May 2023 to support one.

The talk will aim not only to share specific information about pywb and the efforts being put into its sustainability and maintenance by both Webrecorder and the IIPC community, but also to use pywb as a case study in the resilience, sustainability, and renewal of open source software tools that enable web archiving for all. The pywb codebase is, after all, nearly a decade old and has gone through several significant rewrites, as well as eight years of regular maintenance by Webrecorder staff and open source contributors, to reach its current state. That history makes it a prime example of how ongoing effort and community involvement make all the difference in building sustainable open source web archiving tools.

Addressing the Adverse Impacts of JavaScript on Web Archives

Ayush Goel1, Jingyuan Zhu1, Ravi Netravali2, Harsha V. Madhyastha1

1University of Michigan, United States of America; 2Princeton University, United States of America

Over the last decade, the presence of JavaScript code on web pages has dramatically increased. While JavaScript enables websites to offer a more dynamic user experience, its increasing use adversely impacts the fidelity of archived web pages. For example, when we load snapshots of JavaScript-heavy pages from the Internet Archive, we find that many are missing important images and JavaScript execution errors are common.

In this talk, we will describe the takeaways from our research on how to archive and serve pages that are heavily reliant on JavaScript. Via fine-grained analysis of JavaScript execution on 3,000 pages spread across 300 sites, we find that the root cause of the poor fidelity of archived page copies is that the execution of JavaScript code on the web often depends on the characteristics of the client device on which it runs. For example, JavaScript on a page can execute differently based on whether the page is loaded on a smartphone or on a laptop, or whether the browser used is Chrome or Safari; even subtle differences like whether the user's network connection is over 3G or WiFi can affect JavaScript execution. As a result, when a user loads an archived copy of a page in their browser, JavaScript on the page might attempt to fetch a different set of embedded resources (i.e., images, stylesheets, etc.) than those fetched when the copy was crawled. Since a web archive cannot serve resources that it did not crawl, the user sees an improperly rendered page, both because of missing content and because of JavaScript runtime errors.

A web archive cannot account for these sources of non-deterministic JavaScript execution by crawling every page in all possible execution environments (client devices, browsers, etc.), as doing so would significantly inflate the cost of archiving. Instead, if we augment archived JavaScript so that the code on any archived page always executes exactly as it did when the page was crawled, we can ensure that all archived pages match their original versions on the web, both visually and functionally.
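The failure mode described above can be sketched in a few lines: the resources a replay client requests but the crawler never captured are exactly the set difference between the two fetch sets. (All URLs below are hypothetical; this is an illustration of the argument, not the authors' measurement code.)

```python
def missing_resources(crawled, requested):
    """Resources a replay client asks for that the archive never captured."""
    return sorted(set(requested) - set(crawled))

# Resources fetched when the crawler loaded the page (say, desktop Chrome).
crawled = {
    "https://example.org/app.js",
    "https://example.org/style.css",
    "https://example.org/hero-desktop.jpg",
}

# Resources the page's JavaScript requests when replayed on a smartphone:
# a user-agent check made it select a different image variant.
requested = {
    "https://example.org/app.js",
    "https://example.org/style.css",
    "https://example.org/hero-mobile.jpg",
}

# The archive cannot serve this resource, so the replayed page renders
# with a missing image (and possibly JavaScript errors).
print(missing_resources(crawled, requested))
# → ['https://example.org/hero-mobile.jpg']
```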


What if GitHub disappeared tomorrow?

Emily Escamilla, Michele Weigle, Michael Nelson

Old Dominion University, United States of America

Research is reproducible when the methodology and data originally presented by the researchers can be used to reproduce the results. Reproducibility is critical for verifying and building on results, both of which benefit the scientific community. Correct implementation of the original methodology and access to the original data are the linchpins of reproducibility, and researchers increasingly put the exact implementation of their methodology in online repositories like GitHub. In our previous work, we analyzed the arXiv and PubMed Central (PMC) corpora and found 219,961 URIs to GitHub in scholarly publications; in 2021, one in five arXiv publications contained at least one link to GitHub. These findings indicate researchers' increasing reliance on the holdings of GitHub to support their research. So, what if GitHub disappeared tomorrow? Where could we find archived versions of the source code referenced in scholarly publications? Internet Archive, Zenodo, and Software Heritage are three digital libraries that may contain archived versions of a given repository. However, none is guaranteed to contain a given repository, and the method for accessing its code varies across the three. Additionally, Internet Archive, Zenodo, and Software Heritage approach archiving from different perspectives and use cases that may affect reproducibility. Internet Archive is a web archive; its crawler archives a GitHub repository as a web page, not specifically as a code repository. Zenodo allows researchers to publish source code and data and to share them with a DOI. Software Heritage allows researchers to preserve source code and issues permalinks for individual files and even lines of code. In this presentation, we will answer the questions: What if GitHub disappeared tomorrow? What percentage of scholarly repositories are in Internet Archive, Zenodo, and Software Heritage? What percentage of scholarly repositories would be lost? Do the archived copies available in these three digital libraries facilitate reproducibility? How can other researchers access source code in these digital libraries?
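The scale of this reliance can be gauged with simple link extraction. The sketch below is a simplified illustration (the regex and the lookup URL shapes are assumptions for demonstration, not the method or endpoints used in the study):

```python
import re

# Simplified pattern for links to GitHub repositories (owner/repo).
GITHUB_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")

def github_uris(text):
    """Return the distinct GitHub repository URIs found in a publication's text."""
    return sorted(set(GITHUB_RE.findall(text)))

def lookup_urls(repo):
    """Places where an archived copy of `repo` might be found. The URL shapes
    are illustrative of each service's public interface, not guaranteed."""
    return {
        "internet_archive": f"https://web.archive.org/web/*/{repo}",
        "software_heritage": f"https://archive.softwareheritage.org/browse/origin/?origin_url={repo}",
    }

abstract = (
    "Our code is available at https://github.com/example/replication "
    "and our data at https://example.org/data."
)
print(github_uris(abstract))
# → ['https://github.com/example/replication']
```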

Web archives and FAIR data: exploring the challenges for Research Data Management (RDM)

Sharon Healy1, Ulrich Karstoft Have2, Sally Chambers3, Ditte Laursen4, Eld Zierau4, Susan Aasman5, Olga Holownia6, Beatrice Cannelli7

1Maynooth University; 2NetLab; 3KBR & Ghent Centre for Digital Humanities; 4Royal Danish Library; 5University of Groningen; 6IIPC; 7School of Advanced Study, University of London

The FAIR principles imply “that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people” (Wilkinson et al., 2016). These principles present varying degrees of technical, legal, and ethical challenges in different countries when it comes to access and the reusability of research data. This equally applies to data in web archives (Boté & Térmens, 2019; Truter, 2021). In this presentation we examine the challenges for the use and reuse of data from web archives from both the perspectives of web archive curators and users, and we assess how these challenges influence the application of FAIR principles to such data.

Researchers' use of web archives has increased steadily in recent years, across a multitude of disciplines and using multiple methods (Maemura, 2022; Gomes et al., 2021; Brügger & Milligan, 2019). This development implies a diversity of requirements regarding the RDM lifecycle for the use and reuse of web archive data. Nonetheless, very little research has examined the challenges researchers face in applying FAIR principles to the data they use from web archives.

To better understand current research practices and RDM challenges for this type of data, we undertook a series of semi-structured interviews with researchers who use web or social media archives in their research and with cultural heritage institutions interested in improving access to their born-digital archives for research.

Through an analysis of the interviews, we offer an overview of several aspects that present challenges for the application of FAIR principles to web archive data. We assess how current RDM practices transfer to such data from both a researcher and an archival perspective, including an examination of how FAIR web archives currently are (Chambers, 2020). We also look at the legal and ethical challenges experienced by creators and users of web archives, and how they affect the application of FAIR principles and cross-border data sharing. Finally, we explore some of the technical challenges and discuss methods for extracting datasets from web archives using reproducible workflows (Have, 2020).

Lessons Learned in Hosting the End of Term Web Archive in the Cloud

Mark Phillips1, Sawood Alam2

1University of North Texas, United States of America; 2Internet Archive, United States of America

The End of Term (EOT) Web Archive is composed of member institutions across the United States who have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains, documenting the transition in the Executive Branch of the US Federal Government. In years without a presidential transition, these crawls served as systematic crawls of the .gov domain, forming a longitudinal dataset. In 2022 the EOT team from the UNT Libraries and the Internet Archive moved nearly 700 TB of primary WARC content and derivative formats into the cloud. The goal of this work was to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 storage service as part of Amazon's Open Data Sponsorship Program. In addition to these common formats in the web archive community, the EOT team modeled its work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. This presentation will discuss the lessons learned in staging and moving these web archives into AWS and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets, further grouped by original crawling institution. We will give examples of how content staged in this manner can be used by researchers both inside and outside a collecting institution to answer questions about these web archives that had previously been challenging to answer. The EOT team will also discuss the documentation and training efforts underway to help researchers incorporate these datasets into their work.
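A Parquet-based layout means researchers can answer collection-level questions with plain SQL. A minimal sketch of the idea (sqlite3 stands in here for a columnar engine such as Amazon Athena or DuckDB so the example is self-contained, and the column names are assumptions loosely modeled on Common Crawl's columnar index, not the actual EOT schema):

```python
import sqlite3

# Build a tiny stand-in for a CDX index table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cdx (url TEXT, fetch_time TEXT, mime TEXT, status INTEGER)")
con.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?)", [
    ("https://www.nasa.gov/", "20201103000000", "text/html", 200),
    ("https://www.nasa.gov/logo.png", "20201103000001", "image/png", 200),
    ("https://www.usda.gov/", "20201103000002", "text/html", 404),
    ("https://www.nps.gov/", "20201104120000", "text/html", 200),
])

# How many successful captures does the dataset hold per MIME type?
rows = con.execute("""
    SELECT mime, COUNT(*) AS captures
    FROM cdx
    WHERE status = 200
    GROUP BY mime
    ORDER BY captures DESC
""").fetchall()
print(rows)
# → [('text/html', 2), ('image/png', 1)]
```

Against the real datasets the same query shape would run directly over the Parquet files in S3, without copying 700 TB of WARCs locally.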


Preservability and Preservation of Digital Scholarly Editions

Michael Kurzmeier1, James O'Sullivan1, Mike Pidd2, Orla Murphy1, Bridgette Wessels3

1University College Cork, Ireland; 2University of Sheffield; 3University of Glasgow

Digital Scholarly Editions (DSEs) are web resources and thus subject to data loss. While DSEs are usually the result of funded research, their longevity and preservation are uncertain. DSEs might be partially or completely captured during web archiving crawls, in some cases making web archives the only remaining publicly available source of information about a DSE. Patrick Sahle's Catalogue of DSEs (2020) lists ~800 URLs referring to DSEs, of which 46 refer to the Internet Archive. This overlap between DSEs and web archives highlights the need for a closer look at the longevity and archiving of these important resources. This presentation will introduce a recent study on the availability and longevity of DSEs and present different preservation models and examples specific to DSEs. Examples of lost and partially preserved editions will be used to illustrate the problem of preservation and preservability of DSEs. The presentation will also outline the specific challenges of archiving DSEs.

The C21 Editions project is a three-year international research collaboration investigating the state of the art and the future of DSEs. As part of the project's output, this presentation will introduce the main data sources on DSEs and demonstrate the workflow used to assess DSE availability over time. It will illustrate the role web archives play in the preservation of DSEs as well as highlight the specific challenges DSEs present to web archiving. Because DSEs are complex projects featuring multiple layers of data, transcription and annotation, their full preservation usually requires ongoing maintenance of an often custom-built backend system. Once project funding ends, these structures are very prone to deterioration and loss. Besides ongoing maintenance, other preservation models exist, generally reducing the archiving scope in order to reduce the ongoing work required (Dillen 2019; Pierazzo 2019; Sahle and Kronenwett 2016); editions built with standard rather than bespoke solutions are more likely to be fully preserved. Other approaches include "preservability by design" through minimal computing (Elwert n.d.) or standardization through existing services such as DARIAH or GitHub. The presentation will outline these models using examples of successful preservation as well as lost editions.

This presentation is part of the larger C21 Editions project, a three-year international collaboration jointly funded by the Arts & Humanities Research Council (AH/W001489/1) and Irish Research Council (IRC/W001489/1).

Collecting and presenting complex digital publications

Ian Cooke, Giulia Carla Rossi

The British Library, United Kingdom

'Emerging Formats' is a term used by UK legal deposit libraries to describe experimental and innovative digital publications for which there are no collection management solutions that operate at scale. These publications are important to the libraries and their users: they document a period of creativity and rapid change, often include authors and experiences less well represented in mainstream publishing, and are at high risk of loss. For six years, the UK legal deposit libraries have been working collaboratively and experimentally both to survey the types of publications and to test approaches to collection that will support preservation, discovery and access. An important concept in this work has been 'contextual collecting', which seeks to preserve the best possible archival instance of a work alongside information documenting how the work was created and how it was experienced by users.

Web archiving has formed an important part of this work, both in providing practical tools to support collection management, including access, and in supporting the collection of contextual information. An example can be seen in the New Media Writing Prize thematic collection: https://www.webarchive.org.uk/en/ukwa/collection/2912

In this presentation, we will step back from specific examples, and talk about what we have learned so far from our work as a whole. We will outline how this work, including user research and engagement, has shaped policy at the British Library, through the creation of our 'Content Development Plan' for Emerging Formats, and the role of web archiving within that plan.

This presentation contributes to the Collections themes of 'blurring the boundaries between web archives and other born digital collections' and 'reuse of web archived materials for other born digital collections'. It builds on previous presentations to Web Archive Conference, which have focused on specific challenges related to collecting complex digital publications, to demonstrate how this research has informed the policy direction at the British Library and how web archiving infrastructure will be built in to efforts to collect, assess and make accessible new publications.

What can web archiving history tell us about preservation risks?

Susanne van den Eijkel, Daniel Steinmeier

KB, National Library of the Netherlands

When people talk about the necessity of preservation, the first thing that comes to mind is the supposed risk of file format obsolescence. Within the preservation community there have been voices raising the concern that this might not be the most pressing risk. If we are actually solving the wrong problem, this means we neglect the real problem. Therefore, it is important to know that the solutions we create are solving demonstrably real problems. Web archiving could be a great source of information for researching the most urgent risks, because developments and standards on the web are very fluid. There are examples of file formats on the web, such as Flash, that are not supported anymore by modern browsers. However, these formats can still be rendered using widely available software. We have also seen that website owners migrated their content from Flash to HTML5. So, can we really say that obsolescence has resulted in loss of data? How can we find out more about this? And more importantly, can we find out which risks are actually more relevant?

At the National Library of the Netherlands, we have been working on building a web collection since 2007. By looking at a few historical webpages we will illustrate where to look for answers and how to formulate better preservation risks using source data and context information. At iPres2022 we have presented a short paper on the importance of context information for web collections. This information helps us in understanding the scope and the creation process of the archived website. In this presentation, we will demonstrate how we use this context information to search out sustainability risks for web collections. This will also give us insight into sustainability risks in general so we can create better informed preservation strategies.

Towards an effective long-term preservation of the web. The case of the Publications Office of the EU

Corinne Frappart

Publications Office of the European Union, Luxembourg

Much has been written about web archiving in general: new and improved methods to capture the World Wide Web and to facilitate access to the resulting archives are constantly being described and shared. But when it comes to the long-term preservation of websites, i.e. safeguarding ARC/WARC files with proper planning of preservation actions beyond simple bit preservation, the literature is much less abundant.

The Publications Office of the EU is responsible for the preservation of the websites authored by the EU institutions. In addition to our activities in harvesting and making accessible the content through our public web archive (https://op.europa.eu/en/web/euwebarchive), we started to delve more deeply into the management of content preserved for the long-term.

Our reflection focused on long-term risks such as obsolescence or loss of file useability, and on the availability of a disaster recovery mechanism for the platform providing access to the web archive. Ingesting web archive files into a long-term preservation system raises many questions:

  • Should we expect different difficulties with ARC and WARC files? Is it worth migrating the ARC files to WARC files, and having a consistent collection on which the same tools can be applied?
  • Does ARC/WARC file compression impact the storage, the processing time, the preservation actions?
  • What is the best granularity for the preservation of web archives?
  • Should the characterization of the numerous files embedded in ARC/WARC files occur during or after ingestion? With which impact on the preservation actions?
  • How can descriptive, technical and provenance metadata be enriched, possibly automatically, and where can they be stored?
  • What kind of information about the context of the crawls, the format description and the data structure should be also preserved to help future users to understand the content of the ARC/WARC files?
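Several of these questions presuppose the ability to enumerate and characterize the records packed inside ARC/WARC containers. As a hedged sketch of that first step, the toy parser below walks the record headers of a well-formed, uncompressed WARC byte stream (real tooling such as warcio additionally handles gzip compression, ARC files and malformed records):

```python
def warc_record_headers(data: bytes):
    """Yield the header fields of each record in an uncompressed WARC byte stream."""
    pos = 0
    while pos < len(data):
        end = data.find(b"\r\n\r\n", pos)
        if end == -1:
            break
        lines = data[pos:end].decode("utf-8").split("\r\n")
        headers = dict(line.split(": ", 1) for line in lines[1:])  # lines[0] is "WARC/1.0"
        yield headers
        # Skip past the payload (Content-Length bytes) and the \r\n\r\n separator.
        pos = end + 4 + int(headers["Content-Length"]) + 4

# A tiny synthetic record; real WARCs would come from a crawler.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
types = [h["WARC-Type"] for h in warc_record_headers(record * 2)]
print(types)
# → ['response', 'response']
```

A characterization pass, whether run during or after ingestion, would extend this to tally Content-Type and similar fields across every embedded file.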

To get advice about these and other questions, the Publications Office commissioned a study of published and grey literature, supplemented by a series of interviews with leading institutions in the field of web archiving. This paper presents the findings and offers recommendations on how to answer the questions above.


Maintenance Practices for Web Archives

Ed Summers, Laura Wrubel

Stanford University, United States of America

What makes a web archive an archive? Why don’t we call them web collections instead, since they are resources that have been collected from the web and made available again on the web? Perhaps one reason that the term archive has stuck is that it entails a commitment to preserving the collected web resources over time, and making continued access to them available. Just like the brick and mortar buildings that must be maintained to house traditional archives, web archives are supported by software and hardware infrastructure that must be cared for in order to ensure that the web archives remain accessible. In this talk we will present some examples of what this maintenance work looks like in practice drawing from experiences at Stanford University Libraries (SUL).

While many organizations actively use third-party services like Archive-It, PageFreezer, and ArchiveSocial to create web archives, it is less common for them to retrieve the collected data and make it available outside that service platform. Starting in 2012, SUL has been building web archive collections as part of its general digital collections using tools such as HTTrack, CDL's Web Archiving Service, Archive-It and, more recently, Webrecorder. These collections were made available using the OpenWayback software, but in 2022 SUL switched to pywb.

We will discuss some of the reasons why Stanford initially found it important to host its own web archive replay service and what factors led to the switch to pywb. Work such as reindexing and quality assurance testing was integral to the move, and in turn generated new knowledge about the web archive records as well as new practices for transitioning them into the new software environment. The acquisition and preservation of, and access to, web archives has been incorporated into the microservice architecture of the Stanford Digital Repository. One key benefit of this mainstreaming is shared terminology, infrastructure and maintenance practices for web archives, which is essential for sustaining the service. We will conclude with some consideration of what these local findings suggest about successfully maintaining open source web archiving software as a community.

Radical incrementalism and the resilience and renewal of the National Library of Australia's web archiving infrastructure

Alex Osborne, Paul Koerbin

National Library of Australia, Australia

The National Library of Australia’s web archiving program is one of the world’s earliest established and longest continually sustained operations. From its inception it was focused on establishing and delivering a functional operation as soon as feasible. This work historically included the development of policy, procedures and guidelines; together with much effort working through the changing legal landscape, from a permissions-based operation to one based on legal deposit warrant.

Changes to the Copyright Act (1968) in 2016, which extended legal deposit to online materials, gave impetus to the NLA's strategic priorities of more comprehensive collecting and of expanding open access to its entire web archive corpus. This also had significant implications for the NLA's online collecting infrastructure: in part confronting and dealing with a large legacy of web content collected by various tools and structured in disparate forms, and in part rebuilding the collecting workflow infrastructure while sustaining and redeveloping existing collaborative collecting processes.

After establishing this historic context, this presentation will focus attention on the NLA’s approach to the development of its web archiving infrastructure – an approach described as radical incrementalism: taking small, pragmatic steps that lead over time to achieving major objectives. While effective in providing the way to achieve strategic objectives, this approach can also build a legacy of infrastructural dead-weight that needs to be dealt with in order to continue to sustain and renew the dynamic and challenging task of web archiving. With a radical team restructure and an agile and iterative approach to development, the NLA has made significant progress in recent times in moving from a legacy infrastructure to one of renewed sustainability and flexibility in application.

This presentation will highlight some of the recent developments in the NLA's web archiving infrastructure, such as the web archive collection management system (comprising 'Bamboo' and 'OutbackCDX') and the web archive workflow management tool, 'PANDAS'.

Arquivo.pt behind the curtains

Daniel Gomes

FCT: Arquivo.pt, Portugal

Arquivo.pt is a governmental service that enables search and access to historical information preserved from the Web since the 1990s. The search services provided by Arquivo.pt include full-text search, image search, version history listing, advanced search and application programming interfaces (APIs). Arquivo.pt has been running as an official public service since 2013, but in the same year its system totally collapsed due to a severe hardware failure and an over-optimistic architectural design. Since then, Arquivo.pt has been completely renewed to improve its resilience. At the same time, Arquivo.pt has been widening the scope of its activities by improving the quality of the acquired web data and deploying online services of general interest to public administration institutions, such as the Memorial, which preserves the information of historical websites, and Arquivo404, which fixes broken links on live websites. These innovative offerings require resilient services that are constantly available.

The Arquivo.pt hardware infrastructure is hosted at its own data centre and managed by full-time dedicated staff. The preservation workflow is performed through a large-scale information system distributed over about 100 servers. This presentation will describe the software and hardware architectures adopted to maintain the quality and resilience of Arquivo.pt. These architectures were "designed to fail", following a "share-nothing" paradigm, and continuous integration tools and processes are essential to assuring the resilience of the service. The Arquivo.pt online services are supported by 14 micro-services that must be kept permanently available; the software architecture is composed of 8 systems hosting 35 components, and the hardware architecture is composed of 9 server profiles. The average availability of the online services provided by Arquivo.pt in 2021 was 99.998%. Web archives must urgently assume their role in digital societies as memory keepers of the 21st century. The objective of this presentation is to share our lessons learned at a technical level so that other initiatives may develop at a faster pace using the most adequate technologies and architectures.

Implementing access to and management of archived websites at the National Archives of the Netherlands

Antal Posthumus

Nationaal Archief, The Netherlands

The National Archives of the Netherlands, as a permanent government agency and official archive for the central government, has the legal duty, laid down in the Archiefwet, to secure the future of the government record. This proposal focuses on how we developed the infrastructure and processes of our trusted digital repository (TDR) for the ingestion, storage, management and preservation of, and access to, archived public websites of the Dutch central government.

In 2018 we’ve issued a very well received guideline on archiving websites (2018), We tried to involve our producers in the drafting process of the guidelines in their development. Part of which was to organize a public review. We received no less than 600 comments from 30 different organizations, which enabled us to improve the guidelines and immediately bring them to the attention of potential future users.

These guidelines were also used as part of the requirements of a public European tender (2021). The objective of the tender was to realize a central harvesting platform (hosted at https://www.archiefweb.eu/openbare-webarchieven-rijksoverheid/) to structurally harvest circa 1,500 public websites of the central government. This enabled us as an archival institution to influence the desired outcome of the harvesting process for these 1,500 websites, owned by all Ministries and most of their agencies.

A main challenge was that our off-the-shelf version of the OpenWayback viewer was not a complete version of the software and therefore could not render increments or provide a calendar function, one of the key elements of the minimum viable product we aimed for. We opted for pywb based on what we learned through the IIPC community about the transition from OpenWayback to pywb. Our technical team found pywb very simple to install. One issue we did encounter is that the TDR software does not support linking to this (or any) external viewer, which forces us to copy all WARC files from our TDR into the viewer. This deviates from our current workflow, and it also means we need twice as much disk space.


Memory in Uncertainty – The Implications of Gathering, Storing, Sharing and Navigating Browser-based Archives

Cade Diehm, Benjamin Royer

New Design Congress, Germany

How do we save the past in a violent present for an uncertain future? As societal digitisation accelerates, so too has the belligerence of state and corporate power, the democratisation of targeted harassment, and the collapse of consent by communities plagued by ongoing (and often unwanted) datafication. Drawing from political forecasts and participatory consultation with practitioners and communities, this research examines the physical safety of data centres, the socio-technical issues of the diverse practice of web-based archiving, and the physical and mental health of archive practitioners and communities subjected to archiving. This research identifies and documents issues of ethics, consent, digital security, colonialism, resilience, custodianship and tool complexity. Despite the systemic challenges identified in the research, and the broad lag in response from tool makers and other actors within the web archiving discipline, there exist compelling reasons to remain optimistic. Emergent technologies, stronger socio-technical literacy amongst archivists, and critical interventions in the colonial structures of digital systems offer immediate points of intervention. By acknowledging the shortcomings of cybernetics, resisting the desire to apply software solutionism at scale, and developing a nuanced and informed understanding of the realities of archiving in digitised societies, a broad surface of opportunities can emerge to develop resilient, considered, safe and context-sensitive archival technologies and practice for our uncertain world.

To preserve this memory, click here. Real-time public engagement with personal digital archives

Marije Miedema, Susan Aasman, Sabrina Sauer

University of Groningen, Centre for Media and Journalism Studies

Digital collections aim to reflect our personal and collective histories, which are shaped by, and concurrently shape, our memories. While advances are being made in web archival practices in the public domain, personal digital material is mostly preserved with commercially driven technologies. This is worrying: although it may seem that these privately-owned cloud services are spaces where our precious pictures will exist forever, we know that long-term, sustainable archiving is not these service providers' primary concern. This demo is part of the first stages of fieldwork in a PhD project that explores alternative approaches to sustainable everyday archival data management. Through participatory research methods, such as co-designing prototypes, we aim to establish a public-private-civic collaboration to rethink our relationship with the personal digital archive. Turning to the question of what digital material we throw away, discard, or forget about, we want to contribute to existing knowledge on how to manage the growing amount of digital stuff.

Translating this question into an interactive installation, the demo combines human and technological performativity employing participatory, playful methods to let conference participants materialize their reflections on their engagement with their digital archives, from their professional and personal perspective. This demo invites conference participants to actively engage with the question of responsibility regarding the future of our personal digital past; is there a role to play for public institutions next to the commitment of individuals to commercially driven storage technologies? The researchers will consider the privacy of the participants throughout the duration of the demo. Through this demo, the community of (web) archivists is involved in the early stages of the project’s co-creative research practices, which aim to build lasting connections with these important stakeholders.

Participatory Web Archiving: A Roadmap for Knowledge Sharing

Cui Cui1,2

1Bodleian Libraries University of Oxford, United Kingdom; 2Information School University of Sheffield

In recent years, community participation seems to have become a desirable step in developing web archives. Participatory practices in the cultural heritage sector are not new (Benoit & Eveleigh, 2019). The practice of working in collaboration with different community partners to build archives is underway in conventional archives (Cook, 2013). Indeed, it has now become one of the main themes of web archival development on both theoretical and practical levels.

Although involving wider communities is often regarded as an approach to democratise practices, it has been debated whether community participation can lead to improved representation. At the same time, the significant impact that participatory practices have on creating and sharing knowledge should not be underestimated. My current PhD research seeks to understand how participatory practices have been deployed in web archiving, their mechanisms and their impacts.

Since April 2022, I have worked as a web archivist for the Archive of Tomorrow project, developing various sub-collections on topics relating to cancer, Covid-19, food, diet, nutrition, and wellbeing. The project, funded by the Wellcome Trust, aims to explore and preserve online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a 'Talking about Health' collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources.

For this project, I have attempted to link theories with practices and applied various participatory methods in developing the collection, such as engaging with subject librarians, delivering a workshop co-curating a sub-collection, consulting academics to identify archiving priorities, co-curating a sub-collection with students from an internship scheme, and collaborating with a local patient support group. This poster reflects on how different approaches have been deployed and the lessons learned. It will highlight the transformative impact of participatory practices on sharing, creating and reconstructing knowledge.


Benoit, E., & Eveleigh, A. (2019). Defining and framing participatory archives in archival science. In E. Benoit & A. Eveleigh (Eds.), Participatory archives: theory and practice (pp. 1–12). London.

Cook, T. (2013). Evidence, memory, identity, and community: Four shifting archival paradigms. Archival Science, 13(2–3), 95–120. https://doi.org/10.1007/s10502-012-9180-7

WARC validation, why not?

Antal Posthumus, Jacob Takema

Nationaal Archief, The Netherlands

This lightning talk would like to tempt and to challenge the participants of the IIPC Web Archiving Conference 2023 to engage in an exchange of ideas, assumptions and knowledge about the subject of validating WARC-files and the use of WARC validation tools.

In 2021 we wrote an information sheet about WARC validation. During our desk research it became clear that most (inter)national colleagues who archive websites do not use WARC validation tools. Why not?

Most heritage institutions, national libraries and archives focus on safeguarding as much online content as possible before it disappears, based on an organizational selection policy. The other goal is to give access to the captured information as completely and quickly as possible, both to general users and to researchers. Both goals are at the core of web archiving initiatives, of course!

It seems as though little attention is given to aspects of quality control such as checking the technical validity of WARC files. Or are there other reasons not to pay much attention to this?

We would like to share some of our findings after deploying several tools for processing WARC files: JHOVE, JWAT, Warcat and Warcio. More tools are available, but in our opinion these four are the most commonly used, mature and actively maintained tools that can check or validate WARC files.

In our research into WARC validation, we noticed that some tools are validation tools that check conformance to WARC standard ISO 28500 and others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (of 2009). Few support version 1.1 (of 2017).
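The digest checks mentioned above are mechanically simple: WARC-Block-Digest and WARC-Payload-Digest header values are commonly recorded as an algorithm label followed by a Base32-encoded hash, so a validator recomputes the hash over the stored bytes and compares. A minimal sketch of that last step (the helper names are ours, not taken from any of the tools above):

```python
import base64
import hashlib

def warc_digest(data: bytes, algo: str = "sha1") -> str:
    # WARC digest header values are commonly written as
    # "<algo>:<Base32(hash)>", e.g. "sha1:3I42H3S6NN..." for empty data.
    digest = hashlib.new(algo, data).digest()
    return f"{algo}:{base64.b32encode(digest).decode('ascii')}"

def digest_matches(recorded: str, data: bytes) -> bool:
    # Recompute the digest over the stored block or payload bytes,
    # using the algorithm named in the recorded header value.
    algo, _, _ = recorded.partition(":")
    return warc_digest(data, algo) == recorded
```

A full validator such as JWAT or Warcio additionally parses record headers and checks structural conformance to ISO 28500; the digest comparison is only one part of the job.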

Another conclusion is that there is no one WARC validation tool ‘to rule them all’, so using a combination of tools will probably be the best strategy for now.

Sunsetting a digital institution: Web archiving and the International Museum of Women

Marie Chant

The Feminist Institute, United States of America

The Feminist Institute’s (TFI) partnership program helps feminist organizations sunset mission-aligned digital projects utilizing web archiving technology and ethnographic preservation to contextualize and honor the labor contributed to ephemeral digital initiatives. In 2021, The Feminist Institute partnered with Global Fund for Women to preserve the International Museum of Women (I.M.O.W.). This digital, social change museum built award-winning digital exhibitions that explored women’s contributions to society. I.M.O.W. initially aimed to build a physical space but shifted to a digital-only presence in 2005, opting to democratize access to the museum’s work. I.M.O.W.’s first exhibition, Imagining Ourselves: A Global Generation of Women, engaged and connected more than a million participants worldwide. After launching several successful digital collections, I.M.O.W. merged with Global Fund for Women in 2014. The organization did not have the means to continually migrate and maintain the websites as the underlying technology became obsolete, leaving gaps in functionality and access. Working directly with stakeholders from Global Fund for Women and the International Museum of Women, TFI developed a multi-pronged preservation plan that included capturing I.M.O.W.’s digital exhibitions using Webrecorder’s Browsertrix Crawler, harvesting and converting Adobe Flash assets, conducting oral histories with I.M.O.W. staff and external developers, and providing access through the TFI Digital Archive.

Visualizing web harvests with the WAVA tool

Ben O'Brien1, Frank Lee1, Hanna Koppelaar2, Sophie Ham2

1National Library of New Zealand, New Zealand; 2National Library of the Netherlands, Netherlands

Between 2020-2021, the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL) developed a new harvest visualization feature within the Web Curator Tool (WCT). This feature was demonstrated during a presentation at the 2021 IIPC WAC titled Improving the quality of web harvests using Web Curator Tool. During development it was recognised that the visualization tool could be beneficial to the web archiving community beyond WCT. This was also reflected in feedback received after the 2021 IIPC WAC.

The feature has now been ported to an accompanying stand-alone application called the WAVA tool (Web Archive Visualization and Analysis). This is a stripped-down version that contains the web harvest analysis and visualization without the WCT-dependent functionality, such as patching.

The WCT harvest visualization has been designed primarily for performing quality assurance on web archives. To avoid the traditional mess of links and nodes when visualizing URLs, the tool abstracts the data to a domain level. Aggregating URLs into groups of domains gives a higher-level overview of a crawl and allows for quicker analysis of the relationships between content in a harvest. The visualization consists of an interactive network graph of links and nodes that can be inspected, allowing a user to drill down to the URL level for deeper analysis.
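The domain-level aggregation described here can be illustrated in a few lines. This sketch is our own illustration of the idea, not WAVA's implementation: it collapses page-level links from a crawl into a weighted domain-to-domain edge list that is small enough to draw as a network graph.

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_graph(links):
    # links: iterable of (source_url, target_url) pairs from a crawl.
    # Aggregating URLs to their host collapses thousands of page-level
    # nodes into a small domain-level graph, with edge weights counting
    # how many page-level links each domain pair contributes.
    edges = Counter()
    for source, target in links:
        src = urlsplit(source).hostname
        dst = urlsplit(target).hostname
        if src and dst:
            edges[(src, dst)] += 1
    return edges
```

For example, two pages on `a.nz` each linking to `b.nz` become a single `("a.nz", "b.nz")` edge with weight 2, which a user could then expand back to the URL level for deeper analysis.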

NLNZ and KB-NL believe the WAVA tool can have many uses for the web archiving community. It lowers the barrier to investigating and understanding the relationships and structure of the web content that we crawl. What can we discover in our crawls that might improve the quality of future web harvests? The WAVA tool also removes technical steps that have in the past been a barrier to researchers visualizing web archive data. How many future research questions can be aided by its use?

In-Person Panels

SES-03 (PANEL): Institutional Web Archiving Initiatives to Support Digital Scholarship

Martin Klein1, Emily Escamilla2, Sarah Potvin3, Vicky Rampin4, Talya Cooper4

1Los Alamos National Laboratory, United States of America; 2Old Dominion University, United States of America; 3Texas A&M University, United States of America; 4New York University, United States of America

Panel description:
Scholarship happens on the web, but unlike more traditional output such as scientific papers in PDF format, we still lack comprehensive institutional web archiving approaches to capture increasingly prominent scholarly artifacts such as source code, datasets, workflows, and protocols. This panel will feature scholars from three different institutions - Old Dominion University, Texas A&M University, and New York University - who will provide an overview of their explorations in investigating the use of scholarly artifacts and their (in-)accessibility on the live web. The panelists will further outline how these findings inform institutional collection policies regarding such artifacts, web archiving efforts aligned with institutional infrastructure, and outreach and education opportunities for students and faculty. The panel will conclude with an interactive discussion while welcoming input and feedback from the WAC audience.



Title: Source Code Archiving for Scholarly Publications


Git Hosting Platforms (GHPs) are commonly used by software developers and scholars to host source code and data to make them available for collaboration and reuse. However, GHPs and their content are not permanent. Gitorious and Google Code are examples of GHPs that are no longer available even though users deposited their code expecting an element of permanence. Scholarly publications are well-preserved due to current archiving efforts by organizations like LOCKSS, CLOCKSS, and Portico; however, no analogous effort has yet emerged to preserve the data and code referenced in publications, particularly the scholarly code hosted online in GHPs. The Software Heritage Foundation is working to archive public source code, but issue threads, pull requests, wikis, and other features that add context to the source code are not currently preserved. Institutional repositories seek to preserve all research outputs which include data, source code, and ephemera; however, current publicly available implementations do not preserve source code and its associated ephemera, which presents a problem for scholarly projects where reproducibility matters. To discuss the importance of institutions archiving scholarly content like source code, we first need to understand the prevalence of source code within scholarly publications and electronic theses and dissertations (ETDs). We analyzed over 2.6 million publications across three categories of sources: preprints, peer-reviewed journals, and ETDs. We found that authors are increasingly referencing the Web in their scholarly publications with an average of five URIs per publication in 2021, and one in five arXiv articles included at least one link to a GHP. In this panel, we will discuss some of the questions that result from these findings such as: Are these GHP URIs still available on the live Web? Are they available in Software Heritage? Are they available in web archives and if so, how often and how well are they archived?
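The link analysis behind these findings can be approximated with a simple scan for GHP hostnames among a publication's extracted URIs. The following is a rough illustration of that idea under our own assumptions; the hostname list and function names are ours, not the authors' code:

```python
from urllib.parse import urlsplit

# Hostnames of some widely used Git hosting platforms (illustrative list).
GHP_HOSTS = {"github.com", "gitlab.com", "bitbucket.org", "sourceforge.net"}

def ghp_uris(uris):
    # Return the subset of URIs pointing at a Git hosting platform,
    # matching the host itself or any subdomain (e.g. gist.github.com).
    hits = []
    for uri in uris:
        host = (urlsplit(uri).hostname or "").lower()
        if any(host == h or host.endswith("." + h) for h in GHP_HOSTS):
            hits.append(uri)
    return hits
```

Counting `ghp_uris` hits per publication and per year is the kind of measurement that yields statistics such as "one in five arXiv articles included at least one link to a GHP".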


Title: Designing a Sociotechnical Intervention for Reference Rot in Electronic Theses


Intertwined publication and preservation practices have become widespread in the establishment of institutional digital repositories and libraries’ stewardship of institutional research output, including open educational resources and electronic theses and dissertations. Most digital preservation work seeks to preserve a whole text, like a dissertation, in a digital form. This presentation reports on an ongoing research effort - a collaboration with Klein, Potvin, Katherine Anders, and Tina Budzise-Weaver - intended to prevent potential information loss within the thesis, through interventions that can be integrated into trainings and thesis management tools. This approach draws on research into graduate training and citation practices, web archiving, open source software development, and digital collection stewardship with a goal of recommending systematized sociotechnical interventions to prevent reference rot in institutionally-hosted graduate theses. Findings from qualitative surveys and interviews conducted at Texas A&M University on graduate student perceptions of reference rot will be detailed.


Title: Collaborating on Software Archiving for Institutions


Inarguably, software and code are part of our scholarly record. Software preservation is a necessary prerequisite for long-term access and reuse of computational research, across many fields of study. Open research software is shared on the Web most commonly via Git hosting platforms (GHPs), which are excellent for fostering open source communities, transparency of research, and add useful features on top such as wikis, continuous integration, and merge requests and issue threads. However, the source code and the useful scholarly ephemera (e.g. wikis) are archived separately, often by “breadth over depth” approaches. I’ll discuss the Collaborative Software Archiving for Institutions (CoSAI) project from NYU, LANL, ODU, and OCCAM, which is addressing this pressing need to provide machine-repeatable, human-understandable workflows for preserving web-based scholarship, scholarly code in particular, alongside the components that make it most useful. I’ll present the results of ongoing efforts in the three main streams of work: 1) technical development on open source, community-led tools for collecting, curating, and preserving open scholarship with a focus on research software, 2) community building around open scholarship, software collection and curation, and archiving of open scholarship, and 3) optimizing workflows for archiving open scholarship with ephemera, via machine-actionable and manual workflows.

SES-04 (PANEL): SolrWayback: Best practice, community usage and engagement

Thomas Egense1, Laszlo Toth2, Youssef Eldakar3, Sara Aubry4, Anders Klindt Myrvoll1

1Royal Danish Library (KB); 2National Library of Luxembourg (BnL); 3Bibliotheca Alexandrina (BA); 4National Library of France (BnF)

Panel description

This panel will focus on the status quo of SolrWayback, implementations of SolrWayback and where it is heading in the future, including the growing open source community adopting SolrWayback and contributing to the tool's development, making it more resilient.

Thomas Egense will give an update on the current development and the flourishing user community and some thoughts on making SolrWayback even more resilient in the future.

Laszlo Toth will talk about the National Library of Luxembourg's (BnL) development of a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites. The solution combines the powerful features of SolrWayback, such as full-text search, wildcard search, category search and more, with the high playback quality of PyWb.

Youssef Eldakar will present how SolrWayback has enhanced the way researchers can search for content in, and view, the 18 IIPC special collections, and will also raise some considerations about scaling the system.

Sara Aubry will present how the National Library of France (BnF) has been using SolrWayback to give researcher teams the possibility to explore, analyze and visualize specific collections. She will also share how BnF contributed to the application development, including the extension of datavisualisation features.

Thomas Egense: Increasing community interactions and the near future of SolrWayback

During the last year, the number of community interactions, such as direct email questions and bug/feature requests posted on GitHub or Jira, has increased every week. It is indeed good news that so many libraries, institutions and researchers have already embraced SolrWayback, but to keep up this momentum, more community engagement will be welcome for this open source project.

By submitting a feature request or bug report on GitHub, you will help prioritize the changes that will benefit users the most, so do not hold back. More programmers for the backend (Java) or frontend (GUI) would speed up the development of SolrWayback.

Recently, BnF helped improve some of the visualization tools by allowing shorter time intervals instead of whole years. For newly established collections this is a much more useful visualization; it is a good example of how the needs of a collection just one year old differ from those of collections with 25 years of web harvests. It had not been in our focus, yet it proved a very useful improvement.

In the very near future I expect that more time will be spent supporting new users attempting to implement SolrWayback. The hybrid setup combining SolrWayback with PyWb for playback also seems to be the direction many choose to go. Finally, large collections will run into a Solr scaling problem that can be solved by switching to SolrCloud; there is a need for better documentation and workflow support in the SolrWayback bundle for this scaling issue.

Laszlo Toth: A Hybrid SolrWayback-PyWb playback system with parallel indexing using the Camunda Workflow Engine

Within the framework of its web archiving programme, the National Library of Luxembourg (BnL) is developing a fully automated archiving workflow comprising the capture, indexing and playback of Luxembourgish news websites.

Our workflow design takes into account several key features such as the efficiency of crawls (both in time and space) and of the indexing processes, all while providing high quality end user experience. In particular, we have chosen a hybrid approach for the playback of our archived content, making use of several well-known technologies in the field.

Our solution combines the powerful features of SolrWayback such as full-text search, wildcard search, category search and so forth, with the high playback quality of PyWb (for instance its ability to handle complex websites, in particular with respect to POST requests). Thus, once a website is harvested, the corresponding WARC files are indexed in both systems. Users are then able to perform fine-tuned searches using SolrWayback and view the chosen pages using PyWb. This also means that we need to store our indexes in two different places: the first is within an OutbackCDX indexing server connected to our PyWb instance, the second is a larger Solr ecosystem put in place specifically for SolrWayback. This parallel indexing process, together with the handling of the entire workflow from start to finish, is handled by the Camunda Workflow Engine, which we have configured in a highly flexible manner.

This way, we can quickly respond to new requirements, or even to small adjustments such as new site-specific behaviors. All of our updates, including new productive tasks or workflows, can be deployed on-the-fly without needing any downtime. This combination of technologies allows us to provide a seamless and automated workflow together with an enjoyable user experience. We will present the integrated workflow with Camunda and how users interact with the whole system.

Youssef Eldakar: Where We Are a Year Later with the IIPC Collections and Researcher Access through SolrWayback

One year ago, we presented a joint effort, spanning the IIPC Research Working Group, the IIPC Content Development Working Group, and Bibliotheca Alexandrina, to republish the IIPC collections for researcher access through alternative interfaces, namely, LinkGate and SolrWayback.

This effort aims to re-host the IIPC collections, originally harvested on Archive-It, at Bibliotheca Alexandrina with the purpose of offering researchers the added value of being able to explore a web archive collection as a temporal graph with the data indexed in LinkGate, as well as search the full text of a web archive collection and run other types of analyses with the data indexed in SolrWayback.

At the time of last year's presentation, the indexing of 18 collections, totalling approximately 30 TB compressed, for publishing through both LinkGate and SolrWayback was at an early stage. As part of this panel on SolrWayback, one year later, we present an update on what is now available to researchers after the progress made on indexing and tuning of the deployment, focusing on showcasing access to the data through the different tools found in the SolrWayback user interface.

We also present a brief technical overview of how the underlying deployment has changed to meet the demands of scaling up to the growing volume of data. We finally share thoughts on next steps. See the republished collections at https://iipc-collections.bibalex.org/ and the presentation from 2022.

Sara Aubry: SolrWayback at the National Library of France (BnF): an exploration tool for researchers and the web archiving team's engagement in contributing to its evolution

With the opening of its DataLab in October 2021 and the Respadon project (which will also be presented during the WAC), the BnF web archiving team is currently concentrating on the development of services, tools, methods and documentation to ease the understanding and appropriation of web archives for research. The underlying objective is to provide the research community, along with information professionals, with a diversity of tools dedicated to the building, exploration and analysis of web corpora. Among all the tools we have tested with researchers, SolrWayback has a particular place because of its ease of handling and its rich functionality. Beyond a first contact with the web archives, it allows researchers to question and analyze the focused collections to which it gives access. This presentation will focus on researchers' feedback on using SolrWayback, how the application promotes the development of skills in working with web archives, and how we accompany researchers in the use of this application. We will also present how research use and feedback have led us to contribute to the development of this open source tool.

SES-14 (PANEL): Renewal in Web Archiving: Towards More Inclusive Representation and Practices

Makiba Foster1, Bergis Jules2, Zakiya Collier3

1The College of Wooster; 2Archiving The Black Web; 3Shift Collective

“The future is already here, it's just not very equally distributed, yet” - William Gibson
In this session you will learn about a growing community of practice of independent yet interconnected projects whose work converges as an intervention to critically engage the practice of web archiving and make it more inclusive in terms of what gets web archived and who gets to build web archives. These projects reimagine a future for web archiving that distributes the practice and diversifies the collections.

Presentation 1- Archiving The Black Web

Author/Presenter: Makiba Foster, The College of Wooster and Bergis Jules, Archiving the Black Web

Abstract: Unactualized web archiving opportunities for Black knowledge-collecting institutions interested in documenting web-based Black history and culture have reached critical levels due to the expansive growth of content produced about the Black experience by Black digital creators. Archiving The Black Web (ATBW) works to establish more equitable, accessible, and inclusive web archiving practices to diversify not only collection practices but also its practitioners. Founded in 2019, ATBW's creators will discuss the collaborative catalyst for the creation and launch of this important DEI initiative within web archiving. In this panel session, attendees will learn more about ATBW’s mission to address web archiving disparities. ATBW envisions a future that includes cultivating a community of practice for Black collecting institutions, developing training opportunities to diversify the practice of web archiving, and expanding the scope of web archives to include culturally relevant web content.

Presentation 2 - Schomburg Syllabus

Author/Presenter: Zakiya Collier, Shift Collective

Abstract: From 2017-2019 the Schomburg Center for Research in Black Culture participated in the Internet Archive’s Community Webs program, becoming the first Black collecting institution to create a web archiving program centering web-based Black history and culture. Recognizing that content in crowdsourced hashtag syllabi could be lost to the ephemerality of the Web, the #HashtagSyllabusMovement collection was created to archive online educational material related to publicly produced, crowdsourced content highlighting race, police violence, and other social justice issues within the Black community. Both the first of its kind in focus and within The New York Public Library system, the Schomburg Center’s web archiving program faced challenges including but not limited to identifying ways to introduce the concept of web archiving to Schomburg Center researchers and community members, demonstrating the necessity of a well-supported web archiving program to Library administration, and expressing the urgency needed in centering Black content on the web that may be especially ephemeral, such as content associated with struggles for social justice. It was necessary for the Schomburg Center to not only continue their web archiving efforts with the #Syllabus and other web archive collections, but also develop strategies to invoke the same sense of urgency and value for Black web archive collections that we now see demonstrated in the collection of analog records documenting Black history, culture and activism, especially as social justice organizing efforts increasingly have online components.

As a result, the #SchomburgSyllabus project was developed to merge web archives and analog resources from the Schomburg Center in celebration of Black people's longstanding self-organized educational efforts. #SchomburgSyllabus uniquely organizes primary and secondary sources into a 27-themed web-based resource guide that can be used for classroom curriculum, collective study, self-directed education, and social media and internet research. Tethering web-archived resources to the Schomburg Center’s world-renowned physical collections of Black diasporic history has proven key in garnering support for the Schomburg’s web archiving program and enthusiasm for the preservation of the Black web, as demonstrated by the #SchomburgSyllabus’ use in classrooms, inclusion in journal articles, and features in cultural/educational TV programs.


WKSHP-01: Describing Collections with Datasheets for Datasets

Emily Maemura1, Helena Byrne2

1University of Illinois; 2British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. For example, Dooley and Bowers (2018) propose recommendations for descriptive metadata, and Maemura et al. (2018) develop a framework for documenting elements of a collection’s provenance. Additionally, documentation of the data processing and curation steps towards generating a corpus for computational analysis is described extensively in Brügger (2021), Brügger, Laursen and Nielsen (2019), and Brügger, Nielsen and Laursen (2020). However, looking beyond libraries, archives, or cultural heritage settings provides alternative forms for the description of data. One approach to the challenge of describing large datasets comes from the field of machine learning, where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle.

This workshop explores how web archives collections can be described using the framework provided by Datasheets for Datasets. Specifically, this work builds on the template for datasheets developed by Gebru et al. that is arranged into seven sections: Motivation; Composition; Collection Process; Preprocessing/Cleaning/Labeling; Use; Distribution; and, Maintenance. The workflow they present includes a total of 57 questions to answer about a dataset, focusing on the specific needs of machine learning researchers. We consider how these questions can be adopted for the purposes of describing web archives datasets. Participants will consider and assess how each question might be adapted and applied to describe datasets from the UK Web Archive curated collections. After a brief description of the Datasheets for Datasets framework, we will break into small groups to perform a card-sorting exercise. Each group will evaluate a set of questions from the Datasheets framework and assess them using the MoSCoW technique, sorting questions into categories of Must, Should, Could, and Won't have. Groups will then describe their findings from the card-sorting exercise in order to generate a broader discussion of priorities and resources available for generating descriptive metadata and documentation for public web archives datasets.
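As a concrete starting point, the seven datasheet sections can be represented as a simple structure that curators fill in question by question. This skeleton is a sketch of ours, not part of the workshop materials; the section names follow Gebru et al.'s headings, the example question is the opening question of their Motivation section, and the answer is purely illustrative:

```python
# The seven sections of a Gebru et al. datasheet, each holding
# (question, answer) pairs filled in by the collection's curators.
DATASHEET_SECTIONS = (
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing/Cleaning/Labeling",
    "Use",
    "Distribution",
    "Maintenance",
)

def new_datasheet():
    # Start from an empty datasheet: one list of Q&A pairs per section.
    return {section: [] for section in DATASHEET_SECTIONS}

sheet = new_datasheet()
sheet["Motivation"].append((
    "For what purpose was the dataset created?",  # from Gebru et al.
    "A curated UK Web Archive collection on a given theme.",  # illustrative
))
```

The card-sorting exercise then amounts to deciding, per question, whether it belongs in a web archive collection's datasheet at all and at what priority.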

Format: 120-minute workshop where participants will do a card-sorting activity in small groups to review the practicalities of the Datasheets for Datasets framework when applied to web archives. Ideally, participants can prepare by reading through the questions prior to the workshop.

We anticipate the following schedule:

  • 5 min: Introduction
  • 15 min: Overview of Datasheets for Datasets
  • 5 min: Overview of UKWA Datasets
  • 60 min: Card-sorting Exercise in small groups
  • 5 min: Comfort Break
  • 20 min: Discussion of small group findings
  • 5 min: Conclusion and Wrap-up

Target Audience: Web Archivists, Researchers

Anticipated number of participants: 12-16

Technical requirements: overhead projector with computer and large tables for a big card sorting activity.

Learning outcomes:

  • Raise awareness of the Datasheets for Datasets Framework in the web archiving community.
  • Understand what type of descriptive metadata web archive experts think should accompany web archive collections published as data.
  • Generate discussion and promote communication between web archivists and research users on priorities for documentation.

Coordinators: Emily Maemura (University of Illinois), Helena Byrne (British Library)

Emily Maemura is an Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She completed her PhD at the University of Toronto's Faculty of Information, with a dissertation exploring the practices of collecting and curating web pages and websites for future use by researchers in the social sciences and humanities.

Helena Byrne is the Curator of Web Archives at the British Library. She was the Lead Curator on the IIPC Content Development Group 2022, 2018 and 2016 Olympic and Paralympic collections. Helena completed a Master’s in Library and Information Studies at University College Dublin, Ireland in 2015. Previously she worked as an English language teacher in Turkey, South Korea, and Ireland. Helena is also an independent researcher who focuses on the history of women's football in Ireland. Her previous publications cover both web archives and sports history.


Brügger, N. (2021). Digital humanities and web archives: Possible new paths for combining datasets. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00038-z

Brügger, N., Laursen, D., & Nielsen, J. (2019). Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015. In N. Brügger & D. Laursen (Eds.), The historical web and digital humanities: The case of national web domains (pp. 124–142). Routledge/Taylor & Francis Group.

Brügger, N., Nielsen, J., & Laursen, D. (2020). Big data experiments with the archived Web: Methodological reflections on studying the development of a nation’s Web. First Monday. https://doi.org/10.5210/fm.v25i3.10384

Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. OCLC Research. https://doi.org/10.25333/C3005C

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Maemura, E., Worby, N., Milligan, I., & Becker, C. (2018). If These Crawls Could Talk: Studying and Documenting Web Archives Provenance. Journal of the Association for Information Science and Technology, 69(10), 1223–1233. https://doi.org/10.1002/asi.24048

 WKSHP-02: A proposed framework for using AI with web archives in LAMs

Abigail Potter

Library of Congress, United States of America

There is tremendous promise in using artificial intelligence, and specifically machine learning techniques, to help curators, collections managers and users understand, use, steward and preserve web archives. Libraries, archives, museums and other public cultural heritage organizations that manage web archives have shared challenges in operationalizing AI technologies and unique requirements for managing digital heritage collections at a very large scale. Through research, experimentation and collaboration, the LC Labs team has developed a set of tools to document, analyze, prioritize and assess AI technologies in a LAM context. This framework is in draft form and in need of additional use cases and perspectives, especially web archives use cases. The facilitators will introduce the framework and ask participants to use the proposed framework to evaluate their own proposed or in-process ML or AI use case that increases understanding of and access to web archives.

The goals of the workshop are sharing the framework elements, gathering feedback, and documenting web archives use cases. Sample elements and prompts from the framework:

  • Organizational Profile: How will or does your organization want to use AI or machine learning?
  • Define the problem you are trying to solve.
  • Write a user story about the AI/ML task or system you are planning or doing.
  • Risks and Benefits: What are the benefits and risks to users, staff and the organization when an AI/ML technology is or will be used?
  • What systems or policies will or does the AI/ML task or system impact or touch?
  • What are the limitations of future use of any training, target, validation or derived data?
  • Data Processing Plan: What documentation will you require when using AI or ML technologies? What existing open-source or commercial platforms offer pathways into the use of AI?
  • What are the success metrics and measures for the AI/ML task?
  • What are the quality benchmarks for the AI/ML output?
  • What could come next?

 WKSHP-03: Fake it Till You Make it: Social Media Archiving at Different Organizations for Different Purposes

Susanne van den Eijkel1, Zefi Kavvadia2, Lotte Wijsman3

1KB, National Library of the Netherlands; 2International Institute for Social History; 3National Archives of the Netherlands


Different organizations, different business rules, different choices. That seems obvious. However, different perspectives can alter the choices that you make and therefore the results you get when you’re archiving social media. In this tutorial, we would like to zoom in on the different perspectives an organization can have. A perspective can be shaped by a mandate or type of organization, the designated community of an institution, or a specific tool that you use. Therefore, we would like to highlight these influences and how they can affect the results that you get.

When you start with social media archiving, you won’t get the best results right away. It is really a process of trial and error, where you aim for good practice and not necessarily best practice (and is there such a thing as best practice?). With a practical assignment we want to showcase the importance of collaboration between different organizations. What are the worst practices that we have seen so far? What’s best to avoid, and why? What could be a solution? And why is it a good idea to involve other institutions at an early stage?

This tutorial relates to the conference topics of community, research and tools. It builds on previous work from the Dutch Digital Heritage Network and the BeSocial project from the National Library of Belgium. Furthermore, different tools will be highlighted and it will be made clear why different tooling can lead to different results.


In-person tutorial, 90 minutes.

  • Introduction: who are the speakers, where do they work, introduction on practices related to different organizations.
  • Assignment: participants will do a practical assignment related to social media archiving. They’ll receive personas for different institutions (library, government, archive) and ask themselves the question: how does your own organization's perspective influence the choices you make? We will gather the results on Post-its and end with a discussion.
  • Wrap-up: conclusions of discussion.

Target audience

This tutorial is aimed at those who want to learn more about doing social media archiving at their organizations. It is mainly meant for starters in social media archiving, but not necessarily complete beginners (even though they are definitely welcome too!). Potential participants could be archivists, librarians, repository managers, curators, metadata specialists, (research) data specialists, and generally anyone who is or could be involved in the collection and preservation of social media content for their organization.

Expected number of participants: 20-25.

Expected learning outcome(s)

Participants will understand:

  1. Why social media archiving is different from web archiving;
  2. Why different perspectives lead to different choices and results;
  3. How tools can affect the potential perspectives you can work with.

In addition, participants will get insight into:

  1. The different perspectives from which you can do social media archiving;
  2. How different organizations (could) work on social media archiving.


Susanne van den Eijkel is a metadata specialist for digital preservation at the National Library of the Netherlands. She is responsible for all the preservation metadata, writing policies and implementing them. Her main focus is born-digital collections, especially the web archives. She focuses on web material after it has been harvested, rather than on selection and tools, and is therefore more involved with which metadata and context information is available and relevant for preservation. In addition, she works on the communication strategy of her department, is actively involved in the Dutch Digital Heritage Network, and provides guest lectures on digital preservation and web archiving.

Zefi Kavvadia is a digital archivist at the International Institute of Social History in Amsterdam, the Netherlands. She is part of the institute’s Collections Department, where she is responsible for processing of digital archival collections. She is also actively contributing to research, planning, and improving of the IISH digital collections workflows. While her work covers potentially any type of digital material, she is especially interested in the preservation of born-digital content and is currently the person responsible for web archiving at IISH. Her research interests range from digital preservation and archives, to web and social media archiving, and research data management, with a special focus on how these different but overlapping domains can learn and work together. She is active in the web archiving expert group of the Dutch Digital Heritage Network and the digital preservation interest group of the International Association of Labour History Institutions.

Lotte Wijsman is the Preservation Researcher at the National Archives in The Hague. In her role she researches how we can further develop preservation at the National Archives of the Netherlands and how we can innovate the archival field in general. This includes considering our current practices and evaluating how we can improve these with e.g. new practices and tools. Currently, Lotte is active in research projects concerning subjects such as social media archiving, AI, a supra-organizational Preservation Watch function, and environmentally sustainable digital preservation. Furthermore, she is a guest teacher at the Archiefschool and Reinwardt Academy (Amsterdam University of the Arts).

 WKSHP-04: Browser-Based Crawling For All: Getting Started with Browsertrix Cloud

Andrew N. Jackson1, Anders Klindt Myrvoll2, Ilya Kreymer3

1The British Library, United Kingdom; 2Royal Danish Library; 3Webrecorder

Through the IIPC-funded “Browser-based crawling system for all” project, members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results. We will then discuss and reflect on the results.

After a quick break, we will discuss how the web archives can be accessed and shared with others, using the ReplayWeb.page viewer. Participants will be able to download the contents of their crawls (as WACZ files) and load them on their own machines. We will also present options for sharing the outputs with others directly, by uploading to an easy-to-use hosting option such as Glitch or our custom WACZ Uploader. Either method will produce a URL which participants can then share with others, in and outside the workshop, to show the results of their crawl. We will discuss how, once complete, the resulting archive is no longer dependent on the crawler infrastructure, but can be treated like any other static file, and, as such, can be added to existing digital preservation repositories.
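Because a downloaded WACZ is, per the WACZ packaging approach described above, an ordinary static file (a ZIP containing a datapackage.json manifest alongside WARC data and indexes), it can be inspected with standard tools. The sketch below is illustrative only, and builds a minimal stand-in package so it is self-contained; real WACZ files produced by Browsertrix contain more resources than this.

```python
import io
import json
import zipfile

def wacz_summary(wacz_file):
    """List the members of a WACZ package and the resources its manifest declares."""
    with zipfile.ZipFile(wacz_file) as z:
        manifest = json.loads(z.read("datapackage.json"))
        return {
            "files": z.namelist(),
            "resources": [r.get("path") for r in manifest.get("resources", [])],
        }

# Build a minimal stand-in WACZ so this sketch can run without a real crawl.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("datapackage.json",
               json.dumps({"resources": [{"path": "archive/data.warc.gz"}]}))
    z.writestr("archive/data.warc.gz", b"")

summary = wacz_summary(buf)
print(summary["resources"])  # ['archive/data.warc.gz']
```

This property, that the archive is self-describing and needs no crawler infrastructure to read, is what makes WACZ files easy to drop into existing digital preservation repositories.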

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn’t, what still needs improvement, etc. We will also discuss how participants can add the web archives they created into existing web archives that they may already have, and how Browsertrix Cloud can fit into and augment existing web archiving workflows at participants' institutions. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud - 10 min
  • Use Cases and Examples by IIPC project partners - 10 min
  • Break - 5 min
  • Hands-On - Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running) - 30 min
  • Break - 5 min
  • Hands-On - Replaying and Sharing Web Archives - 10 min
  • Wrap-Up - Final Q&A / Discuss Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners - 20 min

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially those that are generally difficult to crawl by other means and require a ‘high fidelity approach’. (Examples include social media sites, sites that are behind a paywall, etc.) Ideally, the sites can be crawled during the course of 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone wishing to create and use web archives who are ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise besides basic familiarity with web archiving. The participants’ experiences and feedback will help shape not only the remainder of the IIPC project, but the long-term future of this new crawling toolset.

The workshop should be able to accommodate up to 50 participants.

 WKSHP-05: Supporting Computational Research on Web Archives with the Archive Research Compute Hub (ARCH)

Jefferson Bailey, Kody Willis, Helge Holzmann

Internet Archive, United States of America


  • Jefferson Bailey, Director of Archiving & Data Services, Internet Archive
  • Kody Willis, Product Operations Manager, Archiving & Data Services, Internet Archive
  • Helge Holzmann, Senior Data Engineer, Archiving & Data Services, Internet Archive
  • An Archives Unleashed member may also coordinate/participate

Format: 90 or 120-minute workshop and tutorial

Target Audience: The target audience is professionals working in digital library services that are collecting, managing, or providing access to web archives, scholars using web archives and other digital collections in their work, library professionals working to support computational access to digital collections, and digital library technical staff.

Anticipated Number of Participants: 25

Technical Requirements: A meeting room with wireless internet access and a projector or video display. Participants must bring laptop computers and there should be power outlets. The coordinators will handle preliminary activities over email and provide some technical support beforehand as far as building or accessing web archives for use in the workshop.

Abstract: Every year more and more scholars are conducting research on terabytes and even petabytes of digital library and archive collections using computational methods such as data mining, natural language processing, and machine learning. Web archives are a significant collection of interest for these researchers, especially due to their contemporaneity, size, multi-format nature, and how they can represent different thematic, demographic, disciplinary, and other characteristics. Web archives also have longitudinal complexity, with frequent changes in content (and often state of existence) even at the same URL, gobs of metadata both content-based and transactional, and many characteristics that make them highly suitable for data mining and computational analysis. Supporting computational use of web archives, however, poses many technical, operational, and procedural challenges for libraries. Similarly, while platforms exist for supporting computational scholarship on homogenous collections (such as digitized texts, images, or structured data), none exist that handle the vagaries of web archive collections while also providing a high level of automation, seamless user experience, and support for both technical and non-technical users.

In 2020, Internet Archive Research Services and the Archives Unleashed received funding for joint technology development and community building to combine their respective tools that enable computational analysis of web and digital archives in order to build an end-to-end platform supporting data mining of web archives. The program also simultaneously is building out a community of computational researchers doing scholarly projects via a program supporting cohort teams of scholars that receive direct technical support for their projects. The beta platform, Archives Research Compute Hub (ARCH), is currently being used by dozens of researchers in the digital humanities, social and computer science researchers, and by dozens of libraries and archives that are interested in supporting local researchers and sharing datasets derived from their web collection in support of large-scale digital research methods.

ARCH lowers the barriers to conducting research with web archives, using data processing operations to generate 16 different derivatives from WARC files. Derivatives range in use from graph analysis and text mining to file format extraction, and ARCH makes it possible to visualize, download, and integrate these datasets into third-party tools for more advanced study. ARCH enables analysis of the more than 20,000 web archive collections - over 3 PB of data - collected by over 1,000 institutions using Archive-It, which cover a broad range of subjects and events; ARCH also includes various portions of the overall Wayback Machine global web archive totalling 50+ PB and going back to 1996.
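To give a feel for downstream analysis of such derivatives outside the platform, the sketch below assumes a simple domain-frequency CSV; the sample data, column names, and file layout are assumptions for illustration, and the actual ARCH derivative formats may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a domain-frequency derivative exported as CSV.
sample = """domain,count
example.org,120
example.com,85
archive.org,60
"""

def top_domains(csv_text, n=2):
    """Return the n most frequent domains from a domain-count CSV."""
    rows = csv.DictReader(io.StringIO(sample if csv_text is None else csv_text))
    counts = Counter({r["domain"]: int(r["count"]) for r in rows})
    return counts.most_common(n)

print(top_domains(sample))  # [('example.org', 120), ('example.com', 85)]
```

In practice a derivative like this would be downloaded from ARCH and loaded into a notebook or a tool such as Gephi for graph analysis; the point is that the derivative is plain tabular data, far smaller and simpler to handle than the source WARCs.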

This workshop will be a hands-on training covering the full lifecycle of supporting computational research on web archives. The agenda will include an overview of the conceptual challenges researchers face when working with web archives, the procedural challenges that librarians face in making web archives available for computational use, and most importantly, will provide an in-depth tutorial on using the ARCH platform and its suite of data analysis, dataset generation, data visualization, and data publishing tools, both from the perspective of a collection manager, a research services librarian, and a computational scholar. Workshop attendees will be able to build small web archive collections beforehand or will be granted access to existing web archive collections to use during the workshop. All participants will also have access to any datasets and data visualizations created as part of the workshop.

Anticipated Learning Outcomes:

Given the conference, we expect the attendees primarily to be web archivists, collection managers, digital librarians, and other library and archives staff. After the workshop, attendees will:

  • Understand the full lifecycle of making web and digital archives available for computational use by researchers, scholars, and others. This includes gaining knowledge of outreach and promotion strategies to engage research communities, how to handle computational research requests, how to work with researchers to scope and refine their requests, how to make collections available as data, how to work with internal technical teams facilitating requests, dataset formats and delivery methods, and how to support researchers in ongoing data analysis and publishing.
  • Gain knowledge of the specific types of data analysis and datasets that are possible with web archive collections, including data formats, digital methods, tools, infrastructure requirements, and the related methodological affordances and limitations for scholarship related to working with web archives as data.
  • Receive hands-on training on using the ARCH platform to explore and analyze web archive collections, from both the perspective of a collection manager and that of a researcher.
  • Be able to use the ARCH platform to generate derivative datasets, create corresponding data visualizations, publish these datasets to open-access repositories, and conduct further analysis with additional data mining tools.
  • Have tangible experience with datasets and related technologies in order to perform specific analytic tasks on web archives, such as exploring graph networks of domains and hyperlinks, extracting and visualizing images and other specific formats, and performing textual analysis and other interpretive functions.
  • Have insights into digital methods through their exposure to a variety of different active, real-life use cases from scholars and research teams currently using the ARCH platform for digital humanities and similar work.

 WKSHP-06: Run your own full stack SolrWayback

Thomas Egense, Toke Eskildsen, Jørn Thøgersen, Anders Klindt Myrvoll

Royal Danish Library, Denmark

An in-person, updated version of the ‘21 WAC workshop, Run your own full stack SolrWayback.

This workshop will

  1. Explain the ecosystem for SolrWayback 4 (https://github.com/netarchivesuite/solrwayback)
  2. Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to mirror the process on their own computer and there will be time for solving installation problems
  3. Leave participants with a fully working stack for index, discovery and playback of WARC files
  4. End with open discussion of SolrWayback configuration and features.


  • Participants should have a Linux, Mac or Windows computer with Java 8 or Java 11 installed. To check that Java is installed, type this in a terminal: java -version
  • Downloading the latest release of the SolrWayback bundle from https://github.com/netarchivesuite/solrwayback/releases beforehand is recommended.
  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles
  • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities in the best way possible.
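For participants who want to script the Java pre-flight check above, a small sketch follows. The version-string parsing is a best-effort assumption, since `java -version` banners vary by vendor; only the `java -version` command itself comes from the workshop instructions.

```python
import re
import subprocess

def parse_java_major(version_output):
    """Extract the major Java version from a `java -version` banner."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        return None
    major = int(m.group(1))
    # Java 8 and earlier report themselves as 1.x
    if major == 1 and m.group(2):
        return int(m.group(2))
    return major

def java_major_version():
    """Run `java -version`; return the major version, or None if not installed."""
    try:
        out = subprocess.run(["java", "-version"],
                             capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # java prints its version banner on stderr, not stdout
    return parse_java_major(out.stderr)

print(parse_java_major('openjdk version "11.0.2" 2019-01-15'))  # 11
print(parse_java_major('java version "1.8.0_292"'))             # 8
```

A result of 8 or 11 means the machine meets the stated requirement for running the SolrWayback bundle.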

Target audience:

Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

Maximum number of participants


SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction & visualisation, data export and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at https://webadmin.oszk.hu/solrwayback/
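Since Solr sits underneath the stack, the index can also be queried directly through Solr's standard /select endpoint. The sketch below only builds such a query URL; the host, port, collection name, and field name are assumptions for illustration, not SolrWayback's documented configuration.

```python
from urllib.parse import urlencode

def solr_select_url(base, query, rows=10):
    """Build a standard Solr /select query URL."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"

# Hypothetical local instance and collection name.
url = solr_select_url("http://localhost:8983/solr/netarchivebuilder",
                      'content:"web archive"')
print(url)
```

Sending this URL with any HTTP client would return matching index documents as JSON, which is one way researchers can work with the index outside the SolrWayback UI.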

During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen.

Online Sessions


Uncovering the (paper) traces of the early Belgian web

Bas Vercruysse1, Julie Birkholz1,2, Friedel Geeraert2

1Ghent University, Belgium; 2KBR (Royal Library of Belgium)

The Belgian web began in June 1988 when EARN and Eunet introduced the .be domain. In December 1993 the first .be domain names were registered and in 1994 there were a total of 129 registered .be names.[1] Documentation about the early Belgian web is scarce and the topic has not yet been researched in depth. In other European countries such as France and The Netherlands, specific projects have been set up to document the early national web.[2] This study of the early Belgian web therefore helps to complete the history of the early web in Europe and understand the specific dynamics that led to the emergence of the web in Belgium.

Records of the early Belgian web include: published lists of domain names of interest to Belgians held in the collections of KBR (e.g. publications such as the Belgian Web Directory, published from 1997 to 1998, and the Web Directory, published from 1998 to 2000), archived early Belgian websites preserved in the Wayback Machine since 1996, archives of organisations such as DNS Belgium (the registry for the .be, .brussels and .vlaanderen domains), etc.

This archival information provides a slice of the information needed to understand the emergence of the early web in Belgium, yet it is clear that social actors who played key roles in developing the Belgian web are not always recorded in the few archival records that remain. By combining these “paper traces” of the early Belgian web with semi-structured interviews with key actors in Belgium (e.g. long-time employees of DNS Belgium, instigators of the .be domain name, first users of and researchers on the web) we are able to reconstruct the history of the start of the web in Belgium.

In this presentation, we will report on this research that stitches together the first traces of the early Belgian web.

[1] DNS Belgium. (2019). De historiek van DNS Belgium. Available online at: https://www.dnsbelgium.be/nl/over-dns-belgium/de-historiek-van-dns-belgium.

[2] De Bode, P., Teszelszky, K. (2018). Web collection internet archaeology Euronet-Internet (1994-2017). Available online at: https://lab.kb.nl/dataset/web-collection-internet-archaeology-euronet-internet-1994-2017; Bibliothèque nationale de France. (2018). Web90 - Patrimoine, Mémoires et Histoire du Web dans les années 1990. Available online at: https://web90.hypotheses.org/tag/bnf.

The Lifranum research project: building a collection on French-speaking literature

Christian Cote2, Alexandre Faye1, Christine Genin1, Kevin Locoh-Donou1

1French national library, France; 2University of Lyon 3 (Jean Moulin), France

Many amateur and professional writers have taken to the web since its very beginning to share their writings and personal diaries, engaging in the first forums. These practices increased with the rise of blogging platforms in the 2000s. Authors have used hypertext link possibilities to develop a new digital sociability and a common transnational creative network.

The Lifranum research project brings together researchers from several disciplines. Its objective is to provide an original platform within a thematic web archive as corpora and to develop enhanced search features. The indexing scheme takes into account advances in automatic style analysis. In this context, researchers and librarians have defined complementary needs considering the web archive collection to be built and have tested new methods to design the corpora and carry out the crawls.

During this presentation, we will share the challenges we encountered and the experience we gained while building this large thematic corpus, from the selection phase to the crawl processes. The following aspects will be discussed:

- text indexing and text analyzing issues;

- methods for building large thematic corpora using Hyphe, a tool developed by SciencesPo for exploring the web, building corpora, analyzing links between websites, and adding annotations;

- managing quantity and quality on blogging platforms;

- documenting the choices made when processing data.

The presentation will also compare web archive logics to scientific approaches focused on a specific type of data (text, image, video) that is exposed using APIs and easier to analyze. We will question the contributions and limits of this type of collection, launched in partnership within the framework of a research project, which enriches the archives through more methodical explorations of the web, anticipated qualitative controls and the production of reusable documentation.

The video will be available in French with English subtitles.

Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources

Sharon Healy1, Juan-José Boté-Vericad2, Helena Byrne3

1Maynooth University; 2Universitat de Barcelona; 3British Library

In this presentation, we explore the development of a reborn digital archival edition (RDAE) as a hybrid approach for the collection, organisation, and analysis of reborn digital materials (Brügger, 2018; 2016) that are accessible through public web archives. Brügger (2016) describes reborn digital media as media that has been collected and preserved and has undergone a change due to this process, such as emulations of computer games or materials in a web archive. Further to this, we explore the potential of an RDAE as a method to enable the sharing and reuse of such data. As part of this, we use a case study of the press/media statements of Irish politician, poet, and sociologist Michael D. Higgins from 2002-2011. For the most part, these press statements were once available on the website of Michael D. Higgins, who has been the serving Irish President since 2011. Higgins’s website disappeared from the live web sometime after the 2011 Presidential Election took place (27 October 2011) and sometime before Higgins was inaugurated (11 November 2011). Using the NLI Web Archive (National Library of Ireland) and the Wayback Machine (Internet Archive), this project sought to find and collect traces of these press statements and bring them together as an RDAE. In doing so, we use Zotero open-source citation management software for collecting, organising, and analysing the data (archived web pages). We extract the text, and use screenshot software to capture an image of the archived web page. Thereafter, we utilise Omeka open-source software as a platform for presenting the data (screenshot/metadata/transcription) as a curated thematic collection of reborn digital materials, offering search and discovery functions through free-text search, metadata fields and subject headings. To end, we use DROID open-source software for organising the data for long-term preservation, and the Open Science Framework as a platform for sharing derivative materials and datasets.
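Finding traces of a vanished site in the Wayback Machine can be scripted against the Internet Archive's public CDX API. The sketch below only constructs such a query; the target URL is hypothetical and purely illustrative, not the actual address of the site studied here.

```python
from urllib.parse import urlencode

def cdx_query_url(url, from_year="2002", to_year="2011"):
    """Build a CDX API query listing captures of url and everything under it."""
    params = {
        "url": url,
        "matchType": "prefix",   # include all paths under the given URL
        "from": from_year,
        "to": to_year,
        "output": "json",
    }
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)

# Hypothetical example: list 2002-2011 captures under example.ie/press
q = cdx_query_url("example.ie/press")
print(q)
```

Fetching this URL returns one row per capture (timestamp, original URL, status, digest), which can then be cited in Zotero and matched against the NLI Web Archive's holdings.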


Brügger, N. (2016). Digital Humanities in the 21st Century: Digital Material as a Driving Force. Digital Humanities Quarterly, 10(3). Retrieved from http://www.digitalhumanities.org/dhq/vol/10/3/000256/000256.html

Brügger, N. (2018). The Archived Web: Doing History in the Digital Age. The MIT Press.


Web Archiving en español: Barriers to Accessing and Using Web Archives in Latin America

Alan Colin-Arce1, Sylvia Fernández-Quintanilla2, Rosario Rogel-Salazar1, Verónica Benítez-Pérez1, Abraham García-Monroy1

1Universidad Autónoma del Estado de México, Mexico; 2University of Texas at San Antonio

Web archives have been growing in popularity in Global North countries as a way of preserving a part of their political, cultural, and social life carried out online. However, their spread to other regions has been slower because of several technical, economic, and social barriers. In this presentation, we will discuss the main limitations in the uptake of web archiving in Spanish-speaking Latin American countries and the implications for the access and use of web archives.

The first barrier to web archiving in these countries is the lack of awareness of web archives among librarians and archivists. According to Scopus data from 2022, out of 909 documents with the words “web archiv*” in the title, abstract, or keywords, only 10 papers are from Spanish-speaking Latin American countries. Worldwide web archiving surveys record no initiatives from Latin America yet (D. Gomes et al., 2011; P. Gomes, 2020), and we could identify only 5 Latin American institutions on Archive-It, none of which had active public collections in 2022.

Another barrier is the cost of web archiving services like Archive-It, which can be unaffordable for many institutions in the region. Even if institutions can afford these services or use free tools, most web archiving software is available only in English and does not smoothly support multilingual collections or collections in languages other than English (for example, the default metadata fields for collections and seeds are in English, and adding them in other languages is not straightforward).

This unequal access to web archives between the Global North and South comes with the risk that Global North websites get preserved, organized, accessed, and used, while Latin America and other regions continue depending solely on third parties like the Internet Archive to preserve their websites.

A possible solution for raising awareness of web archives is developing workshops and mentorship programs for Latin American librarians and digital humanists looking to start web archiving. For the linguistic barrier, translating the documentation of web archiving tools into other languages can be a first step toward encouraging their use in Latin America.


Gomes, D., Miranda, J., & Costa, M. (2011). A Survey on Web Archiving Initiatives. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), Research and Advanced Technology for Digital Libraries (pp. 408–420). Springer. https://doi.org/10.1007/978-3-642-24469-8_41

Gomes, P. (2020). Map of Web archiving initiatives. Own work. https://commons.wikimedia.org/wiki/File:Map_of_Web_archiving_initiatives_58632FRT.jpg


All Our Yesterdays: A toolkit to explore web archives in Colab

Tim Ribaric, Sam Langdon

Brock University, Canada

The rise of Jupyter notebooks, and particularly Google Colab, has created an easy-to-use and accessible platform for those interested in exploring computational methods. This is especially the case when performing research using web archives. However, the question remains: how to start? For those without an extensive background in programming, this can be an insurmountable challenge. Enter the All Our Yesterdays Toolkit (AOY-TK). This suite of notebooks and associated code provides a scaffolded introduction to opening, analyzing, and generating insights from web archives. With tight integration with Google Drive and text analysis tools, it provides a comprehensive answer to that very question of how to start. Development of AOY-TK is made possible by grant funding, and this session will discuss progress to date and provide some brief case-study examples of the types of analysis possible using the toolkit.
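To give a flavour of the kind of notebook-based analysis such a toolkit scaffolds (a generic sketch, not AOY-TK's actual code; the sample texts are invented), a first exercise might count frequent terms in text extracted from archived pages:

```python
from collections import Counter
import re

# Invented sample of text extracted from archived pages, for illustration.
page_texts = [
    "Web archives preserve the web for future research.",
    "Research with web archives often starts with text extraction.",
]

def top_terms(texts, n=3, stopwords=frozenset({"the", "for", "with", "often"})):
    """Count the most frequent terms across extracted page texts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts.most_common(n)

top = top_terms(page_texts)
```

In a Colab notebook the same pattern scales up: the extracted text comes from WARC/WACZ files mounted via Google Drive, and the counts feed plotting or topic-modelling cells.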

Using Web Archives to Model Academic Migration and Identify Brain Drain

Mat Kelly, Deanna Zarrillo, Erjia Yan

Drexel University, United States of America

Academic faculty members may change their institutional affiliation over the course of their career. In the case of Historically Black Colleges and Universities (HBCUs) in the United States, which make substantial contributions to the preparation of Black professionals, keeping the most talented Black students and faculty from moving to non-HBCUs (thus preventing “brain drain”) is often a losing battle. This project seeks to investigate the effects of academic mobility at the institutional and individual level, measuring the potential brain drain from HBCUs. To accomplish this, we consult web archives to identify past captures of academic institutions and their departments and extract faculty names, titles, and affiliations at various points in time. By analyzing HBCUs’ faculty lists over time, we will be able to model academic migration and quantify the degree of brain drain.

This NSF-sponsored project is in the early stages of execution and is a collaboration between Drexel University, Howard University, University of Tennessee - Knoxville, and University of Wisconsin - Madison. We are currently in the data collection stage, which entails leveraging an open-source Memento aggregator to consult international web archives and potentially improve the quality and quantity of captures of past versions of HBCU sites. In this initial stage, we have encountered caveats in extracting this data efficiently, established a systematic methodology for applying this approach beyond our initial use cases, and identified potential ethical dilemmas around individuals’ past information being uncovered and highlighted without their explicit consent. During the first year of the project, we have refined our approach to facilitate better data quality for subsequent steps in the process and to emphasize recall. This presentation will describe some of these nuances of our collaborative project as well as highlight the next steps for identifying brain drain from HBCUs using web archives.
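The aggregation step can be illustrated with a small sketch of parsing a Memento TimeMap in the RFC 7089 link format. The TimeMap fragment below is invented for illustration; a real aggregator (such as the open-source MemGator) would fetch and merge TimeMaps like this from many web archives.

```python
import re

# A fragment of a link-format TimeMap (RFC 7089), invented for illustration.
timemap = (
    '<http://example.edu/>; rel="original",\n'
    '<https://web.archive.org/web/19970101000000/http://example.edu/>; '
    'rel="first memento"; datetime="Wed, 01 Jan 1997 00:00:00 GMT",\n'
    '<https://web.archive.org/web/20200101000000/http://example.edu/>; '
    'rel="memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT"'
)

def parse_mementos(timemap_text):
    """Extract (uri, datetime) pairs for every memento link in a TimeMap."""
    mementos = []
    for link in timemap_text.split(",\n"):
        m = re.match(r'<([^>]+)>;\s*rel="([^"]*memento[^"]*)";\s*datetime="([^"]+)"', link)
        if m:
            mementos.append((m.group(1), m.group(3)))
    return mementos

captures = parse_mementos(timemap)
```

Each (uri, datetime) pair corresponds to one archived capture of a departmental page, from which faculty listings at that point in time can then be extracted.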


Lessons Learned From the Longitudinal Sampling of a Large Web Archive

Kritika Garg1, Sawood Alam2, Michele Weigle1, Michael Nelson1, Corentin Barreau2, Mark Graham2, Dietrich Ayala3

1Old Dominion University, Norfolk, Virginia - USA; 2Internet Archive, San Francisco, California - USA; 3Protocol Labs, San Francisco, California - USA

We document the strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years of the Internet Archive's holdings (1996–2021). Our overall project goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, and in particular, to reconsider the question, "how long does a web page last?" Addressing this question requires obtaining a "representative sample of the web." We proposed several orthogonal dimensions for sampling URLs from the archived web: time of first archive, MIME type (HTML vs. other types), URL depth (top-level pages vs. deep links), and TLD. We sampled 285 million URLs from IA's ZipNum index file, which contains every 6,000th line of the CDX index. These include URLs of embedded resources, such as images, CSS, and JavaScript. To limit our samples to web pages, we filtered the URLs for likely HTML pages (based on filename extensions). We determined the time of first archive and the MIME type using IA's CDX API, and grouped the 92 million URLs with "text/html" MIME types by the year of first archive. Archiving speed and capacity have increased significantly over the years, so we found fewer URLs archived in the early years than in later ones. Hence, we adjusted our goal of 1 million URLs per year and clustered the early years (1996–2000) to reach that size (1.2 million URLs). We noticed an increase in deep links archived over the years, and extracted top-level URLs from the deep links to upsample the earlier years. We found that popular domains like Yahoo and Twitter were over-represented in the IA, so we performed logarithmic-scale downsampling based on the number of URLs sharing a domain. Given the collection size, we employed various sampling strategies to ensure fairness in domain and temporal representation. Our final dataset contains TimeMaps of 27.3 million URLs comprising 3.8 billion archived pages from 1996 to 2021. We convey the lessons learned from sampling the archived web, which could inform other studies that sample from web archives.
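Two of the steps described above, filtering sampled URLs for likely HTML pages by filename extension and logarithmic-scale downsampling of over-represented domains, can be sketched as follows. This is an illustrative toy on invented sample data; the actual pipeline ran against IA's ZipNum index and CDX API, and the quota function here is an assumption, not the paper's formula.

```python
import math
from collections import defaultdict

# Invented (url, first_capture_year) pairs standing in for CDX-derived data.
sampled = [
    ("http://example.com/index.html", 1998),
    ("http://example.com/logo.png", 1998),
    ("http://example.com/a/page.html", 2005),
    ("http://news.example.org/story", 2005),
]

HTML_LIKE = (".html", ".htm", "/")  # crude filename-extension heuristic

def likely_html(url: str) -> bool:
    """Heuristically keep URLs that probably resolve to HTML pages."""
    path = url.split("?", 1)[0]
    return path.endswith(HTML_LIKE) or "." not in path.rsplit("/", 1)[-1]

def per_domain_quota(n_urls: int) -> int:
    """Logarithmic-scale downsampling: a domain with n URLs contributes ~log2(n)."""
    return max(1, int(math.log2(n_urls + 1)))

pages = [u for u, _ in sampled if likely_html(u)]

# Group the likely-HTML URLs by year of first capture, as in the study.
by_year = defaultdict(list)
for url, year in sampled:
    if likely_html(url):
        by_year[year].append(url)
```

The extension heuristic deliberately over-includes extensionless URLs (which are often dynamically generated HTML), matching the paper's choice to filter on filename extensions before confirming MIME types via the CDX API.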

TrendMachine: Temporal Resilience of Web Pages

Sawood Alam1, Mark Graham1, Kritika Garg2, Michele Weigle2, Michael Nelson2, Dietrich Ayala3

1Internet Archive, San Francisco, California - USA; 2Old Dominion University, Norfolk, Virginia - USA; 3Protocol Labs, San Francisco, California - USA

"How long does a web page last?" is commonly answered with "40 to 100 days", with sources dating back to the late 1990s. The web has since evolved from mostly static pages to dynamically-generated pages that heavily rely on client-side scripts and user-contributed content. Before we revisit this question, there are additional questions to explore. For example, is it fair to call a page that returns a 404 dead, versus one whose domain name no longer resolves? Is a web page alive if it returns content but has drifted away from its original topic? How should we assess the lifespan of pages from the perspective of fixity, across a spectrum ranging from content-addressable pages to tweets, home pages of news websites, weather report pages, push notifications, and streaming media? To quantify the resilience of a page, we developed a mathematical model that calculates a normalized score as time-series data based on the archived versions of the page. It uses sigmoid functions to increase or decrease the score slowly over the first few observations of the same class. The score changes significantly if the observations remain consistent over time, and there are tunable parameters for each class of observation (e.g., HTTP status codes, no archival activity, and content fixity). Our model has many potential applications, such as identifying points of interest in the TimeMap of densely archived web resources, identifying dead links (in wiki pages or any other website) that can be replaced with archived copies, and aggregated analysis of sections of large websites. We implemented an open-source interactive tool [1] powered by this model to analyze URIs against any CDX data source.
Our tool has yielded interesting insights on various sites, such as the day when "cs.odu.edu" was configured to redirect to "odu.edu/compsci", the two and a half years during which "example.com" redirected to "iana.org", the time when ODU’s website was down due to a cyber attack, or the year when Hampton Public Library’s domain name was drop-caught to host a fake NSFW store.

[1] https://github.com/internetarchive/trendmachine
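The sigmoid-based scoring idea can be illustrated with a toy model. This is not TrendMachine's actual formula; the update rule and parameter values below are invented for illustration of the general behaviour (slow movement on the first few observations of a class, decisive movement under sustained consistency).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def resilience_series(observations, rate=1.0):
    """Toy normalized score: consecutive same-class observations push the
    score toward 1 (for a '200' OK) or 0 (for errors), slowly at first."""
    scores, streak, prev = [], 0, None
    for obs in observations:
        streak = streak + 1 if obs == prev else 1
        prev = obs
        # Sigmoid over the streak length: early observations move the
        # score gently; sustained consistency moves it decisively.
        weight = sigmoid(rate * streak - 3)
        target = 1.0 if obs == "200" else 0.0
        last = scores[-1] if scores else 0.5
        scores.append(last + weight * (target - last))
    return scores

series = resilience_series(["200", "200", "200", "404", "404"])
```

Running on a capture history of three OK responses followed by two 404s, the score climbs while the page is consistently alive and then decays once errors accumulate, while always staying within the normalized [0, 1] range.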


A Gift to Another Age: Evaluating Virtual Machines for the Preservation of Video Games at MoMA

Kirk Mudle

New York University and the Museum of Modern Art in New York, United States of America

This preservation project investigates the use of virtual machines for the preservation of video games. From MoMA’s collection, Rand and Robyn Miller’s classic adventure game Myst (1993) is used as a sample record to evaluate the performance of three different virtualization options for the Mac OS 9 operating system: SheepShaver, QEMU, and Yale’s Emulation-as-a-Service Infrastructure (EaaSI). Serving as the control for the experiment, Myst is first documented running natively on an original PowerMac G4 at MoMA. The native performance is then compared with that of each virtualization software. Finally, a fully configured virtual machine is packaged as a single file and tested in different contemporary computing environments. More generally, this project clarifies the risks and challenges that arise when using virtual machines for the long-term preservation of computer and software-based art.

Experiences from archiving information from social media

Magdalena Sjödahl

Arkiwera, Sweden

Social media has enabled an expanded dialogue in the public space. News spreads faster than ever, politicians and leaders can communicate directly with great numbers of people, and anyone can become an influencer or start a public movement. For the archival institutions and governmental organisations active on different social media platforms, this creates new challenges and questions. As a consultancy firm working in digital preservation, we couldn’t just ask these questions without also finding some answers. This led us to begin developing the system that we today call Arkiwera.

With more than 10 years’ experience in web archiving, we started looking at solutions for preserving posts, including comments and reactions, from different social media platforms about 4-5 years ago. Since we couldn’t find any out-of-the-box solutions that we could simply adopt and refer our customers to, we started to develop our own. This has been a long and interesting journey with many lessons learned that we would love to share at the conference, connecting to several of the themes presented, e.g. Research, Tools, and Access.

Our lecture introduces the circumstances of the Swedish archival context and the choices we have made from archival, regulatory, and ethical perspectives when developing the archival platform Arkiwera, which is today used by a large number of organisations.


Empowering Bibliographers to Build Collections: The Browsertrix Cloud Pilot at Stanford Libraries

Quinn Dombrowski, Ed Summers, Laura Wrubel, Peter Chan

Stanford University, United States of America

The purview of subject-area librarians has expanded in the 21st century from primarily focusing on books and print subscriptions to a much larger set of materials, including digital subscription packages and data sets (distributed using a variety of media, for purchase or lease). Through this process, subject-area librarians are increasingly exposed to complex issues around copyright, license terms, and privacy/ethical concerns, where both norms and laws can vary significantly among different countries and communities. While it is nearly impossible for subject-area librarians in any field to treat “data” as outside the scope of their collecting efforts in 2022, the same does not hold true for web archives. Many libraries have at least some access to web archiving tools, although this access may primarily be in the hands of a limited number of users, sometimes associated with library technical services or special collections / university archives (e.g. for institutions whose focus of web archiving is primarily their own digital resources).

In late 2022, the web archiving task force at Stanford Libraries – a cross-functional team that brought together the web archivist, technical staff, and embedded digital humanities staff – set out to shift this dynamic by empowering disciplinary librarians to add web archiving to their toolkit for building the university’s collections. By partnering with Webrecorder, Stanford Libraries set up an instance of Browsertrix Cloud, and provided access to a pilot group of bibliographers and other subject-matter experts as part of a short-term pilot. The goals of this pilot were to see how, and how much, bibliographers would engage with web archiving for collection-building if given unfettered access to easy-to-use tools. What materials would they prioritize? What challenges would they encounter? What technical (e.g. storage) and support (e.g. training, debugging, community engagement) resources would be necessary for them to be successful? This pilot was also intended to inform the strategic direction for web archiving at Stanford moving forward.

In this talk, we will briefly present how we designed the pilot, hear perspectives from bibliographers who participated, and share the pilot outcomes and future directions.

What next? An update from SUCHO

Quinn Dombrowski1, Anna Kijas2, Sebastian Majstorovic3, Ed Summers1, Andreas Segerberg4

1Stanford University, United States of America; 2Tufts University, United States of America; 3Austrian Center for Digital Humanities and Cultural Heritage, Austria; 4University of Gothenburg, Sweden

Saving Ukrainian Cultural Heritage Online (SUCHO) made headlines as an international, volunteer-run initiative archiving Ukrainian cultural heritage websites in the wake of Russia’s invasion in February 2022. Through SUCHO, over 1,500 volunteers around the world – from technologists and librarians to retirees and children – were involved in a large-scale, rapid-response web archiving effort that developed a collection of over 5,000 websites and 50 TB of data. As a non-institutional project with the primary goal of digital repatriation, creating this collection and ensuring its security through a network of mirrors was not enough. The motivation for SUCHO was not to create a permanent archive of Ukraine that could be used as research data for scholars as the country was destroyed; instead, the hope was to hold onto the data only until the cultural heritage sector in Ukraine was ready to rebuild.

The initial web archiving phase of SUCHO’s work happened between March and August 2022. The archives came from a variety of sources: created on volunteers’ laptops using the command-line Browsertrix software, crawled with Browsertrix Cloud, or even uploaded as individual, highly interactive page archives made with the Browsertrix Chrome plugin. In addition, while the project mostly worked from a single list of sites, the work was done in haste, and status metadata (e.g. “in progress”, “done”, “problem”) was not always accurately documented. Furthermore, while the project had full DNS records for these sites, that metadata was stored separately from the spreadsheet – as was information about site uptime and downtime over the course of the project. Creating the web archives was challenging, but it quickly became apparent that the bigger challenge would be curation.

This talk will follow up on our 2022 IIPC presentation on SUCHO, confronting the question of “What next?” for SUCHO. It will bring together a number of volunteers to discuss different facets of this curation process, including reuniting archives with different kinds of metadata, our efforts in extracting data from the archives that could be used as the foundation for rebuilding websites, and other work to curate and present what our volunteer community accomplished.


Time: Wednesday, 03/May/2023: 7:25pm - 7:55pm  ·  Virtual location: Online

Querying Queer Web Archives

Di Yoong1, Filipa Calado1, Corey Clawson2

1The Graduate Center, CUNY, USA; 2Rutgers University, USA

Our paper explores the intersections of querying and queerness as they interact with and are informed by web spaces and their development across time. Working with hundreds of gigabytes of web archival records on queer and queer-ish online spaces, we are developing new methods for search and discovery, as well as for the ethical access and use of web archives. This paper reflects on our process of pursuing methodologies that accommodate diverse perspectives for querying web-based datasets and embrace the qualities of play and pliancy to respond to a host of research questions and investments.

For example, one central concern is ethical methods for cleaning web archival data to maintain privacy and anonymity. While queer spaces have historically existed in the margins, confidential information is easily shared and retained in the process of collecting data. Given that we are looking into queer spaces across 30 or so years, our ethical considerations for privacy and anonymity are twofold: first, the sense of anonymity online has shifted since the early days of the internet; and second, there are questions around the uses of collected sites in repositories. For example, in 1995 only 0.4% of the world’s population had access to the internet (Mendel, 2012), compared to 60% in 2020 (The World Bank, n.d.). The sense of anonymity in a smaller internet community meant that users were likely to share more private information than they might today. Our research therefore has to consider how to remove private information at scale, using tools such as bulk_extractor (Garfinkel, 2013) and bulk_reviewer (Walsh & Baggett, 2019). In addition, we also work with repositories of archived websites whose original collection was obtained through informed consent. This means that while we may have the ability to access the collection, ethical secondary use requires additional consideration. Given the small size of the collection, we have been able to reach out to the original creators, but this approach will need to be reconsidered for larger collections.
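As a much-simplified stand-in for the kind of feature scanning that bulk_extractor performs (an illustrative sketch only; real PII scanning covers many more feature types and works over raw disk images and WARC files rather than plain strings), redacting email addresses from extracted text might look like:

```python
import re

# Invented sample text standing in for content extracted from an archived page.
text = "Contact me at jane.doe@example.net or on my guestbook page."

# A deliberately simple email pattern; production scanners are more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(s: str) -> str:
    """Replace email addresses with a placeholder before secondary use."""
    return EMAIL.sub("[email redacted]", s)

clean = redact_emails(text)
```

In practice this regex pass would be one of several layered steps, with tools like bulk_reviewer used to triage matches before anything is released for reuse.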

Beyond the Affidavit: Towards Better Standards for Web Archive Evidence

Nicholas Taylor


The Internet Archive’s (IA) standard legal affidavit is used in litigation both frequently and reliably for the authentication and admission of evidence from the IA Wayback Machine (IAWM). While the affidavit has enabled the regular and relatively confident application of IAWM evidence by the legal community, that community’s understanding of the contingencies of web archives - including qualifications to which the affidavit itself calls attention - is limited.

The tendency to conflate IA's attestation as to the authenticity of IAWM /records/ with the authenticity of /historical webpages/ will eventually have material consequences in litigation, which we may reasonably suppose will undermine confidence in the trustworthiness of web archives generally, and to a greater extent than is likely merited. The ever-increasing complexity of the web and the unfortunately growing investment in disinformation only increase the probability that this will happen sooner rather than later.

In response to the looming (or present, but as yet undiscovered) threat to the current IA affidavit-favored regime for authentication of IAWM evidence, the web archiving community would do well to champion better, more institutionally-agnostic standards for evaluating and affirming the authenticity of archived web content. Some modest efforts have been made on this front, and there are a few places we can consult for tacitly indicated frameworks. Collectively, these include judicial precedents, e-discovery community guidance, and the marketing of services by commercial archiving companies. I would argue that these do not get us far enough, though.

To that end, I would like to elaborate a more expansive set of criteria that could serve as a basis for the authenticity of web archives for evidentiary purposes. Some of these traits are foundational to web archiving in the main, and help to distinguish web archives from other forms of web content capture. Some reflect the affordances of our standards and tools that we as a community already have in place. Some reflect under-addressed technical challenges, for which continued investment in mitigation will be necessary to maintain the trustworthiness of our archives for legal use. Together, they may better provide for the sustained and trustworthy use of web archives for evidentiary purposes.


Browser-Based Crawling For All: The Story So Far

Anders Klindt Myrvoll1, Andrew Jackson2, Ben O'Brien3, Sholto Duncan3, Ilya Kreymer4, Lauren Ko5, Jasmine Mulliken6, Andreas Predikaka7, Antares Reich7

1Royal Danish Library; 2The British Library, United Kingdom; 3National Library of New Zealand | Te Puna Mātauranga o Aotearoa; 4Webrecorder; 5University of North Texas, United States of America; 6Stanford University, United States of America; 7Austrian National Library

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can use these tools.

This online panel will provide an update on the project, emphasizing the experiences of IIPC members who have been experimenting with the tools. Three IIPC members who have been exploring Browsertrix Cloud in detail will present their experiences so far: what works well, what works less well, how the development process has been, and what the longer-term issues might be. The Q&A session will be used to explore the issues raised and encourage wider engagement and feedback from IIPC members.

Project Update: Anders Klindt Myrvoll & Ilya Kreymer

Anders will present an update from the project leads on what has been achieved since we started the project and what the next steps are. We will look at the broad picture as well as the goals, outcomes, and deliverables as described in the IIPC project description: https://netpreserve.org/projects/browser-based-crawling/

On behalf of Webrecorder, Ilya will outline the wider context, give an update on the status of the project, and include any immediate feedback from the workshop session.

User experience 1 (NZ) Sholto Duncan

Testing Browsertrix Cloud at NLNZ

In recent years the selective web harvesting programme at the National Library of New Zealand has broadened its crawling tools of choice in order to use the best tool for the job: from primarily using Heritrix, through WCT, to now also regularly crawling with Webrecorder and Archive-It. This has allowed us to get the best capture possible, but we still fall short in harvesting some of the richer, more dynamic, modern websites that are becoming commonplace.

Other areas within the Library that often use web archiving processes for capturing web content have seen this same need for improved crawling tools. This has provided a range of users and diverse use cases for our Browsertrix Cloud testing. During this presentation we will cover our user experience during this testing.

User experience 2 (UNT) Lauren Ko

Improving the Web Archive Experience

With a focus on collecting the expiring websites of defunct federal government commissions, carrying out biannual crawls of its own subdomains, and participating in event-based crawling projects, UNT Libraries has mostly carried out harvesting with Heritrix since 2005. However, in recent years, attempts to better archive increasingly challenging websites and social media have led to supplementing this crawling with a more manual approach using pywb's record mode. Now hosting an instance of Browsertrix Cloud, UNT Libraries hopes to reduce the time spent archiving such content that requires browser-based crawling. Additionally, the libraries expect the friendlier user interface Browsertrix Cloud provides to facilitate its use by more staff in the library, as a teaching tool in a web archiving course in the College of Information, and in a project collaborating with external contributors.

User experience 3 (Stanford) Jasmine Mulliken

Crawling the Complex

Web-based digital scholarship, like the kind produced under Stanford University Press’s Mellon-funded digital publishing initiative (http://supdigital.org), is especially resistant to standard web archiving. Scholars choosing to publish outside the bounds of the print book are finding it challenging to defend their innovatively formatted scholarly research outputs to tenure committees, for example, because of the perceived ephemerality of web-based content. SUP is supporting such scholars by providing a pathway to publication that also ensures the longevity of their work in the scholarly record. This is in part achieved by SUP’s partnership with Webrecorder (https://blog.supdigital.org/sup-webrecorder-partnership/), which has now, using Browsertrix Cloud, produced web-archived versions of all eleven of SUP’s complex, interactive, monograph-length scholarly projects (https://archive.supdigital.org/). These archived publications represent an important use case for Browsertrix Cloud that speaks to the needs of creators of web content who rely on web archiving tools as an added measure of value for the work they are contributing to the evolving innovative shape of the scholarly record.

User experience 4 (Austrian National Library) Andreas Predikaka & Antares Reich

Integrating Browsertrix

Since the beginning of its web archiving project in 2008, the Austrian National Library has been using the Heritrix crawler integrated into NetarchiveSuite. For many websites in daily crawls, the use of Heritrix is no longer sufficient, and it is necessary to improve the quality of our crawls. Tests quickly showed that Browsertrix does a very good job of fulfilling this requirement. But for us it is also important that the results of Browsertrix crawls are integrated into our overall working process. By using the Browsertrix API, it was possible to create a proof of concept of the necessary steps for this use case.
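For readers unfamiliar with the tooling, a browser-based crawl in Browsertrix Crawler is driven by a declarative configuration. The fragment below is a minimal, illustrative sketch; field names and values should be checked against Webrecorder's current documentation.

```yaml
# Minimal, illustrative crawl configuration for browsertrix-crawler.
seeds:
  - url: https://example.com/
    scopeType: prefix    # stay within this URL path prefix
generateWACZ: true       # package the finished crawl as a WACZ file
```

Because the crawler exposes its behaviour through configuration and an API rather than a GUI alone, its outputs can be scripted into an institution's existing ingest workflow, which is the integration path described above.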