SESSIONS
- SES-01: TOOLS: UNDER CONSTRUCTION: LESSONS LEARNED (NATIONAL LIBRARY PERSPECTIVE)
- SES-02: CRAWLING TOOLS
- SES-03: ADVOCACY & USER ENGAGEMENT
- SES-04: DISCOVERY & ACCESS (NEWS/NEWSPAPERS)
- SES-05: SUSTAINABILITY
- SES-06: CURATION: SOCIAL MEDIA
- SES-07: RESEARCH & ACCESS
- SES-08: HANDLING WHAT YOU CAPTURED
PANELS
- PAN-01: ENGAGING AUDIENCES
- PAN-02: CROSS-INSTITUTIONAL COLLABORATIONS
- PAN-03: CROSS-INSTITUTIONAL COLLABORATION: THE END OF TERM ARCHIVE
WORKSHOPS
- WS-01: EXPLORING DILEMMAS IN THE ARCHIVING OF LEGACY WEBPORTALS: AN EXERCISE IN REFLECTIVE QUESTIONING
- WS-02: WEB ARCHIVE COLLECTIONS AS DATA
- WS-03: INTRODUCTION TO WEB GRAPHS
- WS-04: HOW TO DEVELOP A NEW BROWSERTRIX BEHAVIOR
LIGHTNING TALKS
- LT-01
- STRATEGIES AND CHALLENGES IN THE PRESERVATION OF MEXICO’S WEB HERITAGE: FIRST STEPS
- CHALLENGES AND STRATEGIES IN IMPLEMENTING WEB ARCHIVING LEGISLATION IN BRAZIL
- ARQUIVO.PT TOOLKIT FOR WEB ARCHIVING
- TRACKING THE POLITICAL REPRESENTATIONS OF LIFE: METHODOLOGICAL CHALLENGES OF EXPLORING THE BNF WEB ARCHIVES
- COLLABORATIVE CURATORIAL APPROACHES OF THE CZECH WEB ARCHIVE USING THE EXAMPLE OF THEMATIC LITERARY COLLECTIONS
- LT-02
- MODELLING ARCHIVED WEB OBJECTS AS SEMANTIC ENTITIES TO MANAGE CONTEXTUAL AND VERSIONING ISSUES
- MODERNIZING WEB ARCHIVES: THE BUMPY ROAD TOWARDS A GENERAL ARC2WARC CONVERSION TOOL
- POKING AROUND IN PODCAST PRESERVATION
- AUTOMATIC CLUSTERING OF DOMAINS BY INDUSTRY FOR EFFECTIVE CURATION
- BEST PRACTICE OF PRESERVING POSTS FROM SOCIAL MEDIA FEEDS
- LT-03
- CHANGE DETECTION OVER A LARGE NUMBER OF URLS
- THE PRACTICE OF WEB ARCHIVING STATISTICS AND QUALITY EVALUATION BASED ON THE LOCALIZATION OF ISO/TR 14873:2013(E): A CASE STUDY OF THE NSL-WEBARCHIVE PLATFORM
- ARQUIVO.PT QUERY LOGS
- MODIFYING EPADD FOR ENTITY EXTRACTION IN NON-ENGLISH LANGUAGES
- LT-04
- COLLABORATIVE COLLECTIONS AT ARQUIVO.PT: FOUR YEARS OF RECORDINGS FROM THE CITY OF SINES (PORTUGAL)
- PARTICIPATORY WEB ARCHIVING: THE TENSIONS BETWEEN THE INSTRUMENTAL BENEFITS AND DEMOCRATIC VALUE
- A MINIMAL COMPUTING APPROACH FOR WEB ARCHIVE RESEARCH
- WHERE FASHION MEETS SCIENCE: COLLECTING AND CURATING A CREATIVE WEB ARCHIVE
- WHAT YOU SEE NO ONE SAW
POSTERS
- POSTER SESSION
- ARQUIVO.PT API/BULK ACCESS AND ITS USAGE
- ‘WE ARE NOW ENTERING THE PRE-ELECTION PERIOD’: EXPERIMENTAL TWITTER CAPTURE AT THE NATIONAL ARCHIVES
- THE BNF DATALAB SERVICES AND TOOLS FOR RESEARCHERS WORKING ON WEB ARCHIVES
- EXPERIENCES SWITCHING AN ARCHIVING WEB CRAWLER TO SUPPORT HTTP/2
- WEB SCRAPING IN THE HUNGARIAN WEB ARCHIVE
- POLITELY DOWNLOADING MILLIONS OF WARC FILES WITHOUT BURNING THE SERVERS DOWN
- NEXT STEPS TOWARDS A FORMAL REGISTRY OF WEB ARCHIVES FOR PERSISTENT AND SUSTAINABLE IDENTIFICATION
- USING WEB ARCHIVES TO CONSTRUCT THE HISTORY OF AN ACADEMIC FIELD
- ARQUIVO.PT ANNUAL AWARDS: A GLIMPSE
- ASYNCHRONOUS AND MODULAR PIPELINES FOR FAST WARC ANNOTATION
- CONSORTIUM ON ELECTRONIC LITERATURE (CELL)
- DESIGNING ART STUDENT WEB ARCHIVES
- FAILED CAPTURE OR PLAYBACK WOES? A CASE STUDY IN HIGHLY INTERACTIVE WEB BASED EXPERIENCES
- FROM NEW MEDIA ARCHIVES ON SOCIAL MEDIA PLATFORMS TO WEB ARCHIVES - CHALLENGES IN PRESERVING SCRAPED CULTURAL MATERIALS
- HAWATHON: PARTICIPANTS EXPERIENCE
- IMPLEMENTING THE E-ARK STANDARD FOR INGEST OF SOCIAL MEDIA ARCHIVES: GOALS, OPPORTUNITIES AND CHALLENGES
- PLANNING WEB ARCHIVING WITHIN A FOUR-YEAR SCOPE: MAKING THE NEW COLLECTION PLAN FOR THE YEARS 2025-2028 IN THE NATIONAL LIBRARY OF FINLAND
- REDIRECTS UNRAVELED: FROM LOST LINKS TO RICKROLLS
- ROBOTS.TXT AND CRAWLER POLITENESS IN THE AGE OF GENERATIVE AI
- SOLVING THE PROBLEM OF REFERENCE ROT VIA WEB ARCHIVING: AN OA PUBLISHER’S SOLUTION & FUTURE SOLUTIONS IN THOTH
- USE OF SCREENSHOTS AS A HARVESTING TOOL FOR DYNAMIC CONTENT AND USE OF AI FOR LATER DATA ANALYSIS
- ADVANCING PARTICIPATORY DEMOCRACY THROUGH WEB ARCHIVING: THE KRIA ICELANDIC CONSTITUTION ARCHIVE
SESSIONS
SESSION 01: TOOLS: UNDER CONSTRUCTION: LESSONS LEARNED (NATIONAL LIBRARY PERSPECTIVE)
Embedding the Web Archive in an Overall Preservation System
Hansueli Locher
Swiss National Library, Switzerland
The Swiss National Library (SNL) is building a new digital long-term archive that will go live in spring 2025. This system is designed as an overall system that covers all the processes involved in handling the digital objects of all the SNL's collections, including the web archive. This starts with the delivery of the objects by producers or the collection of the objects by the SNL itself, includes the preparation for archiving and cataloguing, administration and preservation, and ends with the provision to users.
The first part of the presentation will describe the architecture and functionality of the overall system, which consists of three different areas and uses a mixture of standard components and individual developments.
- A modular pre-ingest area provides so-called processing channels for different types of collection objects. With the help of said channels the objects and their metadata are prepared in such a way that they can be transferred to the ingest process of the digital archive.
- The Digital Archive contains the core system for managing and archiving digital collection objects. It also provides risk and preservation management functionality.
- An access system allows users to access the digital collections. It provides a full-text search, access control and server-based viewers for the most common data formats. In addition, selected parts of the collection can be presented to users in a curated form via so-called showcases.
The second part of the presentation will show how the Swiss Web Archive and its specific processes have been integrated into the overall system. Special precautions had to be taken particularly in the Pre-Ingest and Access areas.
In Pre-Ingest, a distinct processing channel was created for the web archive. This makes it possible to register the websites for collection (and automated periodic snapshots), collect them, check their quality and improve it if necessary, and ensure that they are virus-free.
Access makes the web archive accessible via a full-text search, for which special precautions had to be taken when generating the hit lists. Otherwise, the hits from the other collections would be lost among the numerous hits from the web archive. In addition, one of the showcases will provide an unexpected approach to the web archive.
The presentation will conclude by addressing some of the specific challenges of integrating the web archive into an overall preservation system and the lessons learnt.
UKWA Rebuild
Gil Hoggarth
British Library, United Kingdom
The British Library suffered a major service outage following a cyber-attack on all technical systems in late October 2023. What followed was a complete rebuild of all services with security baked in. This short presentation provides an overview of how the UK Web Archive was affected, how the new operational technology landscape of the British Library changed, and describes the work being undertaken to return UKWA as a public service and to begin crawling again from on-premise servers. It will also describe how the internal systems of UKWA are changing to meet the new infrastructure and policies.*
The challenges faced should be important to all web archiving institutions. The necessary changes made by the British Library to ensure the new services are secure by design will have a major impact on the UK Web Archive systems, but these could be challenges and changes imposed on any web archive. The size of the UK Web Archive, approaching 2 PiB and an estimated 18 billion files, also creates challenges of its own that will be familiar to many web archives - the redesign of UKWA includes distant storage and aims to establish shared functions and resources across the Legal Deposit Libraries in the future.
Ways of discovering content within the UK Web Archive have been significantly reduced by the cyber-attack. Previously, a full text search service was available using Apache Solr. However, the return of a 'discovery service' has been delayed by the necessity of rebuilding all systems from scratch. The future planning for a discovery service, and a user service, will also be outlined in the presentation.
* As of mid-August 2024, no technology infrastructure or systems have been released for the UKWA rebuild work. Consequently, the content of this presentation may change from this paper submission and the conference date.
Under Construction: Web Archive of the German National Library
Natanael Arndt
German National Library, Germany
Our institution has been running a web archive since 2012, in cooperation with an external contractor and on closed-source software. Most recently we have started the shift towards an in-house open source web archiving system that shall be integrated with the overall data management infrastructure of our institution. During a first migration process, the whole setup was moved in-house. The migration allowed us to gain some control over the operation, while development and support are still performed by the contractor.
In our experience over the last decade, we have identified a number of limitations with the current web archive setup: the crawling capacity is limited to a maximum of 12,000 snapshots per annum, the non-modular system complicates the implementation of new requirements, and we cannot directly benefit from the progress of the thriving open source web archiving community in regard to new features and the implementation of web archiving standards. In parallel to the web archiving activities, our institution has developed an overarching data management infrastructure for the acquisition, digital preservation, and provisioning of electronic resources, such as e-books, e-journals, and most recently audio files. To gain increased maintainability, flexibility and control over the web archiving activity, our aim is to implement a new system in-house, to integrate it with the well-established in-house workflows for electronic resources, and to align it with and base it on the current open source state of the art and the standards of the web archiving community.
During the presentation we take you on the journey of our institution towards the implementation of an in-house, open source web archive. We try to answer the questions: How do we understand the environment? How do we assemble our team? Where do we want to go? How do we decide which paths to take? Which gear do we need? And finally, what are our lessons learned?
SESSION 02: CRAWLING TOOLS
Lessons Learned Building a Crawler From Scratch: The Development and Implementation of Veidemann
Marius André Elsfjordstrand Beck
National Library of Norway, Norway
Over the past two decades, web content has become increasingly dynamic. While a long-standing harvesting technology like Heritrix effectively captures static web content, it has significant limitations in capturing dynamic content and discovering the links within it. In response, the web archive at the National Library of Norway set out in 2015 to develop a new browser-based web crawler.
This talk will present our experiences and lessons learned from building Veidemann. There are so many factors to consider when building a tool from scratch, and we will try to outline some of the decisions we were faced with during the process, unexpected issues and how we are addressing them.
The talk will present:
- A high-level view of the design of Veidemann and the factors that influenced it
- How Veidemann compares to similar projects
- The pros and cons of using a container-based platform
- The main issues with the current implementation and possible solutions to them
- Unexpected results
- An idea for a different paradigm in the design of such a system
The full cost/benefit analysis of taking on a project of this size and scale is, by the nature of the work, not fully knowable at the start. After nearly a decade in the making, the story of Veidemann is one of pride, hope, hardship and lessons learned. While it is still being used in production at our institution, harvesting roughly 1 TB of deduplicated content per week, other similar tools, such as Browsertrix, have distinct advantages in their approach. While the future of Veidemann is uncertain, we would love to share what we have learned so far with the broader community.
Experiences of Using in-House Developed Collecting Tool ELK
Lauri Ojanen
National Library of Finland, Finland
ELK (an acronym for Elonleikkuukone, which means harvester in Finnish) is a tool built by the National Library of Finland's Legal Deposit Services to aid in collecting, managing, and harvesting online materials for the web archive. Legal Deposit Services started to use ELK in 2018, and since then we have updated ELK several times to better suit the needs of collectors and harvesters of web materials.
Features of ELK include a back catalogue of former thematic web harvests, covering the web materials (also known as seeds), their cataloguing information and keywords, as well as tools to manage thematic web harvests currently in progress. The features have been developed in collaboration between the collectors and the developers who also work on harvesting the web materials. The aim is to create a tool where collectors can easily categorize different web materials, leave notes on how to harvest different materials, and keep track of what has and has not been collected. Collectors can also harvest single web pages themselves for quality control, to make sure that pages with dynamic elements can be viewed in the web archive as they were meant to be.
ELK is also used as a documenting platform. The easiest way to see the curatorial choices, keywords and history of the thematic web harvests is to gather them in one platform. When that platform is used for everything related to web archiving, we can easily see what themes have been harvested, what sort of materials were collected previously, and in the best cases the curatorial decisions that were made in those harvests.
By sharing our experiences with an in-house developed tool for collecting web materials, we can help other libraries in their efforts: what the advantages and disadvantages are in curating and managing our web collections, and where we would like to see our collections go in the future, now that we have used the tool for a while.
Better Together: Building a Scalable Multi-Crawler Web Harvesting Toolkit
Alex Dempsey, Adam Miller, Kyrie Whitsett
Internet Archive, United States of America
The web is nearly as infinite in its expanse as it is in its diversity. As its volume and complexity continue to grow, high-quality, efficient, and scalable web harvesting methods are more essential than ever. The numerous and varied challenges of web archiving are well known to this community, so it's not surprising there isn't one tool that can perfectly harvest it all. But through open source software collaboration we can build a scalable toolkit to meet some of these challenges.
In the presentation, we will outline some of the many lessons and best practices our institution has learned from the challenges, requirements, research, and practical experience from collaborating with other memory institutions for over 25 years to meet the harvesting needs of the preservation community.
To demonstrate how some of those challenges can be overcome, we will then discuss a fictional large-scale domain harvest use case presenting common issues. With each new challenge encountered, we will introduce concepts in web harvesting while demonstrating approaches to solve them. Sometimes the best approach is a configuration option in Heritrix, and sometimes it means including another open source tool to incrementally improve the quality and scale of the campaign. Nothing is perfect, so we'll also cover some things to consider when deciding to employ an additional tool.
Some of the challenges we’ll address are:
- Scaling crawls to multiple machines
- How to avoid accidental crawler traps (see the sketch after this list)
- Efficiently layering-in browser assisted web crawling
- Handling rich media like video and PDFs
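As an illustration of the crawler-trap point above, the following is a minimal, generic heuristic sketch; it is not taken from Heritrix or any particular crawler, and the thresholds and example URLs are assumptions chosen purely for demonstration.

```python
from collections import Counter
from urllib.parse import urlparse

def looks_like_trap(url, max_depth=12, max_repeats=3, max_query_params=15):
    """Heuristically flag URLs that resemble common crawler traps:
    endlessly nested paths, repeated path segments (e.g. calendar loops),
    or exploding query strings. Thresholds are illustrative assumptions."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) > max_depth:
        return True
    if segments and max(Counter(segments).values()) > max_repeats:
        return True
    if parsed.query and parsed.query.count("&") + 1 > max_query_params:
        return True
    return False

# A calendar-style loop repeating the same segments is flagged; a normal article URL is not.
print(looks_like_trap("https://example.org/cal/2025/cal/2025/cal/2025/cal/2025"))  # True
print(looks_like_trap("https://example.org/news/2025/03/article-title"))           # False
```

In practice, crawlers such as Heritrix express this kind of scoping through their own decide rules and exclusion filters; the heuristic above only conveys the underlying idea.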
Heritrix makes a great base for large-scale web crawling, and many in the IIPC community already use it for their web harvests. The presentation will demonstrate tools that complement Heritrix, and should be easy to try as an add-on to a reliable implementation, but the concepts—and often the tools themselves—are web crawler agnostic.
The presentation is geared to a wide range of experience. Anyone who is curious about what it takes to run a large web harvest will leave with a better understanding, and experienced practitioners will acquire insights into some technical improvements and strategies for improving their own harvesting infrastructures.
Lowering Barriers to Use, Crawling, and Curation: Recent Browsertrix Developments
Tessa Walsh, Ilya Kreymer
Webrecorder, United States of America
As the web continues to evolve and web archiving programs develop in their practices and face new challenges, so too must the tools that support web archiving continue to develop alongside them. This talk will provide updates on new features and changes in Browsertrix since last year’s conference that enable web archiving practitioners to capture, curate, and replay important web content better than ever before.
One key new feature that will be discussed is crawling through proxies. Browsertrix now supports the ability to crawl through SOCKS5 proxies which can be located anywhere in the world, regardless of where Browsertrix itself is deployed. With this feature, it is possible for users to crawl sites from an IP address located in a particular country or even from an institutional IP range, setting crawl workflows to use different proxies as desired. This feature allows web archiving programs to satisfy geolocation requirements for crawling while still taking advantage of the benefits of using cloud-hosted Browsertrix. Proxies may also have other concrete use cases for web archivists, including avoiding anti-crawling measures and being able to provide a static IP address for crawling to publishers.
Similarly, the presentation will discuss changes made that enable users of Browsertrix to configure and use their own S3 buckets for storage. Like proxies, this feature lowers the barriers to using cloud-hosted Browsertrix by enabling institutions to use their own storage infrastructure and meet data jurisdiction requirements without needing to deploy and maintain a self-hosted local instance of Browsertrix.
Other developments will also be discussed, such as improvements to collection features in Browsertrix which better enable web archiving practitioners to curate and share their archives with end users, user interface improvements which make it easier for anyone to get started with web archiving, and improvements to Browsertrix Crawler to ensure websites are crawled at their fullest possible fidelity.
SESSION 03: ADVOCACY & USER ENGAGEMENT
Insufficiency of Human-Centric Ethical Guidelines in the Age of AI: Considering Implications of Making Legacy Web Content Openly Accessible
Gaja Zornada, Boštjan Špetič
Computer History Museum Slovenia (Računalniški muzej), Slovenia
While the preservation of web history is crucial for maintaining a cultural and informational record of our age, reconstructing and resurfacing legacy content without appropriate context presents new ethical concerns.
Legacy content may be misleading to users when consumed in isolation, as it often reflects outdated norms, technologies, and information that are no longer relevant. Moreover, individuals featured in such content may be unfairly subjected to scrutiny based on past actions or statements that, in today's context, could harm their personal or professional reputation. The consequences of resurfacing this content without adequate contextualization are amplified when AI technologies are involved.
AI’s ability to synthesize and amplify such data across platforms can create a ripple effect, where even content that does not explicitly reveal personal information can still have far-reaching consequences. By connecting disparate data points, AI may draw conclusions or inferences about individuals, influencing public perception and potentially affecting career prospects or even legal outcomes. Unlike a human reader, who can contextually infer that a piece of reconstructed online content is part of a legacy web segment intended to be presented as a historical monument to the online world of times past, AI will not be able to distinguish such content from contemporary sources and will misweight it in its analysis. The ethical challenge here lies not just in the publication of legacy content and archival access, but in AI’s ability to endlessly circulate and reinterpret it in ways that were never intended by the original authors.
This proposal explores the delicate balance between the preservation of historical digital records and respecting individuals' right to be forgotten (RTBF) in the age of AI. It seeks to question how AI-powered tools reshaping the reading and presentation of web archives challenge existing ethical norms. By examining potential frameworks for responsible digital archiving, the proposal aims to identify solutions that mitigate the risks posed by AI-driven resurfacing of legacy content in the public domain.
Web Archives for Music Research
Andreas Lenander Ægidius
Royal Danish Library, Denmark
The Royal Danish Library has set a strategic goal to make more of its cultural heritage materials accessible and engaging for researchers by 2027. In this paper, we present findings from an advocacy initiative targeted at researchers at national universities in music-related fields. The national web archive provides primary sources and contextual information relevant to music researchers as they engage with our music collections. However, there is room for improvement in the connection between these collections and our understanding of user needs.
Reports by Healy et al. (2022) and Healy & Byrne (2023) explore the challenges researchers face when using web archives, highlighting the ongoing need to examine the skills, tools, and methods associated with web archiving. Additionally, the sounds of the web—from MIDI to streaming—are an integral part of its history, yet this aspect is often overlooked by tools like the Internet Archive's Wayback Machine (Morris, 2019).
Through semi-structured interviews with fellow curators and music researchers at universities, we identify current barriers to access and user requirements for improved utilization of web archival resources. Our advocacy initiative also allows us to summarize current research trends as feedback for web curators. In conclusion, we describe how the web curators processed our findings into suggestions for updates and refinements to web crawling strategies and the built-in tools in the SolrWayback installation.
References
Healy, S., & Byrne, H. (2023). Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s) (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_Byrne_Scholarly_Use_01.pdf
Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research (WARCnet Special Reports). Aarhus University. https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf
Morris, J. W. (2019). Hearing the Past: The Sonic Web from MIDI to Music Streaming. In N. Brügger & I. Milligan (Eds.), The SAGE Handbook of Web History (pp. 491–510). Sage.
IXP History Collection: Recording the Early Development of the Core of the Public Internet
Sharon Healy1, Gerard Best1, Lara Díaz Martínez2
1: Independent Researcher, Ireland; 2: University of Barcelona, Spain
The IXP History Collection is an ongoing project which seeks to record and document histories of the Internet exchange points (IXPs) which form the core of the Internet’s topology. An IXP is the point at which Internet Service Providers and Content Delivery Networks connect and exchange data with each other (“peering”). IXPs form the topological core of the Internet backbone, their histories are inextricably linked to the commercialization of the Internet, and their development is a significant milestone in the global history of media and communications. Efforts should therefore be made to ensure that we preserve IXP histories for future generations.
The main purpose of the project is to collect and preserve networking and IXP histories due to valid concerns that these histories will be lost from the global record unless attempts are made to start preserving them now. In particular, the project is concerned with the fragility of electronic information and born digital documents, records, and multimedia, otherwise known as born digital heritage. As a starting point, the project utilizes the Internet Exchange Directory which is maintained by Packet Clearing House, an intergovernmental treaty organization responsible for providing operational support and security to critical Internet infrastructure, including Internet exchange points. The PCH IX Directory is one of the earliest organized efforts to develop and maintain a database for recording and tracking the establishment, development and global growth of IXPs.
The project then focuses on documenting IXP histories through as many online sources as possible (e.g., websites/pages, reports, journals, magazine/newspaper articles, old emails on public mailing lists). The project relies on the use of web archives as a research tool for tracing IXP histories, as well as a preservation tool, using the Save Page functions in the Wayback Machine and Arquivo.pt.
In this presentation we discuss our approach and methodology for developing the collection and making it available online as a reference resource, and we offer an overview of the importance of using web archives for documenting and preserving Internet and IXP histories. By presenting our approach, we hope to offer a case study that demonstrates how web archive research can be integrated with traditional research methods (Healy et al., 2022), and promote more widespread use of web archives as research tools for historical inquiry, and the long-term preservation of digital research (Byrne et al., 2024).
Resources:
- Arquivo.pt: https://arquivo.pt/
- IXP History Collection - Information Directory | Zotero: https://www.zotero.org/groups/4944209/ixp_history_collection_-_information_directory/library
- Packet Clearing House, Internet Exchange Directory: https://www.pch.net/ixp/dir
- Wayback Machine: https://web.archive.org/
References:
Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M. and Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report, Aarhus, Denmark: https://web.archive.org/web/20221003215455/https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf
Byrne, H., Boté-Vericad, J-J, and Healy, S. (2024) Exploring Skills and Training Requirements for the Web Archiving Community. In: Aasman, S., Ben-David, A., and Brügger, N., eds. The Routledge Companion to Transnational Web Archive Studies. Routledge.
Lost, but Preserved - A Web Archiving Perspective on the Ephemeral Web
Sawood Alam, Rachel Auslander, Mark Graham
Internet Archive, United States of America
The World Wide Web, our era's most dynamic information ecosystem, is characterized by its transient nature. Recent studies have highlighted the alarming rate at which web content disappears or changes, a phenomenon known as "link-rot". A 2024 Pew Research Center study revealed that 38% of webpages from 2013 were inaccessible a decade later. Even more striking, Ahrefs, an SEO company, reported that at least 66.5% of links to sites created in the last nine years are now dead. These findings echo earlier research by Zittrain et al., which uncovered significant link-rot in journalistic references from New York Times articles.
While these statistics paint a grim picture of digital impermanence, they often overlook a crucial factor: the role of web archives. This talk aims to reframe the link-rot discussion by considering the preservation efforts of various web archiving institutions.
Our research revisiting the Pew dataset yielded a surprising discovery: only one in nine URLs from the original study were truly missing; the remaining bulk had at least one capture in a web archive. This finding suggests that the digital landscape, when viewed through the lens of web archiving, may be less ephemeral than commonly perceived.
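To make this kind of reexamination concrete, here is a minimal sketch of how one might check whether a URL has at least one capture, using the Wayback Machine's public availability API; the URLs are placeholders, and the actual study considered captures across multiple web archives rather than this single endpoint.

```python
import json
import time
import urllib.parse
import urllib.request

def closest_capture(url):
    """Return the closest Wayback Machine snapshot URL for `url`, or None if
    the availability API reports no capture."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api, timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Placeholder URLs standing in for entries from a link-rot dataset.
for url in ["http://example.com/some-old-page", "https://example.org/"]:
    print(url, "->", closest_capture(url) or "no capture found")
    time.sleep(1)  # stay polite to the API
```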
Key points we will explore:
- The state of link-rot: We will review recent studies and their methodologies, discussing the implications of their findings for digital scholarship, journalism, and information access.
- Web archives as digital preservationists: We will introduce major web archiving initiatives and explain their crucial role in maintaining the continuity of online information.
- Reassessing link rot with archives in mind: We will present our methodology and findings from reexamining the Pew dataset, demonstrating how web archives mitigate content loss.
- Challenges and limitations of web archiving: Despite their importance, web archives face significant technical, legal, and resource constraints. We will discuss these challenges and their impact on preservation efforts.
- The future of web preservation: We will explore emerging technologies and strategies in web archiving, including machine learning approaches to capture dynamic content and efforts to preserve the context of web pages.
- Call to action: We will emphasize the importance of supporting and expanding web archiving efforts, discussing how researchers, institutions, and individuals can contribute to these initiatives.
This talk aims to provide a more nuanced understanding of digital impermanence and preservation. While acknowledging the real challenges posed by link-rot, we will highlight the often-overlooked role of web archives in maintaining our digital heritage. By doing so, we hope to foster greater appreciation for web archiving efforts and encourage increased support for these crucial initiatives.
Our goal is to leave the audience with a renewed perspective on the state of the web's preservability and a clear understanding of why supporting web archiving is essential for ensuring the longevity and accessibility of our shared digital knowledge. As we navigate an increasingly digital world, recognizing that much of what seems lost may actually be preserved is vital for researchers, educators, journalists, lawyers, and anyone who values the continuity of online information.
SESSION 04: DISCOVERY & ACCESS (NEWS/NEWSPAPERS)
Unlocking the Archive: Open Access to News Content as Corpora
Jon Carlstedt Tønnessen, Magnus Breder Birkenes
National Library of Norway, Norway
The content of web archives is potentially highly valuable to research and knowledge production. However, most web archives have strict access regimes to their collections, and with good reason: archived content is often subject to copyright restrictions and potentially also data protection laws. When moving towards best practices, a key question is how to improve access, while also maintaining legal and ethical commitments.1
This presentation will show how the National Library of Norway (NB) has worked to provide open access to a corpus of more than 1.5 million news articles in the web archive. By providing the collection as data, scoping it across the typical crawl-job-oriented segmentation, anyone gets access to computational text analysis at scale. By serving metadata and snippets of content through a REST API and keeping the full content in-house, we align with FAIR principles while accounting for intellectual property rights and data protection laws.2
The key steps in building the news corpora will be walked through, such as:
- extracting data from WARC (illustrated in the sketch after this list)
- removing boilerplate content for purposes of Natural Language Processing (NLP)
- curating and filtering across crawl-oriented collections
- tokenising full-text for computational analysis
- quality assessment before publishing
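As a rough illustration of the first steps in this list, the sketch below extracts main text from WARC responses using the warcio and trafilatura libraries. It is a generic example with an assumed file name and naive tokenisation, not the corpus-build code referenced in the notes below.

```python
from warcio.archiveiterator import ArchiveIterator
import trafilatura  # main-text extraction, i.e. boilerplate removal

def extract_articles(warc_path):
    """Yield (url, text) pairs for HTML responses in a WARC file,
    with navigation menus, ads and other boilerplate stripped."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # returns the main text, or None
            if text:
                yield url, text

# "news-crawl.warc.gz" is a placeholder file name; tokenisation here is simple whitespace splitting.
for url, text in extract_articles("news-crawl.warc.gz"):
    print(url, len(text.split()), "tokens")
```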
Further, we will demonstrate how anyone can tailor corpora for their own use and analyse news text at scale - either with user-friendly apps, or with computational notebooks via API.3
The demonstration highlights some of the limitations, but also the great possibilities for allowing distant reading of web archives. We will discuss how the approach to collections as data provides broader access and new perspectives for researchers. Open access further allows for utilisation in new contexts, such as higher education, government and commercial business. With easy-to-use web applications on top, the threshold for non-technical users is lowered, potentially increasing the use of web archives vastly. We also reflect on how interdisciplinary cooperation and user-orientation have been vital in designing and building the solution.
1 Caroline Nyvang and Eld Zierau, “Untangling Nordic Web Archives”, in The Nordic Model of Digital Archiving (Routledge, 2023), 191–92; Niels Brügger and Ralph Schroeder, Web as History: Using Web Archives to Understand the Past and the Present (London: UCL Press, 2017), 10.
2 Magnus Breder Birkenes and Jon Carlstedt Tønnessen. (2024). “corpus-build”. Github. National Library of Norway. https://github.com/nlnwa/corpus-build/; Thomas Padilla. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8; Sally Chambers. (2021). “Collections as Data : Interdisciplinary Experiments with KBR’s Digitised Historical Newspapers : a Belgian Case Study”. DH Benelux: The Humanities in a Digital World. 1–3; Magnus Breder Birkenes, Lars Johnsen, and Andre Kåsen. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings.
3 Apps and notebooks will be available as open-source code by the end of November 2024. For similar services for digitised content, see “Apper fra DH-LAB”. (2024). National Library of Norway. https://www.nb.no/dh-lab/apper/; “Digital tekstanalyse”. (2024). National Library of Norway. https://www.nb.no/dh-lab/digital-tekstanalyse/
Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks
Tyng-Ruey Chuang1, Chia-Hsun Wang1, Hung-Yen Wu1,2
1: Academia Sinica, Taiwan; 2: National Yang Ming Chiao Tung University, Taiwan
We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are now reincarnated as research datasets meeting the FAIR data principles. Specifically, we focus on Taiwan's Apple Daily and work on the WARC files built by the Archive Team in September 2022 at a time when the future of the newspaper seemed dim.0 We convert these WARC files into de-duplicated collections of pure text in ninjs (News in JSON) format.1
The Apple Daily in Taiwan had been in publication since 2003 but discontinued its print edition in May 2021. In August 2022, its online edition was no longer being updated, and the entire news website has become inaccessible since March 2023. The fate of Taiwan's Apple Daily followed that of its (elder) sister publication in Hong Kong. The Apple Daily in Hong Kong was forced to cease its entire operation after midnight June 23, 2021.2 Its pro-democracy founder, Jimmy Lai (黎智英)3, was arrested under Hong Kong's security law the year before.
Being orphaned and offline, past reports and commentaries from the newspapers on contemporary events (e.g. the Sunflower Movement in Taiwan and the Umbrella Movement in Hong Kong) become unavailable to the general public. Such inaccessibility has impacts on education (e.g. fewer news sources to be edited into Wikipedia), research (e.g. fewer materials to study the early 2000s zeitgeist in Hong Kong and Taiwan), and knowledge production (e.g. fewer traditional Chinese corpora to work with).
Our work in transforming the WARC records into ninjs objects produces a collection of 953,175 unique news articles totaling 4.3 GB. The articles are grouped by the day/month/year they were published, so it is convenient to look up a specific date for the news published on that day. Metadata about each article — headline(s), subject(s), original URI, unique ID, among others — is mapped into the corresponding fields of the ninjs object for ready access.
(For figures, please access this dataset.4)
Figure 1 shows the ninjs object derived from a news article that was published on 2014-03-19, archived on 2021-09-29, and converted by us on 2024-02-17. Figure 2 is a screenshot of the webpage where the news was originally published. Figure 3 displays the text file of the ninjs object in Figure 1. Currently, the images and videos accompanying the news articles have not been extracted. A further process is planned to preserve these media files and link to them from the produced ninjs objects.
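As a sketch of the mapping described above, the snippet below builds a minimal ninjs-style object from an already-extracted article. The input dictionary keys and example values are hypothetical; the output field names (uri, headline, body_text, versioncreated, language, subject) follow the IPTC ninjs schema, but this is not the project's actual conversion code.

```python
import json

def to_ninjs(article):
    """Map an extracted article (a plain dict from a hypothetical parsing step)
    onto a minimal ninjs-style object."""
    return {
        "uri": article["url"],                   # original URI of the news item
        "headline": article["title"],
        "body_text": article["text"],            # de-duplicated plain text
        "versioncreated": article["published"],  # publication date, ISO 8601
        "language": article.get("language", "zh-Hant"),
        "subject": [{"name": s} for s in article.get("subjects", [])],
    }

article = {
    "url": "https://news.example/20140319/sample-article",  # placeholder URI
    "title": "Example headline",
    "text": "Full article text ...",
    "published": "2014-03-19T00:00:00+08:00",
    "subjects": ["politics"],
}
print(json.dumps(to_ninjs(article), ensure_ascii=False, indent=2))
```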
In our presentation, we shall elaborate on technical details (such as the accuracy and coverage of the conversion) and exemplary use cases of the collection. We will touch on the roles of public research organizations in preserving and making available materials that are deemed out of commerce and circulation.
0 https://wiki.archiveteam.org/index.php/Apple_Daily#Apple_Daily_Taiwan
1 https://iptc.org/standards/ninjs/
2 https://web.archive.org/web/20210623212350/https://goodbye.appledaily.com/
3 https://en.wikipedia.org/wiki/Jimmy_Lai
4 https://pid.depositar.io/ark:37281/k5p3h9k37
NewsWARC: Analyzing News Over Time in the Web Archive
Amr Emara2, Khaled Ezz2, Shaden Hazem2, Youssef Eldakar1
1: Bibliotheca Alexandrina, Egypt; 2: Alamein International University, Egypt
News consumption, as studies generally suggest, is quite common globally. Today, wherever there is an Internet connection, individuals access news predominantly online. On the web, news websites rank relatively high by number of visits. Considering the history of the web, the news media industry was one of the earliest sectors of society to adopt the web. Being of such significance, news content on the web particularly merits investigation, using the web archive as a data source.
We present NewsWARC, a tool developed as an internship project to aid researchers in exploring news content in a web archive collection over time. NewsWARC consists of two components: the data analyzer and the viewer. The data analyzer runs over the data in the collection and uses machine learning to extract information about each news article or post, namely sentiment, named entities, and category, and stores it in a database for access via the second component, which serves as the interface for querying and visualizing the pre-analyzed data. We report on our experience processing data from the Common Crawl news collection for testing, including comparing the performance of the data analyzer on different hardware configurations. We show examples of queries and trend visualizations that the viewer offers, such as examining how the sentiment of articles in health-related news varies over the course of a pandemic.
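The sketch below illustrates the kind of per-article analysis the data analyzer performs, using an off-the-shelf transformers sentiment pipeline and a spaCy NER model as stand-ins; NewsWARC's actual models, category classifier and database schema are not shown, and these choices are assumptions.

```python
import spacy
from transformers import pipeline

# Stand-in models; the default sentiment model and en_core_web_sm must be downloaded beforehand.
sentiment = pipeline("sentiment-analysis")
nlp = spacy.load("en_core_web_sm")

def analyze_article(text):
    """Return sentiment and named entities for one article's text.
    (Category classification, the third signal mentioned above, is omitted here.)"""
    result = sentiment(text[:512])[0]  # truncate: many sentiment models cap input length
    entities = [(ent.text, ent.label_) for ent in nlp(text).ents]
    return {"sentiment": result["label"], "score": float(result["score"]), "entities": entities}

print(analyze_article("Health officials in Cairo reported a sharp rise in cases this week."))
```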
In developing this initial prototype, we narrowed the information that the analyzer returns to sentiment, named entities, and category; a wider range of analyses could be included in future work, such as topic modeling, keyword and keyphrase extraction, measuring readability and complexity, and fact vs. opinion classification. Also as future work, this overall functionality could be deployed as a service, offering an alternative interface to supplement researcher access to web archives.
Zombie E-Journals and the National Library of Spain
José Carlos Cerdán Medina
Biblioteca Nacional de España, Spain
A "zombie e-journal" refers to an electronic journal that has become inaccessible but for which a web archive has preserved a copy, even if that copy is sometimes not perfectly accurate. It is widely recognized that, each year, a significant number of e-journals disappear without ever existing in print, resulting in the loss of their content on a global scale. This constitutes a substantial loss of economic investment, scholarly knowledge, and cultural heritage. While many universities maintain institutional repositories to safeguard publications, a large number of e-journals lack sustainable preservation methods due to financial constraints.
In response to this challenge, the Spanish Web Archive initiated efforts to explore potential solutions. A key question was posed: is it feasible to ensure the long-term preservation of more than 10,000 open-access e-journals in Spain? The National Library of Spain, which serves as the National Centre for ISSN assignment, maintains a catalogue that includes all e-journals registered with an ISSN.
The first phase of this initiative started in 2020, when the Spanish Web Archive implemented an annual broad crawl encompassing all URLs associated with electronic journals in Spain. This proactive approach significantly increases the likelihood of locating missing e-journals in the future.
Currently, the project has entered its second phase, during which e-journals that became inaccessible between 2009 and 2023 have been identified. To date, over 500 zombie e-journals have been recovered through consultations with the Spanish Web Archive. The full list of these journals is publicly available through the project’s website and integrated into the National Library’s catalogue.
In the forthcoming third phase, the identified e-journals will be formally declared out-of-commerce works, according to Directive (EU) 2019/790, thus facilitating open access to their content. This step will allow users to once again access and benefit from these resources.
Additionally, a comprehensive system has been developed to detect missing e-journals, conduct quality assurance (QA) processes on the captured content, and integrate access to these journals through the library's website and catalogue. The broad crawl has proven effective in identifying missing e-journals, and following quality assurance, the recovered information is systematically incorporated into the catalogue.
SESSION 05: SUSTAINABILITY
42 Tips to Diminish the CO2 Impact of Websites
Tamara van Zwol2, Lotte Wijsman1, Jasper Snoeren3, Tineke van Heijst4
1: National Archives of the Netherlands, Netherlands; 2: Dutch Digital Heritage Network, Netherlands; 3: Netherlands Institute for Sound and Vision, Netherlands; 4: Van Heijst Information Consulting, Netherlands
The internet has become indispensable to modern life, yet its environmental impact is often overlooked. Despite terms like "virtual" and "cloud" suggesting a minimal footprint, the global internet is a significant energy consumer. In 2020, it accounted for approximately 4% of global energy consumption, and if usage trends persist, this figure could rise to 14% by 2040. Archiving even a small number of websites contributes to the growing carbon footprint of digital archives, which compounds over time.
To address this, the Dutch Digital Heritage Network commissioned research to assess the CO2 impact of current websites across various heritage organizations. The study provided practical recommendations to reduce this impact, such as optimizing image sizes, employing green hosting, and streamlining unnecessary code. These strategies not only benefit the public-facing side of websites but also hold potential for the backend, such as in the harvesting process for archiving.
In our presentation, we will share these research findings and highlight actionable steps organizations can take to create more energy-efficient digital archives. Additionally, we will explore the question of what should be archived: Is every aspect of a website equally essential for long-term preservation?
Lastly, we are investigating incremental archiving as a solution to reduce both storage needs and emissions. This approach, which focuses on capturing specific updates rather than performing full harvests, offers a more sustainable alternative for digital preservation.
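To make the incremental-archiving argument tangible, here is a back-of-the-envelope estimator; the energy-per-gigabyte and grid-intensity constants are illustrative placeholder assumptions, not figures from the commissioned research.

```python
# Illustrative placeholder constants (not figures from the research findings):
KWH_PER_GB = 0.81    # assumed network + data-centre energy per GB transferred
G_CO2_PER_KWH = 442  # assumed grid carbon intensity, grams CO2e per kWh

def co2_per_capture(page_weight_mb, changed_fraction=1.0):
    """Rough grams of CO2e for harvesting one capture of a page.
    With incremental archiving only the changed fraction is re-fetched."""
    gb = page_weight_mb / 1024 * changed_fraction
    return gb * KWH_PER_GB * G_CO2_PER_KWH

full = co2_per_capture(3.5)                               # full harvest of a 3.5 MB page
incremental = co2_per_capture(3.5, changed_fraction=0.1)  # only 10% of assets changed
print(f"full: {full:.2f} g CO2e per capture, incremental: {incremental:.2f} g CO2e per capture")
```

Whatever the exact constants, the ratio between the two cases is what matters: re-fetching only what changed scales transfer, and hence emissions, down roughly in proportion to the changed fraction.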
Building Towards Environmentally Sustainable Web Archiving: The UK Government Web Archive and Beyond
Jane Winters1, Eirini Goudarouli2, Jake Bickford2
1: University of London, United Kingdom; 2: The National Archives (UK), United Kingdom
As Subscription Video on Demand (SVOD) platforms expand, preserving DRM-protected content has become a critical challenge for web archivists. Traditional methods often fall short due to Digital Rights Management (DRM) restrictions, necessitating more adaptable solutions. This presentation covers the ongoing development of a generic toolchain based on screen recording designed to effectively address DRM restrictions, capture high-quality content, and scale efficiently.
The project is structured into two main phases. Phase One focuses on developing a system that automatically checks the quality of screen recordings. By monitoring key metrics such as frame rate, resolution, and bit rate, the system should ensure that recordings match the original content’s quality as closely as possible. This phase addresses several technical challenges, including video glitches, frame drops, low resolution, and audio syncing issues. These problems arise from varying network conditions, software performance issues, and hardware limitations. To refine and validate the toolchain, over 100 hours of competition footage from the Paris 2024 Olympic Games have been collected and are being used to assess the system’s performance. This dataset is crucial for ensuring that the toolchain can handle high-quality recordings effectively.
Phase Two tackles the specific challenges posed by DRM restrictions. Level 1 DRM, which involves a trusted environment and hardware restrictions, uses hardware acceleration that causes black screens when video playback and screen recording are attempted simultaneously. Additionally, many SVOD platforms limit high-resolution playback on Linux systems, complicating the capture of high-quality content. To circumvent these issues, playback should be handled on remote machines running Windows, Mac, or Chrome OS—environments where high-resolution limitations do not apply—while recording is performed on Linux systems. For HD video content, which generally involves Level 3 DRM with only software restrictions, Linux can be used directly for both playback and recording without encountering black screen issues.
The toolchain will utilize Docker to scale the recording process by virtualizing hardware components such as display and sound cards. Docker should enable the system to manage multiple recordings concurrently, improving efficiency and reducing the time required for large-scale archiving. FFmpeg will be employed for recording, while Xvfb and ALSA will be used to virtualize the display and sound cards, respectively. By leveraging Docker for virtualization and managing workloads across various instances, the system is expected to scale effectively and accelerate the archiving process.
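A minimal sketch of the Xvfb-plus-FFmpeg recording step described above, assuming a Linux host with Xvfb, ALSA and FFmpeg installed; the display number, resolution and codecs are arbitrary choices, and the project's Docker-based orchestration and quality checks are not shown.

```python
import subprocess

DISPLAY = ":99"            # virtual display number (assumption)
RESOLUTION = "1920x1080"

# Start a virtual X display for the browser to render into.
xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", f"{RESOLUTION}x24"])

# Record that display and the default ALSA device with FFmpeg.
ffmpeg = subprocess.Popen([
    "ffmpeg", "-y",
    "-f", "x11grab", "-video_size", RESOLUTION, "-framerate", "30", "-i", f"{DISPLAY}.0",
    "-f", "alsa", "-i", "default",
    "-c:v", "libx264", "-preset", "veryfast", "-c:a", "aac",
    "capture.mp4",
])

# A browser would now be launched on DISPLAY to play the content; once playback
# ends, recording is stopped cleanly:
# ffmpeg.terminate(); xvfb.terminate()
```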
This ongoing work aims to provide a robust and scalable solution for capturing DRM-protected content when direct downloading is not possible. The toolchain should be adaptable to various SVOD platforms and DRM systems, offering a flexible fallback method. The presentation will offer insights into the technical challenges being addressed, the strategies being developed to bypass DRM restrictions, and how the toolchain should evolve to manage large-scale content archiving effectively. Attendees will gain an understanding of the methods used to overcome DRM challenges, the role of Docker in scaling, and the practical applications of this toolchain in preserving valuable web content.
Preservation of Historical Data: Using Warchaeology to Process 20 Years of Harvesting
Andreas Børsheim, Marius Andre Elsfjordstrand Beck
National Library of Norway, Norway
The National Library of Norway has been harvesting the internet since the beginning of the millennium, with a primary focus on the collection and storage of data. Over 25 years, web harvesting methods and preservation systems have changed. Consequently, the collection is composed of various file types, including ARC, WARC, and files produced by NEDLIB1.
In more recent years, our focus has shifted towards access and quality assurance, and the need to include the older data has increased. But how do we utilize this data, which by now is poorly structured, has little to no documentation, and is hard for modern software to read?
In addition, the National Library of Norway is migrating to a new digital preservation system, so all of our data is expected to be moved, providing us an opportunity to clean, index and organize our collection.
To address and resolve these issues and move toward the ultimate goal of making the collections fully discoverable and available, the National Library of Norway developed an open-source tool, Warchaeology2, capable of converting, validating and deduplicating web archive collections data.
This presentation will outline how we have used this tool to process 2 PB of data harvested since 2001. The objective is better management and preservation, including identifying collections and groupings of data, parsing and sorting metadata, identifying formats and how they should be processed or converted, deduplicating files, and gathering insight about the collections generally.
We will talk about the challenges in deduping, converting, and maintaining large web archive collections, including infrastructural issues like securing sufficient storage space to complete the work. This will be a time-intensive process; we estimate several months will be required for shuffling files between storage solutions, converting and deduplicating our data. The goals of this work are a collection of data that is cleaner, smaller, easier to maintain, and, at the end of the day, accessible for our users.
1 https://web.archive.org/web/20040604032621/http://www.kb.nl/coop/nedlib/
2 https://github.com/nlnwa/warchaeology/
Analysing the Publications Office of the European Union Web Archive for the Rationalisation of Digital Content Generation
Alexandre Angers
Publications Office of the European Union, Luxembourg
More and more information from EU institutions, bodies and agencies is only made available on their public websites. However, web content often has a short lifespan, and this information is at risk of getting lost when websites are updated, substantially redesigned or taken offline. As part of its different preservation activities, the Publications Office of the EU crawls, curates and preserves the content and design of these websites, making them available for current and future generations. We also prepare an ingestion of this collection into our digital archive, to ensure its long-term preservation.
We have recently performed a full export of the most recent crawls from our web archive collection, spanning from March 2019 to September 2024, as a set of WARC files. We have extracted relevant information regarding all the “response” and “revisit” records in the collection and inserted it into a relational database, allowing efficient custom analyses. In this presentation, we will show various interesting statistics we have generated about the content of our web archive. These include the analysis of large response payloads (more than 100 MB), as well as the relative footprint of crawled video files. We also investigate the amount of duplication among records: duplication avoided through ‘revisit’ records, as well as duplicate ‘response’ records still present in the archive.
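The sketch below shows the general shape of such an analysis: loading per-record metadata into a small relational database and querying it for large payloads and duplicate digests. It uses the warcio library and SQLite with an assumed file name and schema, not the Publications Office's actual database.

```python
import sqlite3
from warcio.archiveiterator import ArchiveIterator

con = sqlite3.connect("warc_stats.db")
con.execute("""CREATE TABLE IF NOT EXISTS records
               (uri TEXT, rec_type TEXT, payload_digest TEXT, length INTEGER)""")

def load(warc_path):
    """Insert one row per response/revisit record; the schema is illustrative."""
    with open(warc_path, "rb") as stream:
        for rec in ArchiveIterator(stream):
            if rec.rec_type not in ("response", "revisit"):
                continue
            con.execute("INSERT INTO records VALUES (?, ?, ?, ?)", (
                rec.rec_headers.get_header("WARC-Target-URI"),
                rec.rec_type,
                rec.rec_headers.get_header("WARC-Payload-Digest"),
                int(rec.rec_headers.get_header("Content-Length") or 0),  # record length as a rough proxy for payload size
            ))
    con.commit()

load("collection-part-001.warc.gz")  # placeholder file name

# Responses with payloads over roughly 100 MB.
large = con.execute("""SELECT uri, length FROM records
                       WHERE rec_type = 'response' AND length > 100 * 1024 * 1024""").fetchall()

# Payload digests stored more than once as full responses, i.e. duplication not caught by revisits.
dupes = con.execute("""SELECT payload_digest, COUNT(*) AS n FROM records
                       WHERE rec_type = 'response'
                       GROUP BY payload_digest HAVING n > 1""").fetchall()
print(len(large), "large responses;", len(dupes), "digests duplicated as full responses")
```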
We then explain how we have used this information to refine our crawling strategies in order to rationalise our digital content generation going forward. We also define potential policies to curate the existing archive prior to ingestion into a long-term digital repository, where the impact on the carbon footprint may be even more significant.
SESSION 06: CURATION: SOCIAL MEDIA
Developing Social Media Archiving Guidelines at the National Archives of the Netherlands
Lotte Wijsman, Geert Leloup, Susanne Van den Eijkel, Sander Wellens
National Archives of the Netherlands, Netherlands
At the beginning of 2024, we started a project to develop a nationwide guideline for archiving public social media content. This project aims to address the increasing use of social media by Dutch governments and the current lack of archiving of this content. Our presentation at the Web Archiving Conference 2025 will focus on the process of creating this guideline and present the final version.
The primary target audience for this guideline are the information professionals, who play a vital role in managing and preserving the archived social media content. However, we also recognise communication professionals as an important target audience, given their role in setting up and using the accounts.
The guideline is structured into six modules:
- Definitions and scope
This module provides a definition of social media and identifies what constitutes public information on the various platforms.
- Legal framework
In this module, we examine the Dutch and European legal requirements and constraints related to archiving public social media content. Understanding the legal landscape is essential to ensure compliance and address any legal challenges.
- Recommendations for communication professionals
This module provides practical recommendations on using social media in a way that facilitates easier archiving. Aimed at those managing the social media accounts, it includes tips on account settings and content creation.
- Appraisal and selection
This module addresses how content can be appraised and selected, also ensuring that historically important information will eventually be transferred to the Dutch National Archives.
- Criteria and techniques
In this module we establish quality criteria for archiving social media content and explore various techniques to archive social media. Methods discussed include screen capturing and API usage. This module aims to equip professionals with the knowledge to choose the most effective archiving methods.
- Case studies
The final module presents real-world examples from the Netherlands and abroad. These case studies illustrate diverse methods and results, providing practical insights and lessons learned from other practitioners in the field.
The creation of this guideline was a collaborative and intensive year-long process. We systematically engaged with a wide range of stakeholders and incorporated their feedback to ensure the guideline is comprehensive and practical. Our goal is to support government agencies in archiving their social media communications effectively.
We are excited to share our journey and the outcomes of this project with our colleagues at the Web Archiving Conference. By presenting our experiences and insights, we hope to contribute to the ongoing discourse on social media archiving and inspire others in the field.
Archiving the Social Media Profiles of Members of Government
Ben Els
National Library of Luxembourg, Luxembourg
As part of the 2023 national elections, the National Library of Luxembourg, in collaboration with the National Archives and the Ministry of State, launched a pilot project to archive the social media profiles of members of the government. The technical obstacles to archiving social platforms are becoming increasingly problematic, resulting in the situation that none of the major platforms can currently be archived effectively by our harvesters and service providers. Since most social media platforms are practically inaccessible by web crawlers and conventional web archiving methods, we decided to try a more direct approach, by asking the members of government directly to download the data from their profiles and hand them over to the National Library and National Archives.
With the help of the Ministry of State, we sent out a call for participation, with specific guidelines for exporting datasets from social networks, to the archive delegates and communication departments of each ministry, as well as to the ministers themselves. The response to this first call for participation was very positive, despite the time pressure between the election and the formation of a new government, with a high chance of many ministers leaving their offices.
In addition to preparing the guidelines for downloading datasets from the different platforms, we offered direct technical support to the people involved in the ministries, and in some cases to the members of government themselves, retrieving the data individually on site.
We were able to retrieve the majority of the government's profiles, covering the five-year span of their term. This pilot project represents a direct and effective method of securing the data of profiles of high public interest. The National Library and National Archives of Luxembourg plan to repeat the same collection process by the end of 2024 and hope to move to regular operation after that.
This presentation will cover the different steps of the collection process, the lessons learned from the pilot project and the second operation at the end of 2024. We will conclude with an outlook on the changes we hope to implement in the future, a possible extension of the collection scope, and our plans in terms of public access to the collections.
From Posts to Archives: The National Library of Singapore’s Journey in Collecting Social Media
Shereen Tay, Meiyu Lee
National Library Board Singapore, Singapore
Social media plays a huge role in our everyday life today. It is used for a myriad of activities such as communication, entertainment, business, and even as personal diaries. In Singapore, about 85% of the population uses social media, the most popular platforms being Facebook, Instagram, YouTube, and TikTok. Besides individuals, many organisations have also turned to social media to engage and communicate with their followers. With such prevalent use, social media is becoming an important source of information about the lives and stories of our country and people.
Recognising this, the National Library of Singapore (NLS) began looking at collecting social media. Our journey started in 2017, and the initial years focused on research and experiments, such as conducting an environmental scan of other heritage institutions’ experiences in collecting social media, proofs-of-concept using web archiving and available APIs, and trialling commercial vendors’ solutions. Our experience was similar to that of many institutions around the world. Collecting social media is complex and poses many technical, legal, and ethical challenges, such as limited access to APIs and the need to manage personal data and third-party content.
Despite these challenges, we knew that we had to start collecting social media given its increasing significance. This was not only to meet our mandate of collecting and preserving our country’s digital memories, but also to gain practical experience in how to collect, organise, and manage this format.
Putting together what we have learnt, we developed a social media collecting framework in 2023 to provide guidance on how to collect social media amidst these challenges while ensuring that a representative set of social media content can be collected for future generations and research. Our framework covered the selection criteria, the collecting methods, and our collecting approach for key social media platforms that are widely used in Singapore.
We piloted our first social media collecting effort in the same year, under NLS’ new two-year project to collect contemporary materials on Singapore food and youth. The purpose was to assess individuals’ and organisations’ receptiveness to contributing their social media accounts to us, which was greater than we anticipated. In 2024, we made collecting social media part of our operational work. Our collection strategy was three-pronged: 1) outsourcing the archiving of significant persons’ and organisations’ social media accounts to a commercial vendor; 2) approaching identified organisations, based on subjects, to contribute their social media accounts; and 3) engaging and promoting social media collecting through advocates and an annual public call to nominate favourite Singapore social media accounts, YouTube and TikTok videos, as well as websites.
This presentation will highlight NLS’ journey in collecting social media, our collecting framework and strategy, as well as learning points and future plans.
Innovative Web Archiving Amid Crisis: Leveraging Browsertrix and Hybrid Working Models to Capture the UK General Election 2024
Nicola Bingham, Jennie Grimshaw
British Library, United Kingdom
The British Library, in collaboration with the National Libraries of Scotland and Wales, the Bodleian Library and Cambridge University Library, has created collections of archived websites for all UK general elections since 2005. This time series shows how internet use in political communication has evolved, and how the fortunes of political parties have changed. The 2024 general election was called unexpectedly on May 22nd, and took place on July 4th, at a time when the UK Web Archive was inaccessible, and our Web Archiving and Curation Tool was unavailable following a devastating ransomware attack on the British Library on October 29th 2023. Working together, we nevertheless created a collection of 2253 archived websites covering candidates' campaign sites, social media feeds of significant politicians and journalists, local and national party sites, comment by think tanks, community engagement, news sources, and manifestos of a plethora of interest groups seeking to influence the new government.
To facilitate use by researchers tracking change over time, we have organised the material into the same sub-collections used since 2005. We collected campaign websites for a sample of English candidates in the same counties and urban areas we have covered since 2005, but all Scottish and Welsh candidates’ sites were gathered, as their numbers are manageable. We also targeted marginal constituencies, which had increased dramatically in number since 2019. The 2024 general election saw the rise of formerly minor parties such as Reform UK to national prominence, a Liberal Democrat resurgence, the growing influence of independent candidates, the rise of identity politics, with groups encouraged to vote as a bloc on issues such as the war in Gaza, and an increasingly sophisticated use of social media.
The technical outage caused by the ransomware attack necessitated a unique approach due to the disruption in our usual workflows. Despite the challenges, websites continued to be archived using Heritrix on AWS servers rather than the Library's in-house infrastructure. This shift required a new workflow, involving the use of simple spreadsheets and collaborative efforts to quickly refine metadata definitions and crawl scope, aiming to replicate our existing curatorial software as closely as possible.
In addition, the British Library secured a free-trial subscription to Browsertrix, which allowed us to explore and learn this new tool’s capabilities ahead of a more formal subscription. Despite the challenges, we successfully captured 1,600 snapshots of social media content, including posts from X (formerly Twitter), Facebook, and Instagram.
This experience introduced library staff to working within data and time constraints, enhancing our understanding of how to effectively scope crawls, monitor them in real-time, and implement new quality assurance practices. The project resulted in a hybrid collecting model, utilising both Heritrix and Browsertrix for the same thematic collection.
The presentation will discuss the challenges and opportunities encountered during this project, providing valuable insights for those interested in Browsertrix’s capabilities and in executing web archiving with a mixed-model approach across different institutions with diverse interests and expertise, in unusually challenging circumstances, within the framework provided by a historic time series.
SESSION 07: RESEARCH & ACCESS
From Pages to People: Tailoring Web Archives for Different Use Cases
Andrea Kocsis2, Leontien Talboom1
1: Cambridge University Libraries, United Kingdom; 2: University of Edinburgh, United Kingdom
Our paper explores different modes of reaching the three distinct audiences identified in previous work with the National Archives UK: readers, data users, and the digitally curious. Building on the examples of our work conducted at the Cambridge University Libraries and the National Library of Scotland, our paper gives recommendations and demonstrates good practices for designing web archives for different audience needs while ensuring wide access.
Firstly, to improve the experience of general readers, we employ exploratory and gamified interfaces and public outreach events, such as exhibitions, to raise library users' awareness of the available web archive resources. Secondly, to serve the data user community, we put an emphasis on curating metadata datasets and Datasheets for Data documentation, encouraging quantitative research on the web archive collections. This work also involves outreach events, such as data visualisation calls, which can later be incorporated into the resources for general readers. Finally, to overcome the digital skills gap, we tailored in-library workshops for the digitally curious - those who recognise the potential of web archives but lack advanced computational skills. We expect that upskilling the digitally curious can spark their interest in exploring and using the web archive collections.
To sum up, our paper introduces the work we have been doing to improve the usability of the UK Web Archive within our institutions by developing additional materials (datasets, interfaces) and planning outreach events (exhibitions, calls, workshops) to ensure we meet the expectations of readers, data users, and the digitally curious.
Making Research Data Published to the Web FAIR
Bryony Hooper, Ric Campbell
University of Sheffield, United Kingdom
The University of Sheffield’s vision for research is that “our distinctive and innovative research will be world-leading and world-changing. We will produce the highest quality research to drive intellectual advances and address global challenges.” (https://www.sheffield.ac.uk/openresearch/university-statement-open-research)
Research data published to the web can offer opportunities for wider discovery of and access to your research outputs. However, websites are an inherently fragile medium, and there is no assurance that this discovery and access will remain available for as long as it is needed, or that we can evidence our research impact over time. This includes potentially wanting to submit sites as part of a UK Research Excellence Framework submission (the next is scheduled for 2029).
Funding requirements may also stipulate how long funders expect the outputs to remain accessible. Years of work, including work undertaken with public funding, could disappear if there is no intervention. In addition, publishing research data to the web alone cannot provide assurances of meeting the University of Sheffield’s commitment to FAIR principles (findable, accessible, interoperable and reusable) and Open Research and Open Data practices.
At the University of Sheffield, colleagues in our Research Data Management (RDM) team have also noticed a trend of researchers depositing in the Institutional Repository (ORDA) links to the URLs where the data is situated. In some cases, the website is the research output in its entirety, so its maintenance falls outside the RDM team’s remit and we cannot provide the usual assurances about preserving that deposit.
This paper will discuss the work undertaken by the University of Sheffield’s Library to mitigate potential loss of research data published online. It will include a case study of capturing a research group’s website for deposit in our institutional data repository, the collaborative creation of guidance for researchers and research data managers, and the embedding of good practice at the University so that Open Research and Open Data remain open and FAIR.
Enhancing Accessibility to Belgian Born-Digital Heritage: The BelgicaWeb Project
Christina Vandendyck
Royal Library of Belgium (KBR), Belgium
The BelgicaWeb project aims to make Belgian digital heritage more FAIR (i.e. Findable, Accessible, Interoperable and Reusable) to a wide audience. BelgicaWeb is a BRAIN 2.0 project funded by BELSPO, the Belgian Science Policy Office. It is a collaboration between CRIDS (University of Namur), who provide expertise on the relevant legal issues; IDLab, GhentCDH and MICT (Ghent University), who will work on data enrichment, user engagement and evaluation, and outreach to the research community, respectively; and KBR (Royal Library of Belgium), who act as project coordinator and work on the development of the access platform and API as well as data enrichment.
By leveraging web and social media archiving tools, the project focuses on creating comprehensive collections, developing a multilingual access platform, and providing a robust API enabling data-level access. At the heart of the project is a reference group of experts who provide iterative input on the selection, development of the API and access platform, data enrichment and quality control and usability. Therefore, the project contributes to moving towards best practices for search and discovery.
The project goes beyond data collection by means of open-source tools by enriching and aggregating (meta)data associated with these collections using innovative technologies such as Linked Data and Natural Language Processing (NLP). This approach enhances search capabilities, yielding more relevant results for both researchers and the general public.
In this presentation, we will provide an overview of the BelgicaWeb project’s system architecture, the technical challenges we encountered, and the solutions we implemented. We will demonstrate how the access platform and API offer powerful, relevant, and user-friendly search functionalities, making it a valuable tool for accessing Belgium’s digital heritage. Attendees will gain insights into our development process, the technologies employed, and the benefits of our open-source approach for the web archiving and by extension the digital preservation communities.
Using Generative AI to Interrogate the UK Government Web Archive
Chris Royds, Tom Storrar
The National Archives (UK), United Kingdom
Our project seeks to make the contents of Web Archives more easily discoverable and interrogable, through the use of Generative AI (Gen-AI). It explores the feasibility of setting up a chatbot, and using UK Government Web Archive data to inform its responses. We believe that, if this approach proves successful, it could lead to a step-change in the discoverability and accessibility of Web Archives.
Background
Gen-AIs like ChatGPT and Copilot have impressive capabilities, but are notoriously prone to “hallucinations”. They can generate confident-sounding, but demonstrably false responses – even to the point of inventing non-existent academic papers, complete with fictitious DOI numbers.
Retrieval-Augmented Generation (RAG) seeks to address this. It supplements Gen-AI with an additional database, queried whenever a response is generated. This approach aims to significantly reduce the chance of hallucination, while also enabling chatbots to provide specific references to the original sources.
Additionally, any approach used would need to take into account the occasional need to remove individual records (in line with The National Archives’ takedown policy: https://www.nationalarchives.gov.uk/legal/takedown-and-reclosure-policy/). In traditional Neural Networks, “forgetting” data is currently an intractable problem. However, it should be possible to set up RAG databases such that removal of specific documents is straightforward.
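To make the retrieval-plus-removal idea concrete, here is a minimal, illustrative sketch of a RAG-style store, not the project's actual implementation: documents are indexed, the best match is retrieved to ground a prompt, and a takedown is a simple deletion. The document identifiers are invented, and scoring uses naive token overlap purely for illustration; a real system would use vector embeddings.

```python
# Minimal illustrative sketch of a RAG-style document store (not the project's code).
# Scoring uses naive token overlap; a production system would use vector embeddings.

class SimpleRAGStore:
    def __init__(self):
        self.docs = {}  # doc_id -> text

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def remove(self, doc_id):
        # Takedown is trivial: the document simply stops being retrievable.
        self.docs.pop(doc_id, None)

    def query(self, question, top_k=1):
        q_tokens = set(question.lower().split())
        scored = [
            (len(q_tokens & set(text.lower().split())), doc_id)
            for doc_id, text in self.docs.items()
        ]
        scored.sort(reverse=True)
        return [(doc_id, self.docs[doc_id]) for _, doc_id in scored[:top_k]]


store = SimpleRAGStore()
store.add("gov-001", "Guidance on flood defence funding published in 2019.")
store.add("gov-002", "Statistics on school attendance in England, academic year 2020/21.")

# Retrieved passages would be inserted into the chatbot prompt, with source URLs cited.
print(store.query("When was flood defence funding guidance published?"))

# Honouring a takedown request only requires removing the record from the index.
store.remove("gov-001")
```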
Approach
Our project is focused on two open-source tools, both of which allow for RAG based on Web Archive records.
The first is WARC-GPT, a lightweight tool developed by a team at Harvard, designed to ingest Web Archive documents, feed them to a RAG database, and provide a chat-bot to interrogate the results. While the tool’s creators have demonstrated its capabilities on a small number of documents, we have attempted to test it at a larger scale, on a corpus of ~22,000 resources.
The second, more sophisticated tool is Microsoft’s GraphRAG. GraphRAG identifies the “entities” referenced in documents, and builds a data structure representing the relationships between them. This data structure should allow a chat-bot to carry out more in-depth “reasoning” about the contents of the original documents, and potentially provide better answers about information aggregated across multiple documents.
Results
Our initial findings suggest that WARC-GPT produces impressive responses when queried about topics covered in a single document. It quickly discovers which one of the documents in its database best answers the prompt. It summarises relevant information from that document, and provides its URL. Additionally, with a few minor tweaks to the underlying source code, it is possible to remove individual documents from its database. However, WARC-GPT’s responses fare poorly when attempting to aggregate information from multiple documents.
Our experiments with GraphRAG suggest that it outperforms WARC-GPT in aggregating information. However, while GraphRAG is reasonably quick to generate these responses, it is significantly slower and more expensive to set up than WARC-GPT. Additionally, removing individual records from GraphRAG, while possible, is computationally expensive.
SESSION 08: HANDLING WHAT YOU CAPTURED
So You’ve Got a WACZ: How Archives Become Verifiable Evidence
Basile Simon, Lindsay Walker
Starling Lab for Data Integrity, Stanford-USC, United States of America
This talk will present a workflow and toolkit, developed by the Starling Lab for Data Integrity, for collecting and organizing web archives alongside integrity and provenance data.
Co-founded by Stanford and USC, Starling supports investigators, be they journalists, lawyers, or human rights defenders, in their collection of information and evidence. In addition to using Browsertrix to crawl (and test) large sets of web archive data, we have built a downstream integration, so data flows into our cryptographically-signed and append-only database called Authenticated Attributes (AA).
AA extends Browsertrix’s utility by enabling archivists to securely attach and verify claims, provenance, and other context-critical metadata about the archived content in a secure and decentralized manner. It allows for the addition, preservation, and sharing of provenance data while facilitating efficient organization, searchability, and integration with other tools. Through AA, web archives and metadata become accessible for other applications and verification workflows, e.g. OSINT investigations.
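As a rough illustration of the general pattern, and not Starling's Authenticated Attributes implementation, the sketch below hashes a WACZ file and signs a metadata claim about it with an Ed25519 key, so the claim can later be verified independently of where the archive is stored. The file name and the field names in the claim are assumptions made for this example.

```python
# Illustrative sketch only: hash a WACZ and sign a provenance claim about it.
# Field names and the file path are assumptions, not Starling Lab's actual schema.
import hashlib
import json
from nacl.signing import SigningKey  # pip install pynacl


def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


signing_key = SigningKey.generate()      # archivist's key pair
verify_key = signing_key.verify_key      # shared with downstream verifiers

claim = {
    "wacz_sha256": sha256_of("crawl-example.wacz"),   # assumed file name
    "claim": "Captured by investigator X on 2024-11-05",
    "source_url": "https://example.org/post/123",
}
signed = signing_key.sign(json.dumps(claim, sort_keys=True).encode("utf-8"))

# Anyone holding the verify key can confirm the claim was not altered.
verify_key.verify(signed)
```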
In this presentation, we will showcase case studies and projects with our collaborators including Black Voice News, the Atlantic Council’s DFRLab, and conflict monitors.
Warc-Safe: An Open-Source WARC Virus Checker and NSFW (Not-Safe-For-Work) Content Detection Tool
László Tóth
National Library of Luxembourg, Luxembourg
We present warc-safe, the first open-source WARC virus checker and NSFW (Not-Safe-For-Work) content detection tool. Built with particular emphasis on usability and integration within existing workflows, this application detects harmful material and inappropriate content in WARC records. The tool uses the open-source ClamAV antimalware toolkit for threat detection and a specially trained AI model to analyze WARC image records. Several image formats are supported by the model (JPG, PNG, TIFF, WEBP, …), which produces a score between 0 (completely safe) and 1 (surely unsafe). This approach makes it easy to classify images and determine what to do with those that exceed a certain threshold. The warc-safe tool was developed with ease of use in mind; thus, it can be run in two modes: test mode (scan WARC files on the command line) or server mode (for easy integration with existing workflows). Server mode allows the client to use several features over an API, such as scanning a WARC file for viruses, scanning for NSFW content, or both. This makes it easy to use together with popular web archiving tools.
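warc-safe's actual command-line flags and API routes are not reproduced here; the snippet below is only a hedged sketch of how a client might call a scanning service of this kind in server mode. The endpoint path, parameter names, response fields and threshold are all assumptions for illustration, not the tool's documented interface.

```python
# Hypothetical client sketch; the endpoint, parameters and response fields are
# assumptions for illustration and are NOT warc-safe's documented API.
import requests

WARC_PATH = "/data/harvest/example-2024.warc.gz"   # assumed path
SERVER = "http://localhost:8080"                    # assumed server address

resp = requests.post(
    f"{SERVER}/scan",                               # assumed route
    json={"warc": WARC_PATH, "checks": ["virus", "nsfw"]},
)
resp.raise_for_status()
report = resp.json()

# e.g. flag image records whose (assumed) NSFW score exceeds a local threshold
for record in report.get("records", []):
    if record.get("nsfw_score", 0.0) > 0.8:
        print("review needed:", record.get("url"))
```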
To illustrate this, we present a case study where warc-safe was integrated into SolrWayback and the UK Web Archive’s warc-indexer. This integration made it possible to enrich the metadata indexed from WARC files, by extending the existing Solr schema with several new fields related to virus- and NSFW-test results, allowing for advanced searching and statistical analysis. Finally, we discuss how warc-safe could be used within an institutional framework, for instance by scanning newly harvested WARC files resulting from large-scale harvesting campaigns as well as including it within existing indexing workflows.
Detecting and Diagnosing Errors in Replaying Archived Web Pages
Jingyuan Zhu1, Huanchen Sun2, Harsha Madhyastha2
1: University of Michigan, United States of America; 2: University of Southern California, United States of America
When a user loads an archived page from a web archive, the archive must ensure that the user’s browser fetches all resources on the page from the archive, not from the original website. To achieve this, archives rewrite references to page resources that are embedded within crawled HTMLs, stylesheets, and scripts.
Unfortunately, the widespread use of JavaScript on modern web pages has made page rewriting challenging. Beyond rewriting static links, archives now also need to ensure that dynamically generated requests during JavaScript execution are intercepted and rewritten. Given the diversity of scripts on the web, rewriting them often results in fidelity violations, i.e., when a user loads an archived page, even if all resources on the page had been crawled and saved, either some of the content that appeared on the original page is missing or some functionality that ought to work on archived pages (e.g., menus, change page theme) does not.
To verify whether the replay of an archived page preserves fidelity, archival systems currently compare either screenshots of the page taken during recording and replay or errors encountered in both loads (e.g., https://docs.browsertrix.com/user-guide/review/). These methods have several significant drawbacks. First, modern web pages often include dynamic components, such as animations or carousels, so screenshots of the same page copy can vary across loads. Second, incorrect replay does not always result in additional script execution or resource fetch errors, and the presence of such errors does not necessarily indicate user-visible problems. Lastly, even if an archived page does differ from the original page, existing methods cannot pinpoint which inaccuracies in page rewriting led to this problem.
In this talk, we will describe our work in developing a new approach for a) more reliably detecting whether the replay of an archived page violates fidelity, and b) pinpointing the cause when this occurs. Fundamental to our approach is that we do not focus on only the externally visible outcomes of page loads (e.g., pixels rendered and runtime/fetch errors). Instead, both during recording and replay, we capture each visible element in the browser DOM tree, including its location on the screen and dimensions, and the JavaScript writes that produce visible effects. Our fine-grained representation of page loads also enables us to precisely identify the rewritten source code that led to fidelity violations. The fix has to be ultimately determined by a human developer. However, we are able to validate the root cause we identify by either inserting only the problematic rewrite into the original page or by selectively rolling back that edit from the rewritten archived page and examining the corresponding effects.
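A highly simplified sketch of the comparison idea, under the assumption that each load has been reduced to a list of visible DOM elements with their positions and sizes (the authors' actual representation is richer, also covering JavaScript writes), might look like this; the element identifiers and tolerance value are invented for the example.

```python
# Simplified sketch: compare visible elements captured during recording vs. replay.
# Assumes each element is summarised as a dict with a stable id plus x, y, width, height.

def index_elements(elements):
    return {e["id"]: e for e in elements}

def compare_loads(recorded, replayed, tolerance=2):
    rec, rep = index_elements(recorded), index_elements(replayed)
    missing = [eid for eid in rec if eid not in rep]
    moved = [
        eid for eid in rec if eid in rep and any(
            abs(rec[eid][k] - rep[eid][k]) > tolerance
            for k in ("x", "y", "width", "height")
        )
    ]
    return {"missing": missing, "moved_or_resized": moved}

recording = [{"id": "nav-menu", "x": 0, "y": 0, "width": 1200, "height": 60},
             {"id": "hero-img", "x": 0, "y": 60, "width": 1200, "height": 400}]
replay = [{"id": "nav-menu", "x": 0, "y": 0, "width": 1200, "height": 60}]

print(compare_loads(recording, replay))  # hero-img missing -> likely fidelity violation
```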
In our study across tens of thousands of diverse pages, we have found that pywb (version 2.8.3) fails to accurately replay archived copies of approximately 15–17% of pages. Importantly, compared to relying on screenshots and errors to detect low fidelity replay, our approach reduces false positives by as much as 5x.
Building a Toolchain for Screen Recording-Based Web Archiving of SVOD Platforms
Alexis Di Lisi
Institut National de l'Audiovisuel (INA), France
As Subscription Video on Demand (SVOD) platforms expand, preserving DRM-protected content has become a critical challenge for web archivists. Traditional methods often fall short due to Digital Rights Management (DRM) restrictions, necessitating more adaptable solutions. This presentation covers the ongoing development of a generic toolchain based on screen recording designed to effectively address DRM restrictions, capture high-quality content, and scale efficiently.
The project is structured into two main phases. Phase One focuses on developing a system that automatically checks the quality of screen recordings. By monitoring key metrics such as frame rate, resolution, and bit rate, the system should ensure that recordings match the original content’s quality as closely as possible. This phase addresses several technical challenges, including video glitches, frame drops, low resolution, and audio syncing issues. These problems arise from varying network conditions, software performance issues, and hardware limitations. To refine and validate the toolchain, over 100 hours of competition footage from the Paris 2024 Olympic Games have been collected and are being used to assess the system’s performance. This dataset is crucial for ensuring that the toolchain can handle high-quality recordings effectively.
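As an illustration of the kind of automated check Phase One describes, and not INA's actual system, the sketch below uses ffprobe to read a recording's resolution and frame rate and compares them against expected values; the thresholds and file name are assumptions.

```python
# Illustrative sketch: read basic stream metrics of a screen recording with ffprobe
# and compare them to expected values. Thresholds and file name are assumptions.
import json
import subprocess

EXPECTED = {"width": 1920, "height": 1080, "min_fps": 24.0}

def probe(path):
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,avg_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    num, den = stream["avg_frame_rate"].split("/")
    stream["fps"] = float(num) / float(den) if float(den) else 0.0
    return stream

s = probe("olympics-capture-001.mkv")   # assumed file name
ok = (s["width"] >= EXPECTED["width"]
      and s["height"] >= EXPECTED["height"]
      and s["fps"] >= EXPECTED["min_fps"])
print("quality check passed" if ok else f"quality check failed: {s}")
```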
Phase Two tackles the specific challenges posed by DRM restrictions. Level 1 DRM, which involves a trusted environment and hardware restrictions, uses hardware acceleration that causes black screens when video playback and screen recording are attempted simultaneously. Additionally, many SVOD platforms limit high-resolution playback on Linux systems, complicating the capture of high-quality content. To circumvent these issues, playback should be handled on distant machines running Windows, Mac, or Chrome OS—environments where high-resolution limitations do not apply—while recording is performed on Linux systems. For HD video content, which generally involves Level 3 DRM with only software restrictions, Linux can be used directly for both playback and recording without encountering black screen issues.
The toolchain will utilize Docker to scale the recording process by virtualizing hardware components such as display and sound cards. Docker should enable the system to manage multiple recordings concurrently, improving efficiency and reducing the time required for large-scale archiving. FFmpeg will be employed for recording, while Xvfb and ALSA will be used to virtualize the display and sound cards, respectively. By leveraging Docker for virtualization and managing workloads across various instances, the system is expected to scale effectively and accelerate the archiving process.
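The general mechanics of recording a virtual display could look something like the following sketch, which starts an Xvfb display and captures it with FFmpeg's x11grab input. The display number, geometry, ALSA device, duration and output name are assumptions, and INA's containerised toolchain is of course more involved.

```python
# Illustrative sketch: record a virtual X display with FFmpeg (x11grab) for 60 seconds.
# Display number, geometry, ALSA device and output name are assumptions.
import subprocess
import time

DISPLAY = ":99"

# Start a virtual framebuffer; a browser playing the SVOD content would be
# launched on this same DISPLAY in a real pipeline.
xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", "1920x1080x24"])
time.sleep(2)  # give the virtual display a moment to start

try:
    subprocess.run(
        ["ffmpeg", "-y",
         "-f", "x11grab", "-framerate", "30", "-video_size", "1920x1080",
         "-i", DISPLAY,
         "-f", "alsa", "-i", "default",   # virtualised sound card
         "-t", "60",                       # record one minute
         "capture.mkv"],
        check=True,
    )
finally:
    xvfb.terminate()
```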
This ongoing work aims to provide a robust and scalable solution for capturing DRM-protected content when direct downloading is not possible. The toolchain should be adaptable to various SVOD platforms and DRM systems, offering a flexible fallback method. The presentation will offer insights into the technical challenges being addressed, the strategies being developed to bypass DRM restrictions, and how the toolchain should evolve to manage large-scale content archiving effectively. Attendees will gain an understanding of the methods used to overcome DRM challenges, the role of Docker in scaling, and the practical applications of this toolchain in preserving valuable web content.
PANELS
PANEL 01: ENGAGING AUDIENCES
Beyond Preservation: Engaging Audiences and Researchers with Web Archives
Eveline Vlassenroot1, Peter Mechant1, Friedel Geeraert2, Christina Vandendyck2
1: University of Ghent, Belgium; 2: Royal Library of Belgium (KBR), Belgium
As the digital landscape continues to evolve, web archives have emerged as essential tools for preserving our collective digital heritage, as they have a unique historical value and help preserve the digital cultural memory of society.
However, the challenge facing web archives today extends far beyond preserving information. The true value of these archives lies not only in the wealth of information they safeguard but in their ability to make this information accessible, meaningful, and usable to a wide range of audiences. Web archives hold the potential to be vibrant, interactive resources that engage not just researchers but also educators, students, journalists, and the broader public. This panel seeks to address the critical question: How can web archives move beyond preservation to engage and inspire diverse user groups actively?
One of the key focuses will be on the development of advanced (search) interfaces and (AI-driven) discovery tools that can revolutionize how users interact with archived content. For example, conventional search methods frequently struggle to effectively navigate datasets found in web archives. Discovery tools offer the potential to enhance searchability, improve data visualization, and tailor the user experience to meet individual needs. However, the implementation of (for example AI-driven) technologies also raises questions about biases, ethics, and the risk of creating new barriers to access, which the panel will critically examine.
Participatory archiving practices will also be a point of discussion. The concept of involving users directly in the archiving process—whether through community-driven content curation, crowdsourced metadata enhancement, or collaborative research initiatives—challenges traditional notions of archival practices. This participatory approach not only democratizes the archiving process but also enriches the archive’s content, making it more reflective of diverse societal voices and experiences. By turning users into active contributors rather than passive consumers, web archives can foster a deeper connection with their audiences, encouraging ongoing engagement and investment in the archive’s future.
The discussions during the panel will highlight successful case studies and collaborative projects that have bridged the gap between archives and their users, showcasing how archives can support new research methodologies and public engagement.
By fostering a dialogue on these topics, we aim to inspire web archivists and researchers to rethink how they can make web archives more relevant and impactful for diverse audiences. The set-up of the panel aligns well with the subtheme ‘Strategies for engagement of internal and external stakeholders’ in the call for papers.
The panel discussion will be centered around specific statements that the panelists can respond to and provide their insights. To engage the public and gather their feedback, we will also incorporate an interactive format that allows for audience participation.
Below are some examples of the questions and statements that will be used:
- Users of web archives should be active participants in the shaping of web archives, not just consumers.
- A user-centric web archive must enhance accessibility and engagement.
- Web archives are stuck in an academic bubble: to break free they must engage with the broader public.
- New methodologies (in digital humanities) must be developed to manage and use this (overwhelming) flood of data in web archives. // Web archives as new research opportunities or an overwhelming flood of data?
- Without innovative interfaces web archives will remain inaccessible to most.
- Web archives need to tailor their services to the archival literacy of their public. E.g., web archives should provide different search and discovery interfaces tailored to the archival literacy of their users.
- Web archives lack sufficient funds and/or expertise to thoroughly engage with their users.
- Doing active outreach and user engagement is the best way to avoid biases in our web archival collections.
- AI can revolutionize how users interact with web archive content.
- AI will help us curate more inclusive/representative collections in the future.
PANEL 02: CROSS-INSTITUTIONAL COLLABORATIONS
Past, Present & Future of Cross-Institutional Collaboration in Web Archiving: Insights from the Norwegian and Danish Web Archive, the NetArchiveSuite Community, & Beyond
Anders Klindt Myrvoll1, Thomas Langvann2, Sara Aubry3, José Carlos Cerdán Medina4, Niels Ørbæk Chemnitz5, Abbie Grotke6
1: Royal Danish Library, Denmark; 2: National Library of Norway, Norway; 3: Bibliothèque nationale de France, France; 4: Biblioteca Nacional de España, Spain; 5: Analysis & Numbers, Denmark; 6: Library of Congress, United States of America
The panelists will embark on a journey looking at the formation and history of the Danish and Norwegian Web Archives and the NetArchiveSuite cross-institutional collaboration, their present state, and the lessons learned from the first 20 years of web archiving. We’ll examine the future of web archiving at these two legal deposit, national library institutions, the NAS community and beyond.
By “Beyond” we mean being inspired by and collaborating with other web archives and institutions in the Cultural Heritage-sector, with companies on the cutting edge of web crawling, as well as other organizations like panelists Analysis & Numbers who specialize in quantifying complexity: including the mapping of social phenomena and tendencies, measuring attitudes and behaviour, and quantifying digital and social movements. Their work provides insight into complex topics such as volunteering, digital spaces, democracy and participation, and patterns in the spread of hate speech, mis- and disinformation online. Through their work, they identify patterns and communicate their findings with precision and thoroughness.
This knowledge is also relevant for memory institutions like web archives. It offers innovative insight into internet trends, including social media, and provides valuable input on relevant content to crawl in a given society, even touching on the elusive goal of representativeness. Such insights can also help web archives provide the best possible services when collaborating with researchers in scholarly dialogue, helping to make sense of web archives with their vast amounts of data and complexity, while serving as an overall inspiration towards best practices. There can also be value in sharing knowledge in order to obtain better content, as well as potential for collaboration on long-term preservation of data at risk of loss at companies, organizations, and research institutions including universities.
Our aim is to give a valuable account of web archiving history, of newer, bold approaches, and expectations and predictions for the future, focusing on lessons learned, best practices, and how we in the web archive community can help one another at large in the coming years.
Examples of the questions that will be discussed by the panelists:
- What are the best practices at your organization?
- How have best practices changed over time?
- What would you do differently if you could go back in time to the start of your initiative or earlier?
- What is the next evolution for web archiving?
- How do you see the future of browser-based-crawling?
- What impact does sustainability have on web archiving?
- Is API-based crawling real web archiving and does it matter?
- Will Social Media open their walled gardens in the future for web archives/others?
- Can anything be done to get Social Media companies involved in Cultural Heritage on our community’s terms?
- How can we expand the scope of web archiving to include additional countries?
- How can legal deposit institutions cooperate with other institutions like universities, companies, non-profit organizations, etc. to have data preserved for future generations and to build best practices for these kinds of collaborations?
PANEL 03: CROSS-INSTITUTIONAL COLLABORATION: THE END OF TERM ARCHIVE
Coordinating, Capturing, and Curating the 2024 United States End of Term Web Archive
Mark Phillips1, Sawood Alam2, James Jacobs3, Abbie Grotke4
1: University of North Texas, United States of America; 2: Internet Archive, United States of America; 3: Stanford University, United States of America; 4: Library of Congress, United States of America
The End of Term (EOT) Web Archive is composed of member institutions from across the United States who have come together every four years since 2008 to complete a large-scale crawl of the .gov and .mil domains to document the transition in the Executive Branch of the Federal Government in the United States. In years when a presidential transition does not occur, this effort serves as a systematic crawl of the .gov domain in what has become a longitudinal dataset of crawls. Since 2008, over 500 TB of data has been collected in total from the harvests in 2008, 2012, 2016, and 2020. Access to these collections is provided via the global Wayback Machine at the Internet Archive. In 2022 the EOT team worked to make these datasets available via the Amazon Web Services' Open Data Sponsorship Program to provide greater computational access to these archived web resources.
Now in its fifth iteration with the 2024 election, the team involved with the EOT Archive has learned much over the years that guides and shapes the activities carried out during fall 2024 and winter 2025. The project has evolved from an intense set of unknown activities in 2008 to a standard set of activities with a known timeline and the ability to leverage existing tools and workflows that have been proven in previous efforts.
This panel will present the efforts to preserve the United States Federal web as part of the presidential transition after the 2024 election. It will provide an overview of the process, the organization of the participating institutions, the scoping, nomination and crawling activities, and the dissemination activities of the project. This moderated panel will highlight the importance of these activities to the various members' home institutions as well as the overall value of this kind of activity to the United States. The panelists will focus on the specifics of the 2024 EOT project but also discuss lessons learned through this process that can be used by other organizations tasked with operating a large, multi-institutional web archiving project.
This panel is organized with a moderator and individuals from organizations who participated in the 2024 End of Term planning and operation. They will represent the project's activities and also highlight the importance of this effort to their local institutions wherever possible. The panel will be organized into the sections Organization, Scope and Nomination, Crawling, and Dissemination and Access. These will be introduced briefly to give the audience an overview, and a panel discussion will then allow for more in-depth exchange across the various subject areas.
WORKSHOPS
WORKSHOP 01: EXPLORING DILEMMAS IN THE ARCHIVING OF LEGACY WEBPORTALS: AN EXERCISE IN REFLECTIVE QUESTIONING
Exploring Dilemmas in the Archiving of Legacy Webportals: An Exercise in Reflective Questioning
Daniel Steinmeier, Sophie Ham
National Library of the Netherlands, Netherlands
Since 2023, the National Library of the Netherlands (KBNL) has been proud to curate a digital collection that has become UNESCO world heritage: the Digital City (De Digitale Stad, henceforth: DDS). Material belonging to this collection consists of an original freeze from 1996, as well as two student projects and miscellaneous material contributed by users and founders over the course of multiple events. The two student projects were the first attempts to revive the DDS portal and store it as a disk image. The two groups of students used different methods: one based on emulation, the other on migration. But what choices were made during restoration, and which version is more authentic? Furthermore, KBNL has several websites, scientific articles and newspaper clippings in its collections that might serve as context information. Do we consider this context information crucial for understanding DDS, or do we rather leave users to find these resources by themselves if they are interested?
As can be seen from this description, there is a lot of complexity when we consider archiving DDS and making it accessible to our users. We can think of a lot of difficult dilemmas when making decisions on what to archive and how to present it. Do we want users to experience how it is to create a homepage in DDS or do we want to present a historically correct picture of the homepages existing at the time? What should be considered part of the object and what part of the context? Is the migrated or the emulated version more authentic? What is more important, the privacy of the original users or providing full access to researchers? What do we consider belonging to DDS and what not? Only the HTML? Or also any news group material that might still be online but isn’t part of the archival material? Do users want a real authentic experience or rather a convenient way of viewing the content?
Even though DDS was a Dutch portal, it was based on software of the American Free-nets and inspired other cities in Europe and Asia. Therefore, we think this case might have a lot of recognizable features that also apply to the archiving of other legacy portals. Arguably, there are no right or wrong answers. They are typically dilemmas where multiple options have both benefits and drawbacks.
In our workshop we want to present a couple of these real-world dilemmas to participants to stimulate discussion based on the idea of opposing values. In web archiving and web archaeology, tough decisions sometimes have to be made. In the above description we can already perceive some opposing options, for instance whether to prioritize interactivity or historical accuracy. Another example would be the opposition between privacy and openness. How do we weigh these options in practice? What values are important to us and how do they interact? Through principles of reflective questioning and open dialogue we will try to create awareness of the idea of value prioritization as part of the decision-making process.
The idea is that we present a number of dilemmas, based on our collection material, for participants to discuss in groups. Participants may also choose an example that illustrates the same dilemma from their own collection. Each group has to choose a preferred solution and present their reasoning to the group. People are encouraged to explore the reasons for choosing one or the other, for instance by reflecting on their own organizational context or personal assumptions regarding digital preservation. We try to stay away from providing clear cut answers or guidance but rather provide participants with the opportunity to explore these questions together. Participants will learn how to ask the right questions to delve deeper into their own reasoning process during decision making, based on our method of reflective questioning. Participants should be able to apply this method and the cases presented to benefit their own curatorial decision-making process regarding legacy webportals in their own collections. For KBNL, the group discussions may provide important community input and food for thought on some of the decisions we are going to be making regarding DDS in the near future.
WORKSHOP 02: WEB ARCHIVE COLLECTIONS AS DATA
Web Archive Collections as Data
Gustavo Candela1, Chase Dooley2, Abbie Grotke2, Olga Holownia3, Jon Carlstedt Tønnessen4
1: University of Alicante, Spain; 2: Library of Congress, United States of America; 3: IIPC, United States of America; 4: National Library of Norway, Norway
GLAM institutions (Galleries, Libraries, Archives and Museums) have started to make their digital collections available in forms suitable for computational use, following the Collections as Data principles.1 The International GLAM Labs Community2 has explored innovative and creative ways to publish and reuse the content provided by cultural heritage institutions. As part of this work, and as a collaborative effort, a checklist3 was defined, focused on the publication of collections as data. The checklist provides a set of steps that can be used for creating and evaluating digital collections suitable for computational use. While web archiving institutions and initiatives have been providing access to their collections - ranging from sharing seedlists to derivatives to “cleaned” WARC files - there is currently no standardised checklist to prepare those collections for researchers.
This workshop aims to involve web archiving practitioners and researchers in reevaluating whether the GLAM Labs checklist can be adapted for web archive collections. The first part of the workshop will introduce the GLAM checklist, followed by two use cases that show how the web archiving teams have been working with their institutions’ Labs to prepare large data packages and corpora for researchers. In the second part of the workshop, we want to involve the audience in identifying the main challenges to implementing the GLAM checklist and determining which steps require modifications so that it can be used successfully for web archive collections.
First use case
The Library of Congress has been working to refine and improve workflows that enable creation and publishing of web archive data packages for computational research use. With a recently hired Senior Digital Collections Data Librarian, and working with our institution’s Labs, web archiving staff have prepared new data packages for web archive data in response to recent research requests. We will provide some background into this work and developments that led to the creation of the data librarian role, and will share details about how we are creating our data packages and sharing derivative datasets with researchers. Using a recent data package release, we will compare local practices in providing data to researchers with the GLAM checklist and talk through ways in which our institution does or does not comply.
Second use case
The National Library of Norway recently launched its first Web News Corpus, making more than 1.5 million texts from 268 news websites available for computational analysis through API. The aim is to facilitate text analysis at scale.4 This presentation will provide a brief description of “warc2corpus”, our workflow for turning WARCs into text corpora, aiming to satisfy the FAIR principles, while also taking immaterial rights into account.5 In this presentation, we will showcase how users can:
- Tailor research corpora based on keywords and various metadata
- Visualise general insights
- Exercise different types of ‘distant reading’, both with the Library Labs package for Python and with user-friendly web applications6
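The corpus-build workflow itself is documented in the repositories cited in the footnotes below; purely as a generic illustration of the WARC-to-text step, and not the Library's actual warc2corpus pipeline, a sketch using the warcio and BeautifulSoup libraries might look like this, with the input file name assumed:

```python
# Generic illustration of extracting plain text from HTML responses in a WARC file.
# This is not the National Library of Norway's warc2corpus pipeline.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

corpus = []  # list of (url, text) pairs

with open("news-sites-2023.warc.gz", "rb") as stream:   # assumed file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in content_type:
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
        corpus.append((url, text))

print(f"extracted {len(corpus)} documents")
```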
1 Padilla, T. (2017). “On a Collections as Data Imperative”. UC Santa Barbara. pp. 1–8
2 https://glamlabs.io/
3 Candela, G. et al. (2023), "A checklist to publish collections as data in GLAM institutions", Global Knowledge, Memory and Communication. https://doi.org/10.1108/GKMC-06-2023-0195
4 Tønnessen, J. (2024). “Web News Corpus”. National Library of Norway. https://www.nb.no/en/collection/web-archive/research/web-news-corpus/
5 Tønnessen J., Birkenes M., Bremnes T. (2024). “corpus-build”. GitHub. National Library of Norway. https://github.com/nlnwa/corpus-build; Birkenes M., Johnsen, L., Kåsen, A. (2023). “NB DH-LAB: a corpus infrastructure for social sciences and humanities computing.” CLARIN Annual Conference Proceedings.
6 “dhlab documentation”. National Library of Norway. https://dhlab.readthedocs.io/en/latest/
Datasheets for Web Archives Toolkit
Emily Maemura1, Helena Byrne2
1: University of Illinois Urbana-Champaign, United States of America; 2: British Library, United KingdomSignificant work in web archives scholarship has focused on addressing the description and provenance of collections and their data. One approach to the challenge of describing large datasets comes from the field of machine learning where Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a form of short document answering a standard set of questions arranged by stages of the data lifecycle.
This presentation reports back on the findings of a collaborative project to apply Datasheets for Datasets to web archive collections. It reviews the methodologies used to create the toolkit and the lessons learnt from its implementation. The Datasheets for Web Archives Toolkit was published in the British Library Research Repository. The toolkit provides information on the creation of datasheets for web archives datasets. It is composed of several parts, including templates, examples, and guidance documents. Applying the toolkit to published datasets from the UK Web Archive also informs the practicalities and resources required, as well as the scalability of these practices. The toolkit has documentation examples based on three collections:
- Indian Ocean Tsunami December 2004 - Collection Seed List (CSV/JSON)
- Blogs - Collection Seed List (CSV/JSON)
- UEFA Women's Euro England 2022 - Collection Seed List (CSV/JSON)
We conclude by reflecting on how datasheets contrast with more traditional documentation formats and metadata, as well as what documentation and shifts in data practices are needed in addition to the datasheet itself. We propose an agenda for future work exploring how the toolkit and datasheets framework can be extended to other kinds of datasets.
WORKSHOP 03: INTRODUCTION TO WEB GRAPHS
Introduction to Web Graphs
Sebastian Nagel, Pedro Ortiz Suarez, Thom Vaughan, Greg Lindahl
Common Crawl Foundation, United States of America
The workshop will begin with a brief introduction to the concept of the webgraph or hyperlink graph - a directed graph whose nodes correspond to web pages and whose edges correspond to hyperlinks from one web page to another. We will also look at aggregations of the page-level webgraph at the level of Internet hosts or pay-level domains. The host-level and domain-level graphs are at least an order of magnitude smaller than the original page-level graph, which makes them easier to study.
To represent and process webgraphs, we utilize the WebGraph framework, which was developed at the Laboratory of Web Algorithms (LAW) of the University of Milano. As a "framework for graph compression aimed at studying web graphs," it allows very large webgraphs to be stored and accessed efficiently. Even on a laptop computer, it's possible to store and explore a graph with 100 million nodes and more than 1 billion edges. The WebGraph framework is also used to compress other types of graphs, such as social network graphs or software dependency graphs. In addition, the framework and related software projects include tools for the analysis of web graphs and the computation of their statistical and topological properties. The WebGraph framework implements a number of graph algorithms, including PageRank and other centrality measures. It is an open-source Java project, but a re-implementation in the Rust language has recently been released. Over the past two decades, the WebGraph format has been widely used by researchers, for example those at LAW or Web Data Commons, to distribute graph dumps. It has also been used by open data initiatives, including the Common Crawl Foundation and the Software Heritage project.
The workshop focuses on interactive exploration of one of the precompiled and publicly available webgraphs. We look at graph properties and metrics, learn how to map node identifiers (just numbers) and node labels (URLs), and compute the shortest path between two nodes. We also show how to detect "cliques", i.e. densely connected subgraphs, or how to run PageRank and related centrality algorithms to rank the nodes of our graph. We share our experiments on how these applications are used for collection curation: how cliques can be used to discover sites with content in a regional language, how link spam is detected or how global domain ranks are used to select a representative sample of websites. Finally, we will build a small webgraph from scratch using crawl data.
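The WebGraph framework itself is a Java (and now Rust) codebase; simply to illustrate the kinds of operations the workshop covers, here is a small hedged sketch in Python using networkx on a toy host-level graph, with the host names invented for the example.

```python
# Toy host-level web graph illustrating the operations discussed in the workshop.
# Host names are invented; the WebGraph framework itself is a separate Java/Rust project.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("news.example", "example"),
    ("blog.example", "example"),
    ("example", "partner.example.org"),
    ("partner.example.org", "news.example"),
])

# Centrality: rank hosts by PageRank.
ranks = nx.pagerank(G)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))

# Shortest path between two nodes (following link direction).
print(nx.shortest_path(G, "blog.example", "partner.example.org"))

# Densely connected groups: cliques are defined on the undirected view of the graph.
print(list(nx.find_cliques(G.to_undirected())))
```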
Participants will learn how to explore webgraphs (even large ones) in an interactive way and learn how graphs can be used to curate collections. Basic programming skills and basic knowledge of the Java programming language are a plus but not required. Since this is an interactive workshop, attendees should bring their own laptops, preferably with the Java 11 (or higher) JDK and Maven installed. Nevertheless, it will be possible to follow the steps and explanations without having to type them into a laptop. We will provide download and installation instructions, as well as all teaching materials, prior to the workshop.
WORKSHOP 04: HOW TO DEVELOP A NEW BROWSERTRIX BEHAVIOR
How to Develop a New Browsertrix Behavior
Ilya Kreymer, Tessa Walsh
Webrecorder, United States of America
Behaviors are a key part of Browsertrix and Browsertrix Crawler, as they make it possible to automatically have the crawler browsers take certain actions on web pages to help capture important content. This tutorial will walk attendees through the process of creating a new behavior and using it with Browsertrix Crawler.
Browsertrix Crawler includes a suite of standard behaviors, including auto-scrolling pages, auto-playing videos, and capturing posts and comments on particular social media sites. By default, all of the standard set of behaviors are enabled for each crawl. Users have the ability to instead disable behaviors entirely or select only a subset of the standard set of behaviors to use on a crawl.
At times, users may need additional custom behaviors to navigate and interact with a site in specific ways automatically during crawling if they want the resulting web archive and replay to reflect the full experience of the live site. For instance, a new behavior could click on interactive buttons in a particular order, “drive” interactive components on a page, or open up posts sequentially on a new social media site and load comments.
This tutorial will walk through the process of creating a new behavior step by step, using the existing written tutorial for creating new behaviors on GitHub as a model. In addition to demonstrating how to write a behavior’s code (using JavaScript), the tutorial will also discuss how to know when a behavior is the appropriate solution for a given crawling problem, how to test behaviors during development, how to use custom behaviors with Browsertrix Crawler running locally in Docker, and finally how to use custom behaviors from the Browsertrix web interface (a feature that is currently planned and will be completed by the conference date).
Participants will not be expected to write any code or follow along on their own laptops in real time during the tutorial. The purpose is instead to demonstrate how one would approach developing a new behavior, lower the barrier to entry for developers and practitioners who may be interested in doing so, and to give attendees the opportunity to ask questions of Webrecorder developers in real time. We would additionally love to foster a conversation about how to develop a community library of available behaviors moving forward to make it easier than ever for users to find and use behaviors that meet their needs.
The tutorial will be led by Ilya Kreymer and Tessa Walsh, developers at Webrecorder with intimate knowledge of the Browsertrix ecosystem. The target audience is technically-minded web archiving practitioners and developers - in other words, people who could either themselves write new custom behaviors or communicate the salient points to developers at their institutions. Because this is not a hackathon-style workshop, the tutorial could have as many participants as the venue allows. By the conclusion of the tutorial, attendees should understand the concept of how Browsertrix Behaviors work, when developing a new behavior is a good solution to their problems, the steps involved in developing and testing a new behavior, and where to find additional resources to help them along the way. Our hope is to foster a decentralized community of practice around behaviors to the entire IIPC community’s benefit.
LIGHTNING TALKS
LIGHTNING TALK SESSION 01
STRATEGIES AND CHALLENGES IN THE PRESERVATION OF MEXICO’S WEB HERITAGE: FIRST STEPS
Carolina Silva Bretón
National Library of Mexico, Mexico
This paper discusses the challenges faced during the first stage of selection and awareness raising of what will be the Mexican Web Archive, through the National Library of Mexico.
We are developing selection strategies and policies, one objective of which is to bring together the isolated web archiving projects that have been carried out in Mexico and to collect web pages created in Mexico and by Mexicans. We apply a principle of inclusion for materials with historical, scientific or artistic value that represent different perspectives and approaches to a topic, enriching its understanding through their diversity. Materials that are ephemeral because of their temporality or frequent updating are also included. These groups comprise government, academic, museum, library and other sites with valuable information. In my role as web designer and digital conservator of the National Library, together with the Coordinator of the National Library, Dra. Martha Romero, we have decided NOT to limit collection to websites under the .mx domain, since many sites developed by Mexicans use other domains, such as .com and .net, because they are cheaper; these will therefore be considered as long as they are shown to bring cultural, scientific or testimonial value to the nation.
Through the multidisciplinary and inter-institutional research group "Preservation of Websites" of the Digital Preservation Group (GPD in Spanish), organized by the National Library, we analysed different selection criteria and held a workshop to learn about other criteria that were relevant for web developers and the general public. In this group we also worked on a tool to generate web preservation policies for companies and institutions, encouraging them to adopt this good practice.
This presentation will describe the decision-making process, the ethical problems, and the strategies developed during this stage of selection and awareness raising, a subject that is still little known in Mexico. These activities and the experience gained in this project can serve as a reference for the web archives of other national libraries, especially in Latin America.
CHALLENGES AND STRATEGIES IN IMPLEMENTING WEB ARCHIVING LEGISLATION IN BRAZIL
Jonas Ferrigolo Melo1, Moisés Rockembach2
1: University of Porto, Portugal; 2: University of Coimbra, Portugal
In the context of digital societies (Gomes, 2022) and the shift from paper to electronic records (Silva, 2022; Wiehl, 2020), web archiving and digital preservation are vital for maintaining information accessibility in the digital age. This affects many nations that use the internet intensively, such as Brazil, where 84% of households are now online, necessitating effective methods to preserve this growing digital heritage (CGI.br, 2023; Globalad, 2023). In 2023, the Brazilian National Archives Council (CONARQ) issued Resolution No. 52, establishing the Policy for the Preservation of Websites and Social Media, and Resolution No. 53, outlining the minimum preservation requirements. These resolutions serve as normative guidelines for public-sector practices and as references for private-sector professionals (Conarq, 2023a, 2023b; Terrada, 2022).
The introduction of a website preservation policy in Brazil opens the possibility of creating a culture of systematic preservation of these sources of information and social memory, and through this, “...achieving a sense of community, national identity, and rootedness among Brazilian citizens, in the sense that it will preserve information that, in a way, shapes national identity” (Melo, 2020, p. 21). However, the publication of legislation will not be sufficient if not accompanied by research into practical solutions for the effective implementation of web archives (Melo et al., 2023), as well as debates on theoretical issues, such as the selection process (Melo & Rockembach, 2024). Thus, this communication aims to explore the challenges and opportunities involved in implementing CONARQ Resolutions 52 and 53 in web archiving and digital preservation.
Implementing these resolutions faces complex challenges and significant opportunities to protect Brazil's digital heritage. Some strategies to address these challenges and promote the effective implementation of these resolutions include strengthening technical and professional capacity through continuous training and team updates and developing adequate technological infrastructure by investing in technology and using open standards. As an emerging area in Brazil, it is also possible to identify that creating partnerships and inter-institutional collaboration can be a pathway to the effective implementation of the resolutions.
Beyond technical and operational capacities, managers must be aware of the need to preserve this informational content from websites. It enables the development of a national web archiving strategy that unites efforts across various institutions and government levels. This strategy should feature clear guidelines, well-defined responsibilities, and sustainable funding mechanisms. Prioritizing preservation projects that ensure the integrity and accessibility of digital records over time, with particular attention to continuous technological updates and the mitigation of technological obsolescence, can guide these practices.
Therefore, implementing CONARQ Resolutions represents a crucial step for preserving Brazil's digital heritage. However, its success will depend on an integrated approach that combines technical training, robust infrastructure, and institutional awareness. Only through collaborative efforts, supported by a clear national strategy and the continuous commitment of public managers, will it be possible to meet the challenges and fully seize the opportunities these resolutions offer, thereby ensuring the preservation and long-term access to digital information essential to Brazil's memory and identity.
ARQUIVO.PT TOOLKIT FOR WEB ARCHIVING
Daniel Gomes
Arquivo.pt, Portugal
The Web is the largest and most widely used source of information. Although its digital objects become accessible to millions of people as soon as they are published, most of them are hosted solely at their original source and are at risk of being irremediably lost. Ready-to-use tools and services are therefore required to safeguard this invaluable digital legacy for future generations.
Arquivo.pt is a public digital preservation infrastructure that enables anyone to store, search and access historical digital objects preserved from the Web since the 1990s. It contains over 20 billion digital objects (1.3 PB) in multiple formats and languages, acquired from websites from all over the world. About half of Arquivo.pt users come from outside of Portugal.
The main objective of the Arquivo.pt toolkit for web archiving is to support the preservation of the born-digital information published online that governs modern societies, by providing freely accessible services to a broad range of users so that any Internet user can contribute to the digital preservation lifecycle of objects published online. As different users have different digital preservation needs, a comprehensive toolkit helps meet most requirements.
The Arquivo.pt toolkit for web archiving was officially launched in October 2023 after 15 years of iterative development. It is composed of 13 running tools/services listed at https://arquivo.pt/catalog to support the preservation of online digital objects from their acquisition to dissemination:
- Search and access (arquivo.pt)
- Application programming interfaces (arquivo.pt/api)
- Suggest websites (arquivo.pt/suggest)
- SavePageNow (arquivo.pt/savepagenow)
- Integration of historical web data collections (arquivo.pt/donate)
- Training (arquivo.pt/training)
- Open data (arquivo.pt/dadosabertos)
- CitationSaver (arquivo.pt/citationsaver)
- Arquivo404 (arquivo.pt/arquivo404)
- Memorial (arquivo.pt/memorial)
- High-quality archive (on-demand)
- Creation of collections and thematic exhibitions (arquivo.pt/expos)
- Itinerant exhibition of posters at external institutions (arquivo.pt/posters)
The Arquivo.pt toolkit for web archiving is an innovative and comprehensive set of services, available to anyone, to safeguard the digital legacy published online for future generations. This presentation will provide a glimpse of these tools and their usage.
TRACKING THE POLITICAL REPRESENTATIONS OF LIFE: METHODOLOGICAL CHALLENGES OF EXPLORING THE BNF WEB ARCHIVES
Guillaume Levrier1,2, Dorothée Benhamou-Suesser2
1: Centre de recherches politiques de Sciences Po (CEVIPOF, CNRS), France; 2: Bibliothèque nationale de France, France
The aim of this presentation is to give an account of how a researcher designed and implemented an open-source software solution to explore full-text indexed web archives. By sharing lessons learned from the points of view of both the practice of research and the practice of web archiving, we aim to foster result reproducibility in the IIPC community. We also hope that this opportunity to detail the chosen approach will make it a strategy for others to build on, adopt, or emulate.
How did people make biology a political matter 25 years ago? The time of the early internet was also the time of new biotechnological innovations, such as the sequencing of the human genome or the cloning of larger mammals. This new means of communication enabled citizens to peer into both scientific and democratic processes like never before. Information transited vertically, as both scientific articles and the minutes of the parliamentary debates were published and accessible online. But knowledge was also shared horizontally, with netizens able to debate these issues using various systems, be it through blogs, forums, or online newspapers comment sections.
Exploring what is left of that political past is still a daunting challenge. The French electoral web archives offer but a partial register of data captured with the available means of the time, which are often both too exhaustive and too shallow for the political scientist to manage. While much has been written on the theoretical problems of web archive historiography, the down-to-earth technical problem of making this matter technically and cognitively accessible to the average empirical social scientist remains open.
In this presentation, we introduce a new comprehensive pipeline to support empirical studies on web archives using specific features of the open-source software PANDORÆ, Zotero, and two ad-hoc open-source JavaScript notebooks. PANDORÆ allows users to perform precise queries on the Solr index of designated full-text indexed collections and offers several interfaces to explore, comment, and analyze the resulting corpora. It also addresses the need to query different digital resources from a single application, including scientometric databases such as Scopus or Web of Science, parliamentary debates, tweets, and other data sources.
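To make the kind of query PANDORÆ issues more concrete, the sketch below runs a standard Solr select query against a full-text web archive index. The endpoint, core name, field names and date range are hypothetical; only the Solr query parameters (q, fq, fl, rows, wt) are standard.

```python
import requests

# Hypothetical Solr endpoint and collection; the BnF index and its field
# names will differ. This only illustrates the shape of such a query.
SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"

params = {
    "q": 'content:"génome humain"',  # full-text query on an assumed "content" field
    "fq": "crawl_date:[2000-01-01T00:00:00Z TO 2005-12-31T23:59:59Z]",
    "fl": "url,crawl_date,title",
    "rows": 100,
    "wt": "json",
}
resp = requests.get(SOLR_SELECT, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"), doc.get("title"))
```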
This ongoing collaboration effort between a political scientist and the BnF web archiving team has been very valuable for all parties involved. The process ranged from dealing with legal issues to more technical ones, such as inserting the software in the BnF DataLab information system to make it part of the available research tools and services.
This presentation will be an opportunity both to present this new research system and to outline the two-year process that enabled it to emerge. We will then offer examples of what this research workflow is currently producing. We will conclude by opening the discussion on how it could serve the interests of our IIPC partners, including by being plugged into the IIPC collaborative collections.
COLLABORATIVE CURATORIAL APPROACHES OF THE CZECH WEB ARCHIVE USING THE EXAMPLE OF THEMATIC LITERARY COLLECTIONS
Marie Haškovcová
National Library of the Czech Republic, Czech Republic
Thematic collections are one of the three acquisition lines of the collection policy of Webarchiv, the Czech web archive of the National Library of the Czech Republic. Webarchiv treats them as datasets for further research and occasionally involves other institutions, experts or colleagues from other web archives in their creation. Specifically, colleagues from the national library web archives in France, Germany, Austria and Israel have contributed to the collections dedicated to Milan Kundera and Franz Kafka. Mentions of the topic on social media are also included in the collections, and Webarchiv is now experimenting with automated acquisition procedures. It thus creates a thematic dataset combining two curatorial approaches: sources that are carefully selected and described by the curators, and a large volume of sources that were scraped automatically according to set parameters documented in metadata (keywords, etc.) and that contain unverified information.
Users are provided with a description of how the collection was built and what data it contains (which resources were suggested as part of the curatorial selection and which were obtained automatically and based on what parameters). Users can also suggest additional resources to add to the collection.
Webarchiv has been cooperating with the Institute for Czech Literature of the Academy of Sciences of the Czech Republic for several years and participated in the research project Czech Literary Internet. One of the outputs is an extensive bibliographic database of Czech literature on the Internet including links to the Webarchiv. The database can now be also used as another valuable resource for the creation of literary datasets and their research.
LIGHTNING TALK SESSION 02
MODELLING ARCHIVED WEB OBJECTS AS SEMANTIC ENTITIES TO MANAGE CONTEXTUAL AND VERSIONING ISSUES
Tom Storrar1, Manuela Pallotto Strickland2
1: The National Archives (UK), United Kingdom; 2: King's College London, United Kingdom
Standard digital preservation reference models such as OAIS and LOCKSS conceive preservation of born-digital artefacts either as the creation of substitute representations of original 'Information Objects' which have undergone a series of transformations (OAIS), or as the provision of multiple, identical copies of an original bitstream (i.e., 'Data Replicas' in LOCKSS). Within these reference frameworks, data objects to be preserved are regarded as ‘self-contained’ and (insofar as they are endowed with contextual information) ‘self-describing’ Information Objects.
Such a general conceptual framework underpins most current web archiving practices and leads to modelling archived representation of websites/webpages as static replicas of originally dynamic information objects which can be replayed in a web browser. A technological corollary of these practices is the use of standard formats such as WARC, implemented to capture/archive dynamic web content and replay it in a browser-like execution environment.
To manage the capturing of the changes that affect live websites and webpages, and thus resolve the problem of representing a dynamic digital object as a static and self-contained data structure ready to be preserved, web archiving practices lead to the creation of a dataset made of static snapshots of the original web data object that needs archiving. Each captured snapshot can be considered a static, self-contained data object in itself, providing a representation of ‘a’ timestamped version of the archived website or webpage. The dataset that represents the live, dynamic archived web object as a unity and a whole, however, remains bound to change further, depending on the changes that the live web object undergoes; such a dataset should also be regarded as a dynamic data object. As each snapshot (re)captures the totality of the archived web data object, important issues of duplication arise over time, which make versioning unsustainable.
We conducted an investigation over instances of born-digital content archived in the UK Government Web Archive (UKGWA). The findings led us to discuss the need for a novel conceptual framework that, by recognizing that web data objects cannot be represented as self-contained information objects, takes web archiving practices beyond the scope of building replicas of original data objects. In such a conceptual framework, where archived websites/webpages are modelled as Semantic objects, the Resource Description Framework can be leveraged to create a model that, by enabling the definition in a graph of the position of a resource on the web, can address issues related to identity, decontextualization, redundancy of web objects, commonly manifest in current web archives’ versioning and timelining practices.
In this lightning talk, we present such a data model with the ontology layering that describes it and discuss how this model could support more sustainable versioning practices.
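As a rough illustration of the direction described (and not the authors' actual ontology), the sketch below uses rdflib to relate one timestamped capture to the live resource it captures, so that versions become first-class, linkable entities rather than self-contained replicas. All namespace terms are invented for the example.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Invented namespace; the UKGWA ontology layering is not reproduced here.
WA = Namespace("http://example.org/webarchive#")

g = Graph()
g.bind("wa", WA)

page = URIRef("https://www.gov.uk/some-page")                   # live web resource
capture = URIRef("http://example.org/capture/20200401120000")   # one timestamped version

g.add((page, RDF.type, WA.WebResource))
g.add((capture, RDF.type, WA.Capture))
g.add((capture, WA.captureOf, page))  # relate the version to the resource it captures
g.add((capture, WA.capturedAt,
       Literal("2020-04-01T12:00:00Z", datatype=XSD.dateTime)))

# rdflib 6+ returns a string from serialize()
print(g.serialize(format="turtle"))
```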
MODERNIZING WEB ARCHIVES: THE BUMPY ROAD TOWARDS A GENERAL ARC2WARC CONVERSION TOOL
Pedro Ortiz Suarez, Sebastian Nagel, Thom Vaughan
Common Crawl Foundation, United States of America
Since its introduction in 2009, the WARC (Web ARChive) file format has become a standard for storing large web data collections. Accordingly, many tools, pipelines and software packages have been developed to transform, annotate, index, read, extract and process data encoded in WARC files. However, WARC was actually developed as an extension of the ARC file format, first introduced by the Internet Archive in 1996 and traditionally used by multiple web archiving organizations around the world to store web crawls. The first version of WARC was proposed in November 2008, allowing for more flexibility, permitting later-date transformations, and giving the option of accommodating more metadata and secondary content.
However, the WARC file format was purposely made different from the ARC file format so that existing tooling could effectively differentiate between them. This in turn created a situation where much of the modern tooling and software is no longer compatible with ARC files, making older web archives far more difficult to process, analyze and explore.
In this talk we share our experience converting Common Crawl's older ARC archives to WARC. We first describe how we patched the existing WARCIO streaming library so that it can convert the existing Common Crawl ARC archives to WARC, and then we discuss the different challenges we encountered that caused the conversion to fail, such as invalid URIs causing SURT canonicalization to fail, white spaces in URLs breaking the ARC headers, and wrong content-lengths in the ARC headers causing parsing errors. We also briefly compare our patch to other existing solutions, such as jwarc, for our particular use case.
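For readers who want to try a similar conversion, the sketch below shows the unpatched starting point: warcio's ArchiveIterator can already rewrite ARC records as WARC records on the fly via its arc2warc option. The talk's patches deal with the malformed records that make this naive loop fail; the file names here are placeholders.

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

# Naive ARC -> WARC conversion; real-world ARC files may also need handling
# for invalid URIs, spaces in URLs and wrong content-lengths, as discussed above.
with open("example.arc.gz", "rb") as arc_in, open("example.warc.gz", "wb") as warc_out:
    writer = WARCWriter(warc_out, gzip=True)
    for record in ArchiveIterator(arc_in, arc2warc=True):
        writer.write_record(record)
```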
We hope that by discussing our experiences in this conversion, web archivists and other web archiving organizations can effectively avoid and address some of the pitfalls that we encountered.
POKING AROUND IN PODCAST PRESERVATION
Jasper Snoeren
Netherlands Institute for Sound and Vision, Netherlands
Slowly but steadily, the podcast has grown into a dominant form of media. It is the way in which millions of people consume news, politics, entertainment and gossip on a daily basis. Since 2021 we have therefore been actively working on preserving these audio stories, created both by media professionals and by hobbyists with a microphone on the kitchen table.
In this talk, we share key insights into how we have been archiving and preserving podcasts from the Netherlands for over four years. We explain why, after studying the distribution models of podcasts, we decided to ignore playback platforms like Apple Music or Spotify and to make use of a podcast RSS aggregator service instead. Using the Listennotes API, our script allows us to automatically gather podcasts in MP3 format together with any descriptive metadata included in the RSS feed by the podcast creators. Simply adding new shows to a playlist enables us to collect the latest episodes on a weekly basis. As we walk you through our method, we go in depth on how we enrich episodes with additional metadata and make the MP3 files accessible to users in our archive. We explain our selection process, which uses license agreements with creators, and how we are trying to capture as wide a vertical slice as possible of the Dutch podcasting landscape.
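As a hedged sketch of the general approach (not the presenters' actual script, which works from an aggregator API), the snippet below pulls one show's RSS feed with feedparser and downloads the audio enclosures together with the item-level metadata. The feed URL is a placeholder.

```python
import feedparser          # pip install feedparser
import requests
from pathlib import Path

FEED_URL = "https://example.com/podcast/feed.xml"  # placeholder feed
OUT_DIR = Path("podcast_archive")
OUT_DIR.mkdir(exist_ok=True)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Keep the descriptive metadata the creator put in the RSS item.
    meta = {
        "title": entry.get("title", ""),
        "published": entry.get("published", ""),
        "summary": entry.get("summary", ""),
    }
    for enclosure in entry.get("enclosures", []):
        if enclosure.get("type", "").startswith("audio/"):
            audio_url = enclosure["href"]
            filename = OUT_DIR / audio_url.split("/")[-1].split("?")[0]
            with requests.get(audio_url, stream=True, timeout=60) as resp:
                resp.raise_for_status()
                with open(filename, "wb") as fh:
                    for chunk in resp.iter_content(chunk_size=1 << 16):
                        fh.write(chunk)
            print(f"saved {filename} :: {meta['title']}")
```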
Finally, we address paywall-related challenges that have become more frequent and that we are still struggling with. This talk provides pointers that will allow anyone to get a grasp on how to preserve podcasts and make sure these stories can be told for generations to come.
AUTOMATIC CLUSTERING OF DOMAINS BY INDUSTRY FOR EFFECTIVE CURATION
Thomas Smedebøl
Royal Danish Library, Denmark
When archiving 1.4 million .dk domains, we need practical tools to curate them by clusters. We have observed that, for example, hairdressers, massage therapists, and physiotherapists often have crawler traps around their booking systems. The same applies to hotels.
Takeaway restaurants often have crawl traps around their ordering systems. Car dealerships have general issues with their used car databases, and online shops tend to pose great difficulties around the sorting of the offered products. Each industry seems to have their own set of specifics we should take into account when curating the archiving of their domains.
It would be useful to analyse and manage these domains by industry. In Denmark, all companies have a CVR number that identifies them. This number must be displayed on their website. In the central business register, the company's industry is listed. By scraping all domains for the company’s CVR number, a connection can be established between domains and industries, and we can quickly generate a list of domains within museums, churches, dentists, water utilities, and all other industries in the register.
All it takes is good planning, a lot of scraping, access to the central business register's database, and a database of our own.
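A minimal sketch of the scraping step might look like the following; the CVR pattern is simplified and the CVR-to-industry lookup table stands in for the central business register.

```python
import re
import requests

# Danish CVR numbers are 8 digits; sites often print them as "CVR: 12345678"
# or "CVR-nr. 12 34 56 78". This pattern is a simplification.
CVR_PATTERN = re.compile(r"CVR(?:-nr\.?)?\s*:?\s*((?:\d\s*){8})", re.IGNORECASE)

# Hypothetical lookup; in practice this mapping comes from the business register.
CVR_TO_INDUSTRY = {"12345678": "hairdressing"}

def industry_for_domain(domain: str):
    """Fetch a domain's front page and map its CVR number to an industry."""
    try:
        resp = requests.get(f"https://{domain}/", timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    match = CVR_PATTERN.search(resp.text)
    if not match:
        return None
    cvr = re.sub(r"\s", "", match.group(1))
    return CVR_TO_INDUSTRY.get(cvr)

print(industry_for_domain("example.dk"))
```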
By working with industries as a starting point, we can improve our insight, quickly manage large volumes, and spend focused time on special cases. We can also offer researchers a unique register of segmented domains.
BEST PRACTICE OF PRESERVING POSTS FROM SOCIAL MEDIA FEEDS
Magdalena Sjödahl
Arkiwera wcrify AB, Sweden
This talk will provide insights into our work, since 2017, of preserving posts from different social media platforms. Over this time, our team has handled the technical, legal and ethical aspects of helping our customers harvest this information.
The importance of preserving information from social media platforms is more relevant than ever. For example, officials all over the world announce the latest news on X (formerly Twitter) and other platforms. Lots of people get almost all their news from different social media feeds. This however also creates an arena for propaganda, “fake news” and other harmful content that is spread and read by millions of people each day.
This 5-minute lightning talk will focus on the methods we use to capture content on different social media platforms. We will talk about how we adapt our solution to changes on the platforms – whether those changes result from regulatory updates (such as the GDPR or the legacy of the “Cambridge Analytica scandal” in 2018), structural or ownership changes (like when Elon Musk bought Twitter), or technical changes (such as new functionalities on the platforms).
Some of the harvesting challenges that we will also address in our presentation are how to capture information that is constantly changing and affected by different events on the platforms. We will talk about how to structure and present the collected information so that it is preserved for the future and remains reliable for future researchers. Closely related to these challenges is, of course, the question of to whom the archived information should be available and who should be responsible for collecting (and therefore also storing) it.
LIGHTNING TALK SESSION 03
CHANGE DETECTION OVER A LARGE NUMBER OF URLS
Hassan Feissali
UK Government Web Archive, United Kingdom
Effective change detection and targeted crawling allow us to capture only what has changed, prevent duplication, and produce web archives that are more cost-effective and more ecologically sustainable.
Change detection using ‘Screaming Frog SEO Spider’ and crawling the changes using ‘Browsertrix Crawler’ is a technique that we developed at the UK Government Web Archive as part of a larger project aimed at capturing web content related to the 2024 UK General Election.
‘Screaming Frog SEO Spider' is a proprietary crawler that is both user-friendly and offers many useful features, such as crawl comparison, while 'Browsertrix Crawler’ is an open-source crawler that is fairly easy to configure in comparison to other crawlers. Both tools produce very good results and can be set up fairly quickly in response to important events that require constant monitoring of a large body of web pages, such as a general election.
Our change detection method uses Screaming Frog SEO Spider to perform daily scans of hundreds of web pages and detect changes by comparing each scan to the previous one. This comparison highlights changes in the body of a page, such as its word count, its H1 heading, and its page title.
When a page was identified as having changed, its URL was used as a seed for a shallow crawl using Browsertrix Crawler. This approach allowed us to efficiently monitor a large number of web pages and initiate crawls as needed. This method eliminated the need for crawling 800 election-related websites every day, and instead we crawled only a very small number of web pages after each scan.
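A minimal sketch of this comparison step, assuming two daily Screaming Frog exports saved as CSV, is shown below; the column names are assumptions to adjust to the actual export, and the output is a plain seed list for a shallow Browsertrix Crawler run.

```python
import csv

def load_scan(path, url_col="Address", fields=("Word Count", "Title 1", "H1-1")):
    """Load a Screaming Frog export into {url: (tracked field values)}.
    Column names are assumptions; adjust them to match your export."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return {row[url_col]: tuple(row.get(f, "") for f in fields)
                for row in csv.DictReader(fh)}

yesterday = load_scan("scan_day1.csv")
today = load_scan("scan_day2.csv")

# URLs whose word count, title or H1 changed between the two scans,
# plus any URLs appearing for the first time.
changed = [url for url, values in today.items()
           if url not in yesterday or yesterday[url] != values]

# One seed per line for a shallow Browsertrix Crawler crawl.
with open("seeds.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(changed))
```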
In conclusion, thanks to this method, we were able to capture only the pages that had changed, without producing unnecessary duplicates in the web archive.
THE PRACTICE OF WEB ARCHIVING STATISTICS AND QUALITY EVALUATION BASED ON THE LOCALIZATION OF ISO/TR 14873:2013(E): A CASE STUDY OF THE NSL-WEBARCHIVE PLATFORM
Zhenxin Wu1, Jiali Zhu2,3, Jiying Hu1
1: National Science Library, Chinese Academy of Sciences, China; 2: Zhejiang Economic & Information Center, China; 3: Zhejiang Economic & Information Development Co., Ltd, China
ISO/TR 14873:2013(E) is an international technical report developed in response to a worldwide demand for guidelines on the management and evaluation of Web archiving activities and products. The working group has adopted this document through an equivalent translation method to create a national standardization technical document applicable to China. This study first outlines the adjustments made to adapt the international standard for national implementation. It then presents the practical application of the localized standard through a case study of the Statistical and Quality Evaluation of Archived Resources on the NSL-WebArchive Platform. The NSL-WebArchive platform, developed by the National Science Library (NSL) of the Chinese Academy of Sciences, is designed to automate the collection of information from major international research institutions.
As of 2023, the platform has archived over 2,000 sites, with a total of 47 terabytes of data and more than 275 million web pages. The study employs statistical indicators to assess the development, characterization, usage, preservation, and archiving costs of the resource collection, providing a basis for further analysis and interpretation. Additionally, quality metrics are applied to evaluate the current archiving performance from four perspectives: management, collection process quality, accessibility and usage, and preservation. These evaluations aim to further assess the platform’s ability to meet management requirements. Based on the evaluation results, strategies for improving the platform's archiving performance are proposed. Regular evaluation and improvement of web archiving performance should be integrated into the workflow of archiving institutions. This study can serve as a reference for the statistical and quality evaluation of institutional web archiving efforts.
ARQUIVO.PT QUERY LOGS
Pedro Gomes, Daniel Gomes
Arquivo.pt, Portugal
Analyzing user behavior is an important research topic for understanding users’ information needs and enhancing the quality of search results. When a user interacts with a search engine, the system records the user’s actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users for the historical information published online over time.
Arquivo.pt is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens; our mission is to provide digital infrastructures that support the academic and scientific community. However, until now, Arquivo.pt has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community.
Arquivo.pt provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Notice that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.
Query log datasets are a valuable resource for researchers and practitioners aiming to improve information retrieval systems, enhance user experience, and develop more accurate algorithms in areas such as search engine optimization, natural language processing, and recommendation systems. Our main objective is to provide a publicly accessible query log dataset to support researchers. We will demonstrate how access to well-structured query logs has led to significant advancements in understanding user intent, optimizing search algorithms, and driving innovation in AI-based applications. This also makes it possible to compare users’ search behavior between live-web and web-archive search engines.
MODIFYING EPADD FOR ENTITY EXTRACTION IN NON-ENGLISH LANGUAGES
Pierre Beauguitte, Tita Enstad
National Library of Norway, Norway
The preservation and accessibility of email archives have become increasingly important for historical research, organizational records, and personal documentation. However, many existing tools are tailored for English-language content, limiting their utility globally. In this presentation, we talk about our recent advancements in enhancing ePADD, an open-source email archiving tool, by modifying its lexical search and entity extraction pipeline to accommodate a non-English language.
We collaborated with a national library's department for private archives and their digital preservation team with a case study on a specific collection of email correspondence between a theatre director and an award-winning author. Through our case study, we demonstrate how the integration of suitable language models improves the overall usability of the email archives by facilitating precise entity recognition and keyword search. We also address the challenges encountered during this process.
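As a hedged illustration of the underlying idea (swapping in a language-appropriate model), and not of the ePADD modifications themselves, the snippet below runs named entity recognition on Norwegian text with spaCy's Bokmål pipeline; the example sentence is invented.

```python
import spacy

# Requires the Norwegian Bokmål model:
#   python -m spacy download nb_core_news_sm
# This is not ePADD's pipeline, only a sketch of language-specific NER.
nlp = spacy.load("nb_core_news_sm")

text = "Jon Fosse møtte teatersjefen i Oslo i januar."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
```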
Today's version of ePADD operates as a monolithic system, and as such it lacks flexibility regarding non-English languages and newly available technologies. We envision a more modular architecture that allows components to be easily swapped out, accommodating the rapid evolution of language technology. Although our current work focuses on a single language, we lay the groundwork for extending this framework to support additional languages.
We will conclude by addressing two challenges that we see as crucial towards a complete email archiving solution. First, an ideal email archiving solution would support multiple languages within a single archive. Additionally, ensuring this new, modular architecture for ePADD remains accessible and easy to use is crucial to ensure that archivists of all technological backgrounds can fully benefit from its features. Achieving this remains a necessary objective for the web archiving community's ongoing development efforts.
LIGHTNING TALK SESSION 04
COLLABORATIVE COLLECTIONS AT ARQUIVO.PT: FOUR YEARS OF RECORDINGS FROM THE CITY OF SINES (PORTUGAL)
In 2020, at the start of the pandemic, the Sines Municipal Archive, in a small town on the coast of Portugal, began recording web content relating to the region, in partnership with two local radio stations and Arquivo.pt. Since then, archivists have been identifying and recording content and sending a copy of the WARC files to be integrated into Arquivo.pt. This collaboration has been ongoing for four years and has tested Arquivo.pt's ability to interact with the community, providing support and training and integrating their contributions.
In this presentation we briefly describe: 1) how the local collections were carried out; 2) how the recorded content was integrated into Arquivo.pt; 3) the results obtained from the collaboration; 4) lessons learnt.
1) At the Sines Municipal Archive, recording was done manually by archivists using the Webrecorder tools. In 2020, the webrecorder.io desktop app was still in use, as was the Conifer service, from which they exported the WARCs to local storage. More recently, in 2024, they started using ArchiveWeb.page. Four people took part in this project: two archivists and two external collaborators who scour the media and identify pages of local interest to be recorded. The recordings have been kept, as far as possible, to a monthly schedule. Recording web content was an extra effort for the Sines Municipal Archive. We can see from the dates of the captures that there was a personal commitment on the part of the archivists: they recorded content beyond their normal working hours. It was a time-consuming job, considering that manual recording is done page by page, one at a time.
2) The WARC files were stored locally at the Sines Municipal Archive and a copy was sent to the digital curator at Arquivo.pt. The recordings were checked and placed in a type of collection called RAQ (High Quality Collection). It includes donations, recordings made by the community, recordings from the Arquivo.pt SavePageNow service and collections made on-demand. All these small recordings are integrated and can be accessed at Arquivo.pt.
3) As a result of this collaboration, a public training session was held in 2022, the project has been mentioned at events with archivists as an exemplary case of innovative practice, and the digital curator provides ongoing training support. Between 2020 and 2024, 1,800 WARC files (340 gigabytes of information) were uploaded and made available at Arquivo.pt.
4) From this collaboration with the Sines Municipal Archive, we learnt that collaboration with the community is very important. Content input from the community improves the collections and the coverage of local or thematic content. In return, web archives can prove useful to the community. Printing a web page in PDF format can be useful for a traditional archive, but it's much better to rely on a web archive to reproduce a WARC file. All of this takes extra effort on both sides and needs time to consolidate. Soon, archivists will know how to use tools like browsertrix-crawler for more efficient collections. However, the main question is: Are web archives prepared to integrate community contributions?
PARTICIPATORY WEB ARCHIVING: THE TENSIONS BETWEEN THE INSTRUMENTAL BENEFITS AND DEMOCRATIC VALUE
A MINIMAL COMPUTING APPROACH FOR WEB ARCHIVE RESEARCH
Alan Colin-Arce1, Rosario Rogel-Salazar2
1: University of Victoria, Canada; 2: Universidad Autónoma del Estado de México, Mexico
Web archives offer huge amounts of data for research in several disciplines, but they have high technical barriers to entry that prevent their use in research. However, some research questions require smaller datasets from a subset of a web archive rather than the full text of thousands of archived websites. In these cases, a minimal computing approach can help reduce the technical and infrastructural barriers to research with web archives.
Minimal computing challenges the idea that innovation is defined by newness, scale, or scope. Instead, it connotes digital humanities work undertaken in the context of constraints in access to hardware, software, network capacity, or technical education (Risam & Gil, 2022). It also seeks to reduce the need for substantial storage and processing power (Sayers 2017). As web archives increase in size, the number of institutions with the human, financial, and technical resources to conduct research with them is likely to decrease, especially for institutions in the Global South, where access to computing technology is more limited. Minimal computing is a way to prevent this exclusion by encouraging the use of only the necessary technologies to develop digital scholarship in these constrained environments (Risam & Gil, 2022).
As an example of a minimal computing approach to research with web archives, we will present the process of analyzing a sample of the archived PDFs on Latin American feminism in the Human Rights collection developed by Columbia University Libraries, which had over 16 TB of data in 2023.
To obtain data from the collection, we used the Archives Research Compute Hub (ARCH), which we accessed through a free trial thanks to the support of the Internet Archive. ARCH is a subscription service developed by an American organization and does not itself follow minimal computing principles, but such “maximum” computing tools can be used to serve the minimal computing ends of inclusiveness and participation (McGrail 2017), especially in environments with material constraints.
We used ARCH to obtain the metadata of all the unique PDFs in the collection (812,042 in total) and filtered them by searching a set of keywords in the URLs and filenames of the archived PDFs, which gave us a sample of 10,296 documents about Latin American feminism. While ARCH also allows obtaining the full text of all the captures in a collection, our specific interest in Latin American feminism, our intermediate programming skills, and our limited access to computing power and storage forced us to think creatively about ways to analyze a subset of the Human Rights collection. Working with a sample overcame these constraints and allowed us to conduct a topic modeling analysis to explore the main topics that Latin American feminist organizations were addressing on their websites.
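A minimal sketch of this keyword filter, assuming the ARCH PDF derivative is exported as a CSV with url and filename columns (the column names and keyword list are illustrative, not the study's actual ones):

```python
import csv

# Illustrative keywords; the real study used a longer, curated list.
KEYWORDS = ["feminis", "mujer", "genero", "género", "mulher"]

sample = []
with open("pdf_information.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        # Assumed column names; adjust to the actual ARCH export.
        haystack = (row.get("url", "") + " " + row.get("filename", "")).lower()
        if any(keyword in haystack for keyword in KEYWORDS):
            sample.append(row)

print(f"{len(sample)} PDFs matched the keyword filter")
```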
While previous studies have analyzed archived PDFs (Phillips & Murray, 2013), they did not follow a minimal computing approach that sought to overcome technical and infrastructural limitations. We argue that minimal computing approaches can lower the barriers to web archive research and encourage creative ways of using web archives.
WHERE FASHION MEETS SCIENCE: COLLECTING AND CURATING A CREATIVE WEB ARCHIVE
Elisabeth Thurlow
University of the Arts London, United Kingdom
Working closely with the donors, we have been collecting the websites of the Helen Storey Foundation Archive, which charts the 30-year history and development of a former not-for-profit arts organisation promoting creativity and innovation.
The Archive includes design illustrations, correspondence, press and publicity material, material samples, and finished garments, as well as project websites created by the Foundation and their project partners. Like the wider collection, the websites themselves chart a succession of innovative projects which bring together the worlds of arts and science, using fashion to explore complex global issues.
However, the agreement to collect the archive and its web-based content was granted prior to the development of any formal in-house web archiving programme. This presented significant challenges for staff tasked with meeting the requirements of the donation agreement. Although we had recently developed a digital preservation programme, the programme, policies and workflows focused on our existing digital archives and collections. It had not been extended to include robust consideration for web archives.
Harnessing the ability of the Archive to educate and inspire, we developed a web archiving project in response to the demands of the donation. We trialled new approaches, including experimenting with the use of open-source tools such as Conifer and ReplayWeb.page, to build the collection of websites to sit alongside the related physical materials.
The theme of collaboration is present in all the prominent projects represented in the Archive. The work of the Foundation often involved partnering with scientists, to explore issues such as climate change and the role of plastics in fashion. Sharing that same ethos, we looked for ways of collaborating with the donors, to bring them along with us on the learning journey, as we developed our web archiving approach. We celebrated milestones with stakeholders, whilst also sharing honest insights into the curatorial challenges of collecting and making accessible both legacy and contemporary websites, such as the loss of access to sites during capture stages or limited capture of dynamic content. We shared the project with them through introductions to web archiving practices, live demos of the tools in use, and feedback sessions.
The Archive itself is a source of inspiration for the next generation of fashion creatives, and we hope our project response will inspire future web archiving projects and further considerations of ways to engage internal and external stakeholders in the creation and curation of web archives.
WHAT YOU SEE NO ONE SAW
Mat Kelly1, Alex H. Poole1, Michele C. Weigle2, Michael L. Nelson2, Travis Reid2, Christopher B. Rauch1, Hyung Wook Choi1
1: Drexel University, United States of America; 2: Old Dominion University, United States of America
When we explore web archives, we expect a faithful window into the past. Yet, what we often encounter is a distorted reflection. These archives are not true to the dynamic web we once navigated but are instead static snapshots—fragments captured by automated crawlers and headless browsers. Designed for preservation, these methods fail to replicate the rich, personalized experiences that define our daily web interactions. What we see in these archives is a facade, a simplified version of the vibrant, evolving web. This happens because web archives primarily capture static elements, overlooking the personalized, algorithm-driven content each user encounters—content shaped by browsing history, preferences, location, and timing.
In today’s web, no two users are exposed to the exact same content, even when visiting the same site. Algorithms continuously adjust content based on user behavior, personalizing everything from search results to advertisements. The more we engage with the web, the more it reflects our digital identities. This inherent personalization is a key aspect of modern web experiences, yet traditional archiving methods treat the web as a static entity, failing to capture these dynamic, individualized interactions.
This presentation explores a technical and human-centered approach to better preserve the web by focusing on personas—archetypes of web users with distinct behaviors, preferences, and interactions. By simulating these personas, we aim to capture the diversity of web experiences, particularly in relation to web-based advertisements. Advertisements, which are often personalized and ephemeral, play a critical role in shaping the modern digital landscape. Understanding how different users experience these ads over time allows us to surface the subtle ways personalization shapes the web.
Through this persona-based approach, we challenge the traditional notion of a “canonical” web user. The idea that any single web experience can represent all users is a myth. Each person’s interaction with the web is unique, and any attempt to comprehensively archive the web using static snapshots is inherently limited. By embracing web personalization, we can push beyond these limitations and begin to preserve a more authentic version of the web. This not only offers a truer glimpse into the digital past but also opens new possibilities for how we understand web history, user experiences, and the evolving nature of personalization.
POSTERS
POSTER SESSION
‘WE ARE NOW ENTERING THE PRE-ELECTION PERIOD’: EXPERIMENTAL TWITTER CAPTURE AT THE NATIONAL ARCHIVES
As the web archiving community is well aware, changes to the Twitter/X API have made archiving this important content increasingly challenging, while Facebook and Instagram have always been notoriously difficult to archive. Yet the content posted on these services remains of critical importance and is perhaps uniquely vulnerable during times of political transition. In light of the unexpectedly early announcement of the 2024 UK General Election, the UK Web Archive embarked on a period of rapid experimentation and deployment of alternative social media capture methods. This poster and lightning talk will explore the results of these experiments, focusing on our capture of Twitter/X.
The primary capture method chosen for Twitter/X was gallery-dl, a command-line application originally developed for the client-side downloading of image galleries but which also supports the capture of text-based tweets in JSON format. Gallery-dl was deployed in-house on an Ubuntu-based EC2 instance running in AWS. High-priority accounts belonging to government departments and cabinet ministers were identified and targeted for capture during the election period, and retrospectively back to July 2023, when capture via the Twitter API became impossible. A simple Python script was used to quickly generate bash scripts for launching the application with the appropriate parameters to capture each account in turn. Downloaded content was scanned for malware and synced to an S3 bucket. Since the election we have maintained a program of weekly capture of priority accounts, and at the time of writing some 100,000 digital objects have been captured from almost 60,000 tweets.
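A minimal sketch of such a script generator is shown below; the account handles, file names and the choice of gallery-dl options are illustrative rather than the exact production setup.

```python
# Generate one gallery-dl invocation per account, written to a bash script.
accounts = ["10DowningStreet", "CabinetOfficeUK"]  # illustrative handles

with open("capture.sh", "w", encoding="utf-8") as fh:
    fh.write("#!/bin/bash\nset -e\n")
    for account in accounts:
        url = f"https://twitter.com/{account}"
        # --write-metadata stores each tweet's JSON alongside downloaded media.
        fh.write(f'gallery-dl --write-metadata "{url}"\n')
```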
Quality Assurance of such a large collection presents a challenge. Alongside manual spot-checks, we have made use of data from the Ahrefs SEO tool. By comparing extracted outlinks from Ahrefs crawl data with our captured content we have been able to identify gaps in the collection, which can then be patched with targeted captures from gallery-dl. Work is continuing on alternative approaches as content captured by Ahrefs only represents a sample of the total tweets, and we have found that gallery-dl sometimes struggles with threaded content in particular. Another approach has been to look at the number of tweets captured over time, in order to check for gaps. This has provided additional assurance as well as yielding interesting insights into the tweeting behaviour of different groups (for example departments tweeted less during the pre-election period, while ministers tweeted more). Further work is also required with our partners to ingest the captured content into our existing social media access platform.
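The gap-checking step can be approximated with a simple set difference, assuming an Ahrefs outlink export with a Target URL column and a plain list of captured tweet URLs derived from the archive's index; both file layouts here are assumptions.

```python
import csv

# Expected tweet URLs, taken from an Ahrefs outlink export (column name assumed).
with open("ahrefs_outlinks.csv", newline="", encoding="utf-8") as fh:
    expected = {row["Target URL"].strip() for row in csv.DictReader(fh)
                if "twitter.com" in row.get("Target URL", "")}

# URLs actually captured, one per line (e.g. derived from a CDX listing).
with open("captured_urls.txt", encoding="utf-8") as fh:
    captured = {line.strip() for line in fh if line.strip()}

missing = sorted(expected - captured)
print(f"{len(missing)} tweet URLs still to patch")
for url in missing[:20]:
    print(url)
```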
This poster and lightning talk will summarise the capture and quality assurance methods described above, as well as touching on other client-side social media capture processes in use: ESUIT for Facebook and Instaloader for Instagram, which were also employed during the election period. It is hoped that sharing these processes will be of use to other members of the web archiving community and contribute to building a community of good practice in this challenging space.
THE BNF DATALAB SERVICES AND TOOLS FOR RESEARCHERS WORKING ON WEB ARCHIVES
Sara Aubry, Dorothée Benhamou-Suesser
Bibliothèque nationale de France, France
In October 2021, the National Library of France opened “the BnF DataLab” as both a physical and a virtual place whose aim is to facilitate the use of digitized and born digital collections of the Library.
In this poster, we give an up-to-date overview of the services and tools the DataLab offers to support research use of web archives. These have evolved considerably over the past three years, as the research questions, researchers' needs and the different kinds of events hosted in the DataLab have led us to constantly enrich our offer.
The poster will introduce our approach to work with web archives as data through two main applications:
- The “Web archive Research portal” (Archives de l’internet Outils pour la recherche), a selection of different tools to build, explore and export corpora. This portal gathers both web cartography and data visualisation tools developed by research teams, such as Hyphe, tools developed by IIPC members, such as SolrWayback or the Archives Unleashed Toolkit, and utilities developed in-house to search, find and compare URL sets.
- Webkit, an internal tool used by digital curators to extract content in various formats: websites, documents, derivative WARC files, and sets of metadata.
Working with web archives as complex datasets requires an advanced level of digital literacy that the DataLab fosters. We have gained expertise in assisting research projects through their whole lifecycle. Web archivists also take advantage of collaborations with research teams specialised in image search, AI or text mining, and work with them to study their needs and, where possible, adjust the tools and methodologies they develop to our collections.
This poster will thus showcase the BnF DataLab’s approach as a way to expand research use of web archives and develop a community of practices and knowledge.
EXPERIENCES SWITCHING AN ARCHIVING WEB CRAWLER TO SUPPORT HTTP/2
Sebastian Nagel
Common Crawl Foundation, United States of America
HTTP/2 was introduced in 2015 as the second major version of the Hypertext Transfer Protocol. Addressing several issues with HTTP/1.1 (introduced in 1997), HTTP/2 focuses on performance, specifically low latency in loading a web page in web browsers and efficient use of network and server resources. It is supported by virtually all modern web browsers, has been adopted by a significant portion of web sites, and accounts for the majority of HTTP requests on the web. The successor of HTTP/2, HTTP/3, was published in 2022. HTTP/3 is based on the same principles as HTTP/2, but uses a different transport layer to further improve performance.
After a brief technical overview of HTTP/2, we describe how we added support for HTTP/2 to our web crawler with web archiving capabilities and how we ensure backward compatibility with older HTTP protocols. We compare our implementation to other HTTP clients or client libraries.
We share statistics from our crawler about the adoption and usage of HTTP/2 compared to HTTP/1.1 and HTTP/1.0. A brief look at performance metrics illustrates the benefits of HTTP/2 and whether it has helped us save computing resources during harvesting.
Finally, we describe how HTTP/2 requests and responses are stored in WARC files and discuss the storage format in light of proposals #15 and #42, which extend the WARC format specification with respect to HTTP/2. Since HTTP/2 is a fully binary protocol that also transmits HTTP headers in a binary compressed format, we store captures in a format that is fully compatible with HTTP/1.1: textual headers and the payload as it would be passed from the HTTP client to the renderer, HTML parser, and so on. This representation of HTTP/2 captures can be consumed by all known WARC readers without any problems.
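A minimal sketch of this storage approach, using httpx for the HTTP/2 fetch and warcio to write the record with textual, HTTP/1.1-style headers (this is an illustration, not Common Crawl's crawler code):

```python
from io import BytesIO

import httpx                                   # pip install "httpx[http2]"
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

url = "https://example.com/"
with httpx.Client(http2=True) as client:
    resp = client.get(url)
    print("negotiated protocol:", resp.http_version)   # e.g. "HTTP/2"

# Store the decoded payload with textual headers; drop encoding/length headers
# that no longer match the decoded body.
skip = {"content-encoding", "content-length", "transfer-encoding"}
header_items = [(k, v) for k, v in resp.headers.items() if k.lower() not in skip]
http_headers = StatusAndHeaders(
    f"{resp.status_code} {resp.reason_phrase}",
    header_items,
    protocol="HTTP/1.1",
)

with open("http2-capture.warc.gz", "wb") as fh:
    writer = WARCWriter(fh, gzip=True)
    record = writer.create_warc_record(
        url, "response",
        payload=BytesIO(resp.content),
        http_headers=http_headers,
    )
    writer.write_record(record)
```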
WEB SCRAPING IN THE HUNGARIAN WEB ARCHIVE
Gyula Kalcsó
National Széchényi Library, Hungary
Web archives mostly save entire websites using the technique of crawling. However, there are cases where it is not necessary to save a website in high quality, because only the data contained therein, or certain files or parts of a website, are needed. There are also cases where high-quality crawling of certain sites is not possible due to the technical solutions they use, but some of their content can be harvested by scraping. In addition to the basic tasks of a web archive, it may also occasionally perform tasks related to born-digital archiving, where it can likewise use scraping to collect the targeted digital objects.
The poster presents the scraping activities of the Hungarian National Library's Web Archive through a project in which almost half a million photos and their metadata were scraped for the Digital Image Archive from a social image sharing site.
The Digital Image Archive (DIA) is a public online service of the National Széchényi Library, established in 2007. The DIA is a hybrid collection of digitised small prints, illustrations from old books and newspapers, and modern digital photographs. The images can be browsed in a variety of resolutions, with detailed metadata.
Kozterkep.hu is a web community and database for the presentation of artistic works in public spaces and community areas, based on independent and voluntary work. The site currently contains more than 43 thousand artworks, captured in more than half a million photos. Each work is described on an artwork page, which contains metadata and textual descriptions. The site is a great resource for expanding the DIA by scraping its content. Scraping is necessary because only the images and certain data are intended to be saved. Other born-digital collections, such as the national library's podcast collection, can be expanded in a similar way.
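A generic sketch of this kind of page-level scraping with requests and BeautifulSoup is shown below; the artwork-page URL and the selectors are hypothetical and would need to be adapted to the real site structure and its terms of use.

```python
import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

# Placeholder artwork-page URL; selectors below are generic, not site-specific.
page_url = "https://www.kozterkep.hu/example-artwork-page"
resp = requests.get(page_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

metadata = {
    "title": soup.title.string.strip() if soup.title and soup.title.string else "",
    "description": " ".join(p.get_text(" ", strip=True) for p in soup.select("p")[:3]),
}
image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]

print(metadata["title"])
print(f"{len(image_urls)} images found")
```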
The project processes and their interrelationships are presented in a flowchart. The saved dataset will be described, as well as its mapping to standard library metadata and loading into the DIA database. The poster aims to demonstrate how a national web archive can contribute to born digital archiving through scraping, and the workflows and tools needed to do so.
ARQUIVO.PT API/BULK ACCESS AND ITS USAGE
Vasco Rato, Daniel Gomes
Arquivo.pt, Portugal
When collecting large amounts of data for archiving purposes, it is important to ensure that the data is accessible and searchable so that researchers can find the useful data they need. For this reason, in 2018 we implemented Application Programming Interface (API) access to Arquivo.pt, which also allowed microservices to be built on top of it; currently close to half of the web traffic to Arquivo.pt is made through API requests. Every year new projects are built on top of the Arquivo.pt APIs, coming from almost every field, from economic analysis to artistic projects to computer science. Nowadays we have four different APIs which cater to different needs of our community.
More recently, the research and education community has been requesting bulk download of web-archived data and index files, for instance to feed AI training models, optimise routing of web archive requests or recover information from selected websites (e.g. news). Arquivo.pt began making all its index files publicly available in real time to facilitate the bulk download of web-archived data. This opened new possibilities for researchers, who quickly took the opportunity: our network bandwidth usage has increased more than sixtyfold since we implemented bulk download access. Additionally, this allowed artificial intelligence researchers to build Large Language Models (LLMs) with our data, including GlórIA, an LLM for European Portuguese trained on 35 billion tokens.
With this poster we hope to show the importance of providing users with this kind of access to the archived data, and to detail how our APIs and services are used by the community.
POLITELY DOWNLOADING MILLIONS OF WARC FILES WITHOUT BURNING THE SERVERS DOWN
Pedro Ortiz Suarez, Thom Vaughan, Greg Lindahl
Common Crawl Foundation, United States of America
With the ever-growing interest in web data and web archives, driven in part by the advent of Large Language Models (LLMs), Artificial Intelligence (AI) and Retrieval-Augmented Generation (RAG), web archivists managing open repositories are confronted with an unprecedented quantity of download requests. With web archiving infrastructure sometimes constrained in resources, it has become increasingly difficult to properly serve and fulfill all of these new incoming requests without them taking down our servers or completely saturating our network, as clients and users apply far too aggressive retry policies, sometimes without even knowing it, when they try to download our archives. This increasing volume of requests has made it harder and harder to effectively serve open web archives and allow users to download the data in a timely manner. This has been the case even for organizations like ours, the Common Crawl Foundation, which benefits from the Amazon Registry of Open Data on AWS and has access to industrial-grade infrastructure.
With this in mind, we first study the existing software that has typically been used to download the Common Crawl snapshots. We find that the standard tools and HTTP clients used to request our data leave a lot to be desired in terms of portability, ease of use, reliability and availability of crucial features such as retry strategies.
As such, we develop and release an open-source, cross-platform, dependency-free and user-friendly tool that implements an HTTP client with an easy-to-configure retry strategy using exponential backoff and jitter. These more polite retry strategies allow users to avoid download errors and to download data more quickly, as exponential backoff with jitter lets the server handle concurrent requests more easily. The released tool can easily be recompiled with different default values so that web archivists and other archiving organizations can distribute a download tool tailored to their infrastructure's size and their users' needs.
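The retry behaviour described above is a standard pattern; a minimal Python sketch of exponential backoff with full jitter (illustrative only, with invented parameter defaults rather than the tool's actual ones) looks like this:

import random
import time
import urllib.request
from urllib.error import HTTPError, URLError

def polite_get(url, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Fetch a URL, retrying transient failures with exponential backoff
    and full jitter so concurrent clients do not retry in lockstep."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except HTTPError as exc:
            # Retry only throttling and server-side errors.
            if exc.code not in (429, 500, 502, 503, 504) or attempt == max_retries:
                raise
        except URLError:
            if attempt == max_retries:
                raise
        # Full jitter: sleep a random time in [0, min(max_delay, base * 2^attempt)].
        time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

Because each client picks a random share of the backoff window, retries from many clients spread out instead of arriving in synchronized bursts.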
NEXT STEPS TOWARDS A FORMAL REGISTRY OF WEB ARCHIVES FOR PERSISTENT AND SUSTAINABLE IDENTIFICATION
Eld Zierau
Royal Danish Library, Denmark
The aim of this poster is to continue the discussion of establishing a formal web archive registry, based on community input. This was presented in a lightning talk at IIPC 2024, and a community meeting about it is expected late this winter.
The purpose of such a registry is to make it possible to identify the web archives from which web materials have been used in research, either as part of collections or as specific references to web pages. No matter how a web element is referenced today, it needs to remain traceable in the future where it was originally archived and where it can then be found at that point in the future. Looking at archive URLs, we already see web archives shifting to new Wayback Machines, where the URLs therefore change; there are also examples of web archives where the resources have been moved to a new domain, as in the case of the Irish web archive. Other ways of identifying web archive materials will also rely on such a registry in order to find the identified resource.
The poster will present the various options for how to identify a web archive uniquely over time, and which parts would be relevant to register when dealing with a registry of relatively stable information: for example, a unique web archive identifier, the date interval for which the registered information was/is valid, a readable alias for the identifier in the given period, possibly access services for web identifiers, or a valid URL for web pages in the archive. The poster will also present outcomes or summaries of important issues that have been discussed.
USING WEB ARCHIVES TO CONSTRUCT THE HISTORY OF AN ACADEMIC FIELD
Tegan Pyke
University of Bergen, Norway
Link rot is one of the biggest threats to the field of digital literature. This phenomenon leads not only to the loss of individual digital literary works, which are often hosted as webpages or websites, but also to the loss of digital literary dissemination and preservation efforts. Such efforts, usually instigated by scholars or practitioner-scholars, are largely dependent on academic grant funding. As a result, the continued presence of web-based project resources is similarly funding-dependent: hosting for a project's resources inevitably expires when financing comes to an end and no other funding sources are found.
While no longer available by conventional browsing means, many 'lost' digital literature community hubs and documentation projects have been captured in web archives, either partially or fully. Here, they continue to exist alongside captures of former iterations of still-existing sites.
This presentation provides a case study of the use of web archives in research. It documents how, by combining web captures, found documents, internal communications, and inquiry-based reading, a rich and detailed history of the digital literature field has been uncovered, one that shows the field's changing approaches to its subject matter in practical terms.
In doing so, the value of web archive collections as data will be highlighted, not just for their ability to facilitate quantitative data analyses but also qualitative ones. Ultimately, this points to the importance of decentralisation as sustainable data management, a practice the digital literature field has recently recognised, with multiple projects looking to utilise the open-source Wikidata platform to assure long-term preservation of project-related data.
ARQUIVO.PT ANNUAL AWARDS: A GLIMPSE
Daniel Gomes
Arquivo.pt, Portugal
The Arquivo.pt annual awards began in 2018 to showcase innovative web archiving use cases. The main goal is to award works that use preserved historical information from the web so that they raise awareness about the importance of web archiving for digital societies.
The works can address any subject as long as they use Arquivo.pt as a source of information. These works should use the research and access services made publicly available by Arquivo.pt and clearly demonstrate the usefulness of the service. Works can be done individually or in groups. The awards are 10,000€ (1st place), 3,000€ (2nd place) and 2,000€ (3rd place). Partnerships have been established to promote the awards and grant additional Honourable Mentions. In the 2024 edition, three Mentions were granted by the Público newspaper, the DNS.PT Association and the Aveiro Media Competence Center.
Across its seven editions, 194 applications were received and the 29 awarded works spanned several areas of knowledge, such as Media Studies, Economics, Health, Ecology and Computer Science (winners at arquivo.pt/awards). For example, the winners of 2024 were:
- “Noticioso” (https://noticioso.pt/) is a platform where users can compare media coverage of various topics through a game (Quiz) and explore trends over time using an analytical tool.
- “Habitação” (https://www.habitacao.net/) is a tool to interactively explore the evolution of the average value of the Portuguese housing and rental market, contextualised with news published on the subject and housing policies over time.
- “Pegada Lusa” (https://pegadalusa.pt/) is a work that shows the evolution of sustainable policies and initiatives in the various regions of Portugal, based on an analysis of projects and good practice from the United Nations Sustainable Development Goals (SDGs).
Researchers and higher education students from Portugal and Brazil have been the main applicants. However, anyone in the world can apply.
Some submitted works have already made use of works from previous editions. For example, Conta-me Histórias (Tell Me Stories) is an online service which provides a temporal narrative about any subject using 24 electronic news platforms. It won the first edition of the awards and was integrated into the Arquivo.pt UI to support the “Narrative” function. Some recent applications used Conta-me Histórias as a research tool for their work. Many other works, however, disappeared after their application. Even these unfortunate cases demonstrate the importance of web archiving, because the project websites were web-archived and can still be consulted and referenced. For example, the “Major Minors” project, winner of the 2021 edition, is available at https://arquivo.pt/wayback/20211103003731/http://minors.ilch.uminho.pt/.
The Arquivo.pt Awards have had the High Patronage of the President of the Portuguese Republic since 2019. The awards have been presented in person by figures such as the Minister of Science and Higher Education and the Prime Minister of Portugal during the annual Portuguese Science Summit (Encontro Ciência).
ASYNCHRONOUS AND MODULAR PIPELINES FOR FAST WARC ANNOTATION
Pedro Ortiz Suarez, Thom Vaughan
Common Crawl Foundation, United States of America
Since the introduction of unsupervised and semi-supervised learning to Natural Language Processing (NLP), many advances have been made that allow researchers to annotate, classify and index large quantities of multilingual and heterogeneous textual data. Moreover, the ever-growing demand for data and compute to satisfy the needs of modern-day Large Language Models (LLMs) has made some of the initial smaller models more widely available and cost-effective to use, allowing more researchers with smaller infrastructures to access them. Furthermore, the fact that most modern LLMs and unsupervised NLP models are developed by pre-training them on large quantities of textual web data has led NLP researchers to develop pipelines to annotate, filter and select data from WARC and ARC files in order to extract "relevant" web documents and use them in LLM training. And while the final goal of these pipelines is to produce data to train large models, they have effectively become compelling tools to explore, annotate and analyze large web corpora by directly processing the raw data instead of working at the metadata level.
However, even though some private companies have made efforts to develop and maintain such pipelines, for example NeMo Curator from NVIDIA or Datatrove by Hugging Face, most of them have been developed as part of research projects that are not maintained long-term, or have been kept secret in order to obscure the pre-training step of LLM training for political, legal or commercial reasons.
On the other hand, the web archiving community has developed its own software to process and analyze WARC files over the years, such as the Archives Research Compute Hub (ARCH), the Archives Unleashed Toolkit or ArchiveSpark. While most of these are well maintained, they have focused on integrating more classical analysis or annotation tools and have been optimized to act on the metadata rather than on the raw content itself.
We thus draw inspiration from existing pipelines developed by both the NLP and the web archiving communities and develop our own modular pipeline for WARC annotation. The first experimental version of our pipeline aims to be efficient, so that it runs on constrained infrastructures; modular, so that practitioners can develop and attach custom annotators; open source, so that we can foster a community that will continue to maintain it; and user-friendly, so that little knowledge of large-scale data processing engines is needed. With this first version we want to bring some of the tools and techniques from the NLP community to the web archiving one, hoping that both communities can benefit from it and use it to better categorize, annotate, filter and finally explore their existing web datasets.
CONSORTIUM ON ELECTRONIC LITERATURE (CELL)
Hannah Ackermans
University of Bergen, Norway
As a literary genre, electronic literature (E-Lit) is characterized by creative experimentation and subversive use of digital media. Functioning outside the confines of traditional publishing, divergent practices of collecting, documenting, and archiving E-Lit have emerged within the field’s community. Founded in 2009, the Consortium on Electronic Literature (CELL) overarches these vital efforts to preserve the field.
After some years of inactivity of the CELL project, the Center for Digital Narrative (University of Bergen, Norway) is revitalizing this common goal of bringing together divergent archival and documentation practices as CELL, The Index. CELL, The Index has a new interface, infrastructure, and data model that will ensure longevity and lessen the burden on individual scholars.
We are migrating the outdated CELL to a new infrastructure and, with the rise of linked open data, we have chosen to use the collaborative, open-data knowledge graph Wikidata. The advantages of Wikidata are numerous. As a document-oriented database, Wikidata is ideally suited to the weird and wonderful ways that electronic literature breaks the boundaries of what literature and art can be. Based on our carefully developed data model, the documentation on the Wikidata platform will be expansive. Wikidata allows for multilingual documentation, which furthers the trend of global inclusivity in the electronic literature community. This migration provides new opportunities to make metadata about electronic literature available to more people, as well as promoting the individual electronic literature databases that have been doing foundational work over the decades. Our storage work creates sustainable access to web archive metadata, especially as some collections go offline after funding periods end.
At the same time, we are creating our own focused Wikibase environment with the taxonomy that is relevant to the electronic literature community in particular. Each record in the Wikibase will contain links to the various places in which a work is documented or archived. In some cases the work itself has become unavailable but has either been archived for preservation in its entirety or is documented. In those cases, screenshots and walkthroughs created by individual working groups stand in for the original work. For this reason it is especially useful to have the overarching Consortium Wikibase linking to all these efforts, to piece together what is left of the work.
In partnership with a variety of institutions archiving electronic literature, we are providing access for reuse of collections and datasets.
DESIGNING ART STUDENT WEB ARCHIVES
Katherine Martinez
The New School, United States of America
Establishing direct partnerships with faculty and academic programs has become essential to collecting and preserving born-digital student work at The New School, and is rising as an approach to web archiving. These relationships also provide a rare window onto how digital collections are accessed by researchers. These insights have led the New School Archives to evaluate the way archived websites are organized within the collections. One especially influential initiative began in 2022, when administrators from the Creative Publishing and Critical Journalism Master’s program contacted the Archives to request assistance with preserving an annual online student publication, BackMatter. They wished to safeguard against the disappearance of previous issues at the end of the publishing platform’s subscription cycle, a recurring issue with online student projects. As an ongoing, particularly difficult series of sites to capture, BackMatter presents an ideal case study for how highly technical, labor-intensive solutions for archiving websites are now described in the Archives’s accession records, finding aids, and various access points.
Cargo Collective, the web publishing service selected for BackMatter, supports customizable themes that are preferred by students at The New School, where there is an emphasis on art and design through transdisciplinary programs. The platform is frequently used by professional artists and designers in the New York City art world for building personal websites. While Cargo Collective’s design templates are visually appealing, accessibility problems frequently arise during the capture process that lead to broken menus, missing media, and other viewing obstacles in web emulators. This has sometimes required creative solutions to provide immediate access, such as low tech page captures through PDFs and screen recordings. Documenting this information provides critical stewardship history that may be later referenced by managing archivists, and also communicates how to access archived pages to researchers. Over the past two years, the department has tested and revised internal web archiving standards, facing unexpected challenges during each stage of this process, such as data loss related to the reorganization of seeds in Archive-It. As a result, the web archiving team of two staff members has learned to creatively adapt workflows and consider alternative solutions.
While the New School Archives has begun to encourage faculty and students to consider web archiving factors during the web design process, it remains difficult to share this information widely across the University community, or see the knowledge put into practice. By promoting more partnerships like the one established with Creative Publishing and Critical Journalism, the Archives is moving closer to becoming an essential resource for ensuring long-term access to student websites.
FAILED CAPTURE OR PLAYBACK WOES? A CASE STUDY IN HIGHLY INTERACTIVE WEB BASED EXPERIENCES
Mari Allison
Independent Researcher, United States of America
12 Sunsets is a dynamic and interactive website that allows visitors to “drive” through time and space to explore a series of archival photographs. The archive consists of over 65,000 contiguous photographs that the artist Ed Ruscha took by driving a truck down Los Angeles’ Sunset Boulevard, an experiment he repeated over a period of 50 years.
Both the high degree of interactivity and the sheer volume of content presented challenges for capturing 12 Sunsets, and these issues will only become more common as web technology advances. The site uses Mapbox to represent the streets of Los Angeles, with photographs above and below representing each side of the street. The user presses the left/right arrow keys to scroll horizontally through the streets, and the up/down arrow keys to view different years of the same location. It would take hours to manually scroll through every photograph of each year.
Our current web archiving infrastructure is based entirely on Archive-It for both capture and delivery, although on rare occasions we have captured sites using Conifer and uploaded the WARC to Archive-It. Our quality assurance process was also very basic: we simply reviewed the capture in the Wayback Machine, and if we saw any issues, we would try to recapture the site. Our approach to 12 Sunsets was very experimental, but not necessarily methodical. When Archive-It didn’t work, we tried Conifer; when Conifer didn’t work, we tried archiveweb.page; and so on. We used several tools to try to capture this site before asking the question: what if the capture technology isn’t the issue? From there we began experimenting with extracting our WARCs from Archive-It and uploading them to different replay tools.
Through this approach we gained a lot of knowledge that will be carried forward to future web archiving projects. However, it was probably not the most efficient path to obtaining a working capture. This poster will describe the methods used to successfully capture 12 Sunsets and outline a few troubleshooting and QA workflows that would improve the efficiency of future complex captures. The main goal is to determine whether an error is caused by an issue with the WARC or with the replay technology, so that we can be more focused when developing potential solutions. We also must address the question: what do we do if the best tool for the situation falls outside of our current web archiving infrastructure?
FROM NEW MEDIA ARCHIVES ON SOCIAL MEDIA PLATFORMS TO WEB ARCHIVES - CHALLENGES IN PRESERVING SCRAPED CULTURAL MATERIALS
Camilla Holm Soelseth
OsloMet, Norway
In 2019 I scraped 35,000 posts from Instagram tagged with #instapoesi, the Norwegian equivalent of #instapoetry. The intent was to capture an extensive collection to explore a cultural phenomenon and use the posts (with their metadata and paratextual elements). When I scraped these posts, it was still possible to scrape social media with scripts. Soon after, however, partly following the Facebook-Cambridge Analytica data scandal, Meta and other social media companies shut down their APIs. Even semi-manual collection tools that work through the web interface, like Zeeschuimer and 4CAT, now experience data capture limitations.
At the same time, modern folk culture and popular culture have become platformized. Social media platforms are infrastructures of cultural production and access to popular culture today. This also relates to aesthetic practices of doing culture, such as instapoetry, a type of folk poetry or popular lyric. This means that it is only on social media that it is possible to trace the development of a poem or a cultural expression as it moves into folk culture and experiences changes, appearing in new formats and expressions.
With this as a background, the temporary data set, consisting mostly of poems that are part of contemporary folk culture, has now changed status. As it is no longer possible to retrieve these hashtagged archives and take them out of the platform to study them, they suddenly hold more value as an important cultural source on the development of popular lyrical traditions and practices in Scandinavia. This data set, despite its challenges, is a treasure trove for understanding that evolution. As a collection, however, there are challenges in how best to archive it. While it is a large collection, it is in no way made up solely of content from public personas. Additionally, as users control their own content, some of it has been deleted, both posts and accounts.
In the past, the UK National Library has collected material through user submissions, an approach with both flaws and benefits. It has also explicitly focused on single poems rather than on posts, likewise tagged with #instapoetry, that contain photos of different material instantiations, either new versions or photos of new materializations such as posters and cross-stitches.
In this presentation, I want to bring forth the challenges of preserving a collection of scraped Instagram images related to a cultural phenomenon, and to compare it with similar projects and how they can help or challenge approaches to archiving contemporary folk culture as it comes to life on social media platforms. The presentation further highlights how changing platform regulations affect our ability to study and preserve contemporary culture today. It therefore contributes to reflection on best practices when faced with digital ethics challenges and with large corporations running the infrastructures of contemporary culture.
The contribution, therefore, advances the understanding of curation, collections, infrastructures, and discovery and access to contemporary popular and folk culture.
HAWATHON: PARTICIPANTS EXPERIENCE
Ingeborg Rudomino, Anamarija Ljubek
National and University Library in Zagreb, Croatia
The National and University Library in Zagreb (NSK) and its programme the Croatian Web Archive (HAW) have been conducting the HAWathon project since 2023. The project is based on crowdsourcing and has been envisaged as a competition in collecting relevant content as well as participation in the creation of thematic collections. The aim of HAWathon is to collect URLs of websites related to a specific topic, while following the exact selection criteria of the HAW for cataloging and archiving, for the purpose of creating online thematic collections. In this project, the HAW collaborated with public and school libraries and secondary schools throughout Croatia and organized HAWathons in several Croatian cities, which resulted in new thematic collections on the HAW portal. The HAWathon project was implemented as part of a regular computer science class, with secondary school students as participants; through the HAWathon competition, they strengthened their digital literacy and got acquainted with web archiving in general. All HAWathon participants filled out an evaluation questionnaire based on the project's formative assessment template.
The poster will present the results of the evaluation questionnaire completed by the project participants. The results will show how the participants evaluated the organizational aspect of the project, which obstacles they encountered and what they liked the most. The poster will highlight the advantages and disadvantages based on the experience in organizing and conducting such a competition and present the future steps.
IMPLEMENTING THE E-ARK STANDARD FOR INGEST OF SOCIAL MEDIA ARCHIVES: GOALS, OPPORTUNITIES AND CHALLENGES
Nastasia Vanderperren, Ellen Van Keer
meemoo, Flemish Institute for Archives, Belgium
Social media are important but complex digital resources. They are particularly voluminous, volatile and heterogeneous, containing multiple digital content types. It is a major challenge for heritage institutions to ensure that these dynamic born-digital multimedia archives are preserved in their integrity and remain readable, accessible and interpretable over time, independent of the platform they were created on or the system they are preserved in. Implementing international open standards and endorsing community-supported best practices is crucial here. To tackle the challenge, meemoo, the Flemish Institute for Archives, has adopted the e-ARK standard to optimize the delivery of social media archives to its central repository.
The e-ARK standard1 was developed to provide a pan-European OAIS implementation format. OAIS is an international standard (ISO 14721) defining the requirements for an archival preservation system to provide long-term preservation of digital information. A core concept is the Information Package (IP), a conceptual container that packages a digital object with information about its content, its files and the package itself. By bundling all this information, digital objects can remain accessible over time, even when software or technologies change, as we know they rapidly do. In addition, the e-ARK implementation format defines a generic package structure and a set of metadata elements from open archival standards (METS, PREMIS). Notably, an e-ARK-compliant container allows the packaging of different representations of a digital resource (e.g. in various file formats) and stores them in separate folders, which can be managed separately. This is particularly useful for social media archives, as output formats can take many forms (e.g. JSON from APIs, WARCs from web crawlers...) that will require different preservation strategies. Specific (e.g. preservation) information is stored at the representation level. At the package level, the information required to keep the structure, the content and all the files of the container readable and interpretable is stored. Thus, e-ARK-compliant containers are standardised, interoperable, self-contained and scalable, as well as able to support a wide variety of content types and optimize long-term preservation management.
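For illustration, a simplified layout of such an information package might look as follows (a sketch based on the e-ARK Common Specification; the representation names and file types are assumptions, not meemoo's actual SIP profile):

social-media-sip/
├── METS.xml                  # package-level structural metadata
├── metadata/
│   ├── descriptive/          # e.g. records describing the archived account or stream
│   └── preservation/         # e.g. PREMIS events and agents
└── representations/
    ├── rep1-warc/
    │   ├── METS.xml
    │   └── data/             # WARC files produced by a web crawler
    └── rep2-json/
        ├── METS.xml
        └── data/             # JSON exports from a platform API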
At meemoo, adopting the e-ARK standard was first explored as part of a best-practices project on social media archiving in cultural archives.2 From an archival perspective, a central concern is to ensure ingest and long-term preservation of archived social media streams in their full complexity and integrity. At the conference, we want to present the technical SIP specification we have developed for submitting social media archives into our central repository. We want to explain why and how we are implementing the e-ARK standard to tackle, standardise and optimise the delivery and management of these particularly complex digital resources. We want to discuss the implementation decisions we made as well as raise questions we still have, e.g. where and how to manage rights information, or how to manage incremental content. As this is a work in progress, gathering feedback and input from the sector is crucial for us.
1 https://digital-strategy.ec.europa.eu/en/activities/earchiving
2 https://meemoo.be/en/projects/best-practices-for-archiving-social-media-in-flanders-and-brussels
PLANNING WEB ARCHIVING WITHIN A FOUR-YEAR SCOPE: MAKING THE NEW COLLECTION PLAN FOR THE YEARS 2025-2028 IN THE NATIONAL LIBRARY OF FINLAND
Sanna Haukkala
National Library of Finland, Finland
The National Library of Finland has been crawling the Finnish online sphere for over 20 years, and web archiving has been a legally bound task of the library since 2008. Planning for the future has always been challenging, but how to reflect on the past and what to expect from the future when you are required by law to plan web archiving and preservation of electronic legal deposits in four-year cycles?
The collection plan encompasses both web archiving and the preservation of other electronic publications, and the law requires multiple viewpoints and focus areas to be written out for the next four-year phase. The required focus points in the collection plan are the planned scope for collecting online materials and other related deposit practices, technical development plans, and perspectives on the usage of preserved online materials (both on legal deposit workstations and in more general use). The plan is also required to include considerations for research and cultural-historical archiving as well as the equal treatment of online publishers.
The collection plan has been written and updated since the enactment of the Act on Collecting and Preserving Cultural Materials (1433/2007) in 2008; however, in 2024 the web archiving team decided to create separate, more flexible documentation for content selection for the Finnish Web Archive harvests, a sort of scoping documentation.
The idea for separate documentation was to keep the four-year collection plan as simple as possible, while making the scoping document more flexible to work with, opening more details and perspectives of web archiving work to the public that had been previously kept out of the four-year collection plan. This was done mostly because the format of the collection plan is defined in the legislation, keeping it quite strict.
A comparison of different countries performing web archiving showed that many countries had different kinds of documentation, mostly a permanent document for general collection documentation, and some countries had separate scope-related documents for different content types. The same kind of time-bound collection plan was not a common way of recording web archiving plans, as most seemed to be written from a more general and permanent point of view rather than as planning documentation for specific years. The scoping documents that were compared usually had more frequent update cycles and no requirement for ministry-level approval rounds.
The poster will present the history of the Finnish collection plan and the differences between the two types of documentation, what different perspectives a separate plan can bring in, and how these two documents together provide a fuller view of web archiving and the preservation of online materials and publications. Both documents will be published at the beginning of 2025.
REDIRECTS UNRAVELED: FROM LOST LINKS TO RICKROLLS
Kritika Garg1, Sawood Alam2, Michele Weigle1, Michael Nelson1, Mark Graham2, Dietrich Ayala3
1: Old Dominion University, United States of America; 2: Internet Archive, United States of America; 3: Filecoin Foundation, Netherlands
The dynamic and ever-changing nature of the World Wide Web often requires the implementation of URL redirects to accommodate evolving website structures. However, these redirects introduce complexities that affect web usability, SEO, and digital preservation efforts. In this study, we analyze redirecting URLs to uncover patterns in redirect usage. Using the Heritrix web crawler in September 2023, we conducted a comprehensive crawl of 11 million redirecting URLs, allowing for up to 10 redirects per URL. Half of the URLs ended at live web pages, while the other half resulted in error statuses. Only 0.42% of URLs required more than four redirects, which suggests that setting a cap of five redirects during a crawl could be effective.
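To make the crawl setup concrete, the following sketch follows a redirect chain with a hard cap (illustrative Python using the requests library, not the Heritrix configuration actually used in the study):

from urllib.parse import urljoin
import requests

def follow_redirects(url, max_hops=5):
    """Follow a redirect chain up to max_hops and return the visited URLs
    plus the final status code (None if the cap was reached)."""
    chain = [url]
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False, timeout=30)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return chain, resp.status_code
        location = resp.headers.get('Location')
        if not location:
            return chain, resp.status_code
        url = urljoin(url, location)
        chain.append(url)
    return chain, None  # redirect cap reached without resolving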
We categorized the redirects into two groups: canonicalized and non-canonicalized. Canonical redirects (6 million URLs) adhere to web standards, such as switching from HTTP to HTTPS, consolidating URL variations into a single form, and generally mapping to the same TimeMap in archives. Users typically do not notice these redirects while browsing.
The remaining 3.5 million non-canonical redirects often indicate changes in website structure, such as domain shifts or path changes, typically seen during migrations, rebranding, or security updates. While these URLs may return a "200 OK" status, they often lead to root pages or irrelevant content, causing the loss of specific content.
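One simple way to draw this distinction (a heuristic sketch with assumed normalization rules, not the exact criteria used in the study) is to compare canonicalized forms of the source and target URLs:

from urllib.parse import urlsplit

def canonicalize(url):
    """Reduce a URL to a rough canonical form: ignore the scheme,
    lower-case the host, drop a leading 'www.' and any trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    path = parts.path.rstrip('/') or '/'
    return host, path, parts.query

def is_canonical_redirect(source, target):
    """True if the redirect only normalizes the URL rather than moving content."""
    return canonicalize(source) == canonicalize(target)

# An HTTPS upgrade counts as canonical; a move to the root page does not.
assert is_canonical_redirect('http://example.com/page', 'https://www.example.com/page/')
assert not is_canonical_redirect('http://example.com/old-page', 'https://example.com/')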
We identified several types of "sink" URLs, where multiple redirects converge on a single target. Examples include organizations consolidating traffic to a single root page, such as regional sites redirecting to a global platform, often losing original content. We noticed many online e-commerce domains redirected to a single sink, likely for affiliate marketing. Login pages, triggered by embedded share buttons, also serve as sinks, where crawlers repeatedly archive login pages instead of the intended content.
Search engines like Google and Bing acted as sinks for outdated or broken URLs, leading users to their homepages and indicating lost original content. Hosting services similarly redirected inactive domains to generic landing pages, reflecting attempts to keep user engagement despite losing the original content. We also found 62,000 custom 404 error pages, many being "soft 404s" where missing content was archived as valid, wasting resources. Lastly, the viral "Rickroll" video served as a unique sink, humorously redirecting users and highlighting cultural rather than functional redirection.
This study highlights the crucial role of URL redirections in web management and the challenges they pose for digital integrity, preservation, and user experience. Our findings offer insights to improve strategies for managing redirects and ensuring secure, accessible online content.
ROBOTS.TXT AND CRAWLER POLITENESS IN THE AGE OF GENERATIVE AI
Sebastian Nagel, Thom Vaughan
Common Crawl Foundation, United States of America
The robots.txt standard was initially proposed in 1994 as a way for website owners to signal to web crawlers how to best crawl their sites. A text file called "robots.txt" is placed in the root folder of a web site and contains access policies that specify which file paths web crawlers ("robots") are allowed to read on the site. Access policies can be specified for individual crawlers by "user-agent" name, or by a wildcard rule block that catches all crawlers not addressed by a named policy.
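Python's standard library implements this matching logic; a small sketch (with a hypothetical robots.txt and invented crawler names) shows how per-agent rule blocks and the wildcard block are evaluated:

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with one named rule block and a wildcard block.
robots_txt = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named block applies only to ExampleBot; other crawlers fall back to '*'.
print(rp.can_fetch('ExampleBot', 'https://example.com/private/page'))  # False
print(rp.can_fetch('OtherBot', 'https://example.com/private/page'))    # True
print(rp.can_fetch('OtherBot', 'https://example.com/tmp/page'))        # False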
As a convention based on consensus, not a legally binding regulation, the robots.txt proposal has nevertheless been adopted by all major web search engines and has been extended in a variety of ways, for example to express more fine-grained access rules, to define URL canonicalization rules, or to specify the location of a "sitemap", an exhaustive list of pages, images, and videos on the site, which gives crawlers a way to quickly enumerate the site's resources while avoiding duplicates.
The various extensions led to differences in how individual search engine crawlers implemented the robots.txt standard. To formalize the robots.txt standard, Google researchers submitted an RFC proposal in 2019, along with an example robots.txt parser implementation. In 2022, the Robots Exclusion Protocol was standardized by RFC 9309.
Very recently, the robots.txt standard has seen increased interest due to the rise of generative artificial intelligence, large language models, and machine learning in general. Many ideas have been proposed to extend the standard and to allow content owners to opt out of certain use cases, such as training machine learning models and specifically generative AI. We will provide a brief overview of some of these proposals.
We also study how robots.txt is being used across the web by webmasters and site owners. We analyze which web crawlers are addressed by specific rule blocks and whether there is a bias in favor of certain search engines or of web crawlers focused on other use cases, such as search engine optimization or machine learning. We also look at how access policies change over time. To do this, we evaluate eight years of archived robots.txt files. We compare our results to previous research, and also present preliminary numbers on the usage of other machine learning opt-out protocols that have emerged recently.
As there is a growing demand to use web archives as a data source for training machine learning models, both among academic researchers and for-profit companies, we hope that our presentation will help in understanding the impact of robots.txt and analogous protocols on web archive harvesting or access policies.
SOLVING THE PROBLEM OF REFERENCE ROT VIA WEB ARCHIVING: AN OA PUBLISHER’S SOLUTION & FUTURE SOLUTIONS IN THOTH
Miranda Barnes
Loughborough University, United Kingdom
In an increasingly digital publishing landscape, the issue of link rot, and specifically reference rot in scholarly articles and monographs, is a known problem. In the past, print works were the primary reference source, but in the current landscape many scholars access their research via the internet. A study performed by Ahrefs found that 66.5% of links sampled (over 2 million) from the prior decade were broken and did not lead to the intended content.1 Content drift, where the link directs to entirely different content or potentially insidious content, is also an issue.2 Scholars referencing web-based content, whether a website or a digital publication, need to ensure the content they reference in their research is consistently accessible. This is essential to the integrity of both their work and future research. While university presses or presses otherwise tied to an institution may have access to a perma.cc membership, many independent publishers will find membership a prohibitive cost.
Open Book Publishers (OBP), an established publisher of open access books and a partner in the COPIM project (2019-2023) and the Open Book Futures project (2023-2026), archives the reference links in its books via an open-source script.3 The script is available for other publishers to use, via the link on GitHub. Developed by the publisher's software development team, it automatically scans the PDF version of a book for reference links and backs them up via the Internet Archive's Wayback Machine at the time of publication. These archived versions of the references then remain available, ensuring future access, as long as the original link was still unbroken at that point. OBP is now investigating possibilities for moving this archiving step earlier in the production workflow, to ensure that archived links represent the authors' intentions as closely as possible (including providing the opportunity to correct links found to be broken) and to enable the archived URLs to be inserted into the digital and printed editions prior to publication.
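As an illustration of the archiving step only (not OBP's actual script), a single reference URL can be submitted to the Wayback Machine's public "Save Page Now" endpoint and the resulting snapshot URL recorded; the exact response behaviour and rate limits of the endpoint may change, and authenticated access is advisable for bulk use:

import requests

def archive_reference(url):
    """Ask the Wayback Machine to capture a URL and return a snapshot URL.
    Illustrative sketch only; the endpoint is rate-limited for anonymous use."""
    resp = requests.get('https://web.archive.org/save/' + url,
                        timeout=120, allow_redirects=True)
    resp.raise_for_status()
    # The capture endpoint usually reports the snapshot path in Content-Location
    # or redirects to the archived copy; fall back to the final URL otherwise.
    snapshot = resp.headers.get('Content-Location')
    if snapshot:
        return 'https://web.archive.org' + snapshot
    return resp.url

print(archive_reference('https://example.com/'))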
Thoth, the open metadata and dissemination system developed within COPIM, is working to ensure that it can store the archived URLs for all references in a work (as well as links to archived versions of the work itself). In combination with OBP's open-source technology, Thoth could then provide a service for publishers, allowing them to create and store archived URLs and insert the archived reference links back into the published works, helping prevent complete loss of referenced web resources in the future. Clearly there will be limitations, as books on publisher backlists will likely already have some broken reference links when they are added to Thoth's database. Work is also ongoing within Open Book Futures' Work Package 7 (Archiving and Preservation) to determine a way to influence changes in PDF standards, moving towards the inclusion of web-archived reference links as a future method for preventing reference link rot and loss.
We will show current and proposed future workflows, providing examples of how these can solve certain link rot issues.
1 https://ahrefs.com/blog/link-rot-study/
2 https://doi.org/10.1629/uksg.237
3 https://github.com/thoth-pub/archive-pdf-urls
USE OF SCREENSHOTS AS A HARVESTING TOOL FOR DYNAMIC CONTENT AND USE OF AI FOR LATER DATA ANALYSIS
Gaja Zornada, Boštjan Špetič
Computer History Museum Slovenia (Računališki muzej), Slovenia
The rapidly evolving nature of dynamic web content poses significant challenges for digital preservation. Traditional web harvesting tools often struggle to capture and archive interactive and complex media such as dynamic web pages, social media platforms, or AJAX-driven applications. Screenshots have the potential to provide an alternative approach, capturing visual representations of dynamic online content in a more straightforward and basic manner when automatic harvesting methods are not applicable. However, while screenshots preserve the appearance of web pages, they lack the underlying structure, which makes later data extraction difficult.
We postulate screenshots as a convenient, compact method of harvesting demanding content, because the format is interoperable and lends itself to machine analysis and further application development. Screenshots can surpass video capture of the user experience, since they demand less storage space and fewer human resources for acting out the use, given efficient simulation of contextual or personalized content, and considering the sustainability and ethical questions of archiving real-time content. Screenshots are also easily accessible and, being easy to implement, provide ample opportunities for citizen science engagement.
This case study explores the value of using screenshots as a complementary harvesting tool (for interactive content, e.g. video games) or as a primary harvesting tool for dynamic content, and the application of Artificial Intelligence (AI) techniques for post-harvest data analysis based on Microsoft AI's implementation best practices. We propose a framework where screenshots are utilized to capture dynamic and ephemeral web content that traditional crawlers may miss. By leveraging AI-powered image recognition, Natural Language Processing (NLP), and Optical Character Recognition (OCR), we demonstrate how the visual content of these screenshots can be analyzed to extract meaningful data, such as text, metadata, and user interactions.
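For the OCR step specifically, a minimal sketch (assuming the pytesseract bindings and a local Tesseract installation; the image recognition and NLP components are not shown) could look like this:

from PIL import Image
import pytesseract

def extract_text_from_screenshot(path, lang='eng'):
    """Run OCR over a saved screenshot and return the recognized text."""
    image = Image.open(path)
    return pytesseract.image_to_string(image, lang=lang)

# Example: extract the text of one captured page for later indexing or NLP analysis.
text = extract_text_from_screenshot('capture-2024-01-01.png')
print(text[:200])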
ADVANCING PARTICIPATORY DEMOCRACY THROUGH WEB ARCHIVING: THE KRIA ICELANDIC CONSTITUTION ARCHIVE
Eileen Jerrett
KRIA Icelandic Constitution Archive, United States of America
This presentation will explore the KRIA Icelandic Constitution Archive, a pioneering web archive that preserves the world's first crowdsourced constitution. The KRIA Archive documents the democratic process behind Iceland’s constitution, capturing the extensive citizen engagement that shaped the draft. However, many of the digital artifacts representing this citizen engagement, including public comments, social media interactions, and live sessions, have started to disappear over time.
In response, a partnership was formed with the Icelandic Constitution Society, a dedicated group of activists, Archive-It, and the University of Washington. Through this collaboration, we have successfully recovered a significant portion of these crucial digital artifacts, ensuring that the legacy of this groundbreaking democratic process is preserved for future generations.
The presentation will detail the curation and collection strategies employed in the KRIA Archive, addressing the challenges of capturing and preserving dynamic content. It will also cover the tools and infrastructure used to ensure the sustainability and long-term preservation of these digital materials. By showcasing the KRIA Archive, this presentation will advance the understanding of how web archiving can support and sustain participatory democracy, providing valuable insights into best practices for similar initiatives.