Abstracts

SESSIONS

SESSION#01: ARTIFICIAL INTELLIGENCE & MACHINE LEARNING

Re-imagining Large-Scale Search & Discovery for the Library of Congress’s .gov Holdings

Benjamin Lee
Library of Congress, United States of America

Longstanding efforts by the Library of Congress over the past two-plus decades have yielded enormously rich web archives. These web archives – especially the .gov holdings – represent an unparalleled opportunity to study the history of the past 25 years. However, scholars and the public alike face a persistent challenge of scale: the .gov domain archives contain billions of webpage snapshots yet offer limited affordances for searching and navigating them. Given the centrality of these archives to understanding the digital revolution and broader society in the 21st century, addressing this challenge of searchability at scale is all the more important.

I will present progress on an interdisciplinary machine learning research project to re-envision search and discovery for these .gov web archives within the Library of Congress’s holdings, with the goal of understanding the U.S. government’s evolving online presence. My talk will focus on two primary areas. First, I will detail my work to incorporate recent developments in human-AI interaction and interactive machine learning toward new search affordances beyond standard keyword search. In particular, I will discuss my in-progress work on multimodal, user-adaptable search, enabling end-users to interactively search not only over text but also over images and visual features, according to facets and concepts of interest. I will present a short demo of these affordances. Second, I will describe how such affordances can be utilized by end-users interested in studying the online presence of the United States government at scale. Here, I will build on existing work surrounding scholarly use of web archives to describe next steps for evaluation. I will also detail new collaborations with scholars in other disciplines to use these affordances to answer their research questions.
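To make the kind of affordance described above more concrete, the sketch below shows cross-modal (text-to-image) retrieval over images extracted from archived pages, using a CLIP-style model from the sentence-transformers library. It is an illustration only, not the project’s implementation; the model name and file paths are assumptions.

```python
# Illustrative sketch of multimodal retrieval over images extracted from archived
# pages; not the project's actual implementation. Model name and paths are assumed.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # joint text/image embedding space

image_paths = sorted(Path("snapshots").glob("*.png"))
image_embeddings = model.encode([Image.open(p) for p in image_paths],
                                convert_to_tensor=True)

def search(query: str, k: int = 5):
    """Return the k archived images most similar to a free-text query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, image_embeddings, top_k=k)[0]
    return [(image_paths[hit["corpus_id"]], hit["score"]) for hit in hits]

print(search("photographs of federal buildings"))
```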

Extending Classification Models with Bibliographic Metadata: Datasets and Results

Mark Phillips1, Cornelia Caragea2, Seo Yeon Park2, Praneeth Rikka1, Saran Pandi2
1University of North Texas, United States of America; 2University of Illinois Chicago, United States of America

The University of North Texas and the University of Illinois at Chicago have been working on a series of projects and experiments focused on the use of machine learning models to assist in the classification of high-value publications from the web. The ultimate goal of this work is to create methods for identifying publications that we can add to existing digital library infrastructures that support discovery, access, and further preservation of these resources.

During the first round of research, the team developed datasets to support this effort, including datasets that represent state documents from the texas.gov domain, scholarly publications from the unt.edu domain, and technical reports from the usda.gov domain. These datasets are manually labeled as either “in-scope” or “not in-scope” of the collection development plans for local collections of publications. Additionally, the research team has developed datasets containing positive-only samples to augment the labeled datasets and provide more potential training data.

In the second round of research, additional datasets were created to test new approaches for incorporating bibliographic metadata into model building. A dataset of publications from the state of Michigan and its michigan.gov domain was created and labeled with the “in-scope” and “not in-scope” labels. Next, several metadata-only datasets were created that would be used to test the applicability of leveraging existing bibliographic metadata in model building.

Finally, collections of unlabeled PDF content from these various web archives were generated to provide large collections of data that can be used to experiment with models that require larger amounts of data to work successfully.

This presentation will describe how various web archives were used to create these datasets. We will discuss the process for labeling the datasets, as well as the need for the additional positive-only datasets created for the project. We will present findings on the utility of existing bibliographic metadata for training document classification models, and report on the design and results of experiments that address how such metadata can assist in the automatic selection of high-value publications from web archives. This presentation will provide concrete examples of how web archives can be used to develop datasets that contribute to research projects spanning machine learning and information science. We hope that the processes used in this research project will be applicable to similar projects.
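As a rough illustration of how bibliographic metadata fields might be combined with extracted document text for this kind of in-scope/not-in-scope classification, here is a minimal sketch using scikit-learn. The field names and example records are assumptions, not the project’s actual datasets or models.

```python
# Sketch: combining document text with bibliographic metadata fields for binary
# "in-scope" classification. Field names and example records are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

records = pd.DataFrame([
    {"text": "Annual report on water quality in Texas rivers ...",
     "title": "Water Quality Report 2020", "publisher": "State agency", "label": 1},
    {"text": "Buy tickets online for this weekend's events ...",
     "title": "Ticket sales", "publisher": "Event vendor", "label": 0},
])

features = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=20000), "text"),
    ("title", TfidfVectorizer(), "title"),          # metadata field
    ("publisher", TfidfVectorizer(), "publisher"),  # metadata field
])

model = Pipeline([("features", features),
                  ("classifier", LogisticRegression(max_iter=1000))])
model.fit(records[["text", "title", "publisher"]], records["label"])
print(model.predict(records[["text", "title", "publisher"]]))  # 1 = in-scope
```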

Utilizing Large Language Models for Semantic Search and Summarization of International Television News Archives

Sawood Alam1, Mark Graham1, Roger Macdonald1, Kalev Leetaru2
1Internet Archive, United States of America; 2GDELT Project, United States of America

Among many different media types, the Internet Archive also preserves television news from various international TV channels in many different languages. The GDELT project leverages some Google Cloud services to transcribe and translate these archived TV news collections and makes them more accessible. However, the amount of transcribed and translated text produced daily can be overwhelming for human consumption in its raw form. In this work we leverage Large Language Models (LLMs) to summarize daily news and facilitate semantic search and question answering against the longitudinal index of the TV news archive.

The end-to-end pipeline of this process includes tasks of TV stream archiving, audio extraction, transcription, translation, chunking, vectorization, clustering, sampling, summarization, and representation. Translated transcripts are split into smaller chunks of about 30 seconds (a tunable parameter), on the assumption that this duration is neither so long that a chunk spans multiple concepts discussed on TV nor so short that it captures only part of one. These chunks are treated as independent documents for which vector representations are retrieved from a Generative Pre-trained Transformer (GPT) model. Generated vectors are clustered using algorithms like KNN or DBSCAN to identify pieces of transcripts throughout the day that are repetitions of similar concepts. The centroid of each cluster is selected as the representative sample for its topic. GPT models are leveraged to summarize each sample. We have crafted a prompt that instructs the GPT model to synthesize the most prominent headlines, their descriptions, various types of classifications, and keywords/entities from the provided transcripts.
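A compressed sketch of the chunk-vectorize-cluster-sample steps is shown below. The embedding model name, DBSCAN parameters, and use of the OpenAI client are assumptions for illustration; the open-sourced newsum repository is the authoritative implementation.

```python
# Sketch of chunking -> vectorization -> clustering -> sampling. Model names and
# parameters are illustrative; see the newsum repository for the real pipeline.
import numpy as np
from openai import OpenAI
from sklearn.cluster import DBSCAN

client = OpenAI()  # expects OPENAI_API_KEY in the environment

chunks = ["~30 seconds of translated transcript ...",
          "another ~30-second chunk of transcript ..."]

resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = np.array([item.embedding for item in resp.data])

labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(vectors)

# For each cluster, pick the chunk closest to the cluster mean as its representative.
representatives = {}
for label in set(labels) - {-1}:  # -1 marks DBSCAN noise
    members = np.where(labels == label)[0]
    centroid = vectors[members].mean(axis=0)
    closest = members[np.argmin(np.linalg.norm(vectors[members] - centroid, axis=1))]
    representatives[label] = chunks[closest]

# Each representative chunk is then passed to a GPT model with a summarization
# prompt to extract headlines, descriptions, classifications, and keywords.
```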

We classify clusters to identify whether they represent ads or local news that might not be of interest to an international audience. After excluding unnecessary clusters, the interactive summary of each headline is rendered in a web application. We also maintain metadata for each chunk (video IDs and timestamps) that we use in the representation to embed the corresponding small part of the archived video for reference.

Furthermore, valuable chunks of transcripts and associated metadata are stored in a vector database to facilitate semantic search and LLM-powered question answering. The vector database is queried with the search question to identify, by vector similarity, the stored transcript chunks most likely to help answer it. The returned documents are then passed to LLM APIs with suitable prompts to generate answers.
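The retrieval-then-answer step might look roughly like the sketch below, with a simple in-memory cosine-similarity search standing in for the vector database; the model names and prompt are illustrative assumptions.

```python
# Sketch of semantic search plus LLM question answering over stored chunk vectors.
# An in-memory search stands in for the vector database; model names and the
# prompt are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

chunks = ["transcript chunk about the election ...",
          "transcript chunk about the weather ..."]
chunk_vectors = embed(chunks)

def answer(question, k=3):
    query = embed([question])[0]
    sims = chunk_vectors @ query / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided TV news transcripts."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ])
    return completion.choices[0].message.content

print(answer("What were the main headlines about the election?"))
```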

We have deployed a test instance of our experiment and open-sourced our implementation (https://github.com/internetarchive/newsum).

MeshWARC: Exploring the Semantic Space of the Web Archive

Amr Sheta2, Mohab Yousry2, Youssef Eldakar1
1Bibliotheca Alexandrina, Egypt; 2Alexandria University, Egypt

The web is known as a network of webpages connected through hyperlinks, but to what degree are hyperlinks consistent with semantic relationships? Everyday web browsing experience shows that hyperlinks do not always follow semantics, as in the case of ads, so people mostly resort to search engines to navigate the web. This sparks the need for an alternative way of linking resources in the web archive, both for a better navigation experience and to enhance the search process in the future. We introduce meshWARC, a novel technique for constructing a network representation of web archives based on the semantic similarity between webpages. This method transforms the textual content of pages into vector embeddings using a multilingual sentence transformer and then constructs a graph based on a similarity measure between each pair of pages. The graph is further enriched with topic modeling to group pages of the same topic into clusters and assign a suitable title to each cluster.

The process begins with the elimination of irrelevant content from the WARC files and the filtering out of all non-HTML resources, for which we use the DBSCAN clustering algorithm: based on observations from experimenting with the data, we anticipate that pages with no actual textual content, e.g., “soft 404” pages and possibly homepages, will have closely related vector embeddings. A graph is then constructed by computing the cosine similarity between each pair of pages and connecting pairs whose similarity score exceeds a certain threshold.
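A minimal sketch of this graph-construction step is given below, using the sentence-transformers library and networkx; the model name and the similarity threshold are assumptions, not the values used in meshWARC.

```python
# Sketch of building a semantic-similarity graph over archived pages.
# Model name and threshold are assumptions, not meshWARC's actual parameters.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

pages = {  # URL -> extracted text (in practice, parsed out of WARC records)
    "https://example.org/a": "text extracted from page A ...",
    "https://example.org/b": "text extracted from page B ...",
    "https://example.org/c": "text extracted from page C ...",
}

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
urls = list(pages)
embeddings = model.encode(list(pages.values()), convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

THRESHOLD = 0.6  # connect pages whose similarity exceeds this value
graph = nx.Graph()
graph.add_nodes_from(urls)
for i in range(len(urls)):
    for j in range(i + 1, len(urls)):
        score = float(similarities[i][j])
        if score >= THRESHOLD:
            graph.add_edge(urls[i], urls[j], weight=score)
```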

For topic modeling, we developed an enhanced version of BERTopic, which incorporates our new clustering algorithm to generate clusters and identify the remaining noise from our previous technique. It also uses the attention values of each word in the document to highlight the most important words and reduce the document size. The resulting clusters are each labeled with a generated topic title, providing a comprehensive and semantically meaningful representation of the cluster.

To expand on this work in the future, a search engine can be created by representing the search text as a vector embedding and then comparing it to the centers of the clusters, which narrows down the search space. Pages can then be ranked by their similarity to the search terms and by how central each page is within its cluster. We are also interested in assessing how much the graph constructed from semantic similarity has in common with the hyperlink graph.
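As a sketch of how that proposed search step could work, the function below first matches a query embedding against cluster centroids and then ranks pages within the best-matching cluster; the data structures and ranking weights are assumptions.

```python
# Sketch of the proposed cluster-first search. Data structures and the 0.8/0.2
# ranking weights are assumptions for illustration.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vector, cluster_centroids, cluster_pages, page_vectors, top_k=10):
    """cluster_centroids: list of centroid vectors; cluster_pages: cluster id -> URLs;
    page_vectors: URL -> embedding vector."""
    best = int(np.argmax([cosine(query_vector, c) for c in cluster_centroids]))
    centroid = cluster_centroids[best]

    # Rank by similarity to the query, lightly boosted by centrality in the cluster.
    scored = [(url, 0.8 * cosine(query_vector, page_vectors[url])
                    + 0.2 * cosine(centroid, page_vectors[url]))
              for url in cluster_pages[best]]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```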

SESSION#02: UNIQUE CONTENT

80 Thousand Pages On Street Art: Exploring Techniques To Build Thematic Collections

Ricardo Basílio
Arquivo.pt, Portugal

Street art, especially graffiti, has a strong presence in the streets of Lisbon. The city authorities have a service dedicated to urban art and relations with artists, the GAU: Gabinete de Arte Urbana (Urban Art Office) (http://gau.cm-lisboa.pt). On its website, more than 500 artists have been registered with their creations since the 1990s. Every year, GAU organises a festival dedicated to street art, the MURO (Wall) Festival (https://www.festivalmuro.pt/).

In 2023, Arquivo.pt (https://arquivo.pt) began a collaboration with GAU with the aim of promoting Arquivo.pt, offering training to the street art community and creating a special collection of web content of their interest.

This presentation describes the process of creating the web collection about street art in Portugal, namely 1) the techniques used to identify thousands of URLs; 2) the type of recording; 3) the difficulties and the lessons learned.

First, we describe several techniques that have been tried. We highlight the technique that simply uses a search engine and a link extractor. This allowed us to obtain around 80 thousand unique URLs from searches for "graffiti" and artist names. Anyone can use this technique, even without being an IT expert, to quickly identify large amounts of web content about a given topic.

We also show how we used automatic search tools to add new URLs (seeds) to our collection. We experimented with the Bing Search API to obtain results from Bing and with SearXNG to get results from multiple search engines. For digital curators capable of running scripts, automatic search through APIs is a powerful technique to be explored.
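As an illustration of this kind of scripted seed discovery, the sketch below collects candidate URLs from a SearXNG instance's JSON API; the instance URL is a placeholder, and JSON output must be enabled in the instance's settings.

```python
# Sketch: collecting candidate seed URLs from a SearXNG instance.
# The instance URL is a placeholder; JSON output must be enabled on the instance.
import requests

SEARX_URL = "https://searx.example.org/search"

def collect_seeds(query, pages=3):
    seeds = set()
    for page in range(1, pages + 1):
        resp = requests.get(
            SEARX_URL,
            params={"q": query, "format": "json", "pageno": page},
            timeout=30,
        )
        resp.raise_for_status()
        for result in resp.json().get("results", []):
            seeds.add(result["url"])
    return seeds

seeds = collect_seeds('graffiti "arte urbana" Lisboa')
print(f"{len(seeds)} candidate seeds collected")
```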

Finally, we consider the identification of content about street art by experts, through the collaboration with the GAU team. In this case, the contents were chosen individually, recommended for their relevance and accompanied by additional information, ready to be included in collections such as the "Street Art" collection promoted by IIPC.

Second, we explain how we proceeded with the recording. To record more than 80 thousand seeds, we chose to limit recording to the page level (single page), aggregating the recording into batches of up to 10,000 URLs and treating some types of pages separately, such as social networks. We used Webrecorder’s Browsertrix Cloud service (beta) and Browsertrix-crawler. The generated WARC files were integrated into Arquivo.pt.

In the third part, we share the difficulties we found in crawling the list of selected URLs, explain how we broadened the criteria to include pages about street art from other countries, and why we tolerated some off-target pages.

In conclusion, the use of various techniques to speed up the identification of existing content on the Web about a given topic is useful for building collections. However, it raises questions that curators should debate. This is an ongoing project and in April 2024 we will publish the results of the interaction between the street art community in Portugal and Arquivo.pt.

Saving Ads: Assessing and Improving Web Archives’ Holdings of Online Advertisements

Christopher Rauch1, Mat Kelly1, Alexander Poole1, Michele C Weigle2, Michael L Nelson2, Travis Reid2
1Drexel University, United States of America; 2Old Dominion University, United States of America

Advertisements provide foundational source material for social, cultural, ethnographic, and commercial historical studies. The existence of research library collections for printed advertisements, such as Penn State University Libraries’ “Advertising: History and Archives” and The Smithsonian National Museum of American History’s Advertising collection, suggests the importance of preserving evidence of the mores and norms of a time and place represented by advertisements and their context. Online advertisements have similar—if not greater—cultural significance and impact. Just as physical ephemera in libraries, archives, and museums fuels compelling research, so too do online ads illuminate the contemporary objectives of advertisers, social norms, viewpoints, and ideals in ways that carefully curated news stories cannot.

As the scale and scope of data sources on the web have grown, the creators of web archival tools, such as Ilya Kreymer’s Webrecorder, have added methods to exclude content from capture for both practical reasons (the expense of storing the massive volume of data) and legal or social restrictions (the owners assert copyright, or the content is deemed not appropriate to archive). The maintainers of popular web archives, such as the Internet Archive, have employed these techniques to focus collection activities on certain categories of content, excluding most ad-related resources. Even when ad-related content is captured, it is often not possible to experience replay in context because of the dynamic components of rich media ads and other restrictions that discourage the rendering of ads outside of their originally intended advertising channels. Exclusive curatorial decisions come at a cost. Libraries and web archivists have a strong interest in being able to provide these types of resources for scholars, as scholars cannot study what has not been preserved.

In this presentation, we describe the current state of archival practice as it relates to the preservation of web-based advertising in context, highlight the large gap in the historical record created by the failure to archive such advertisements, and suggest approaches archivists can use to ensure that the temporally and culturally relevant information sources represented by web advertisements are preserved. Attendees will become familiar with the settings in popular tools that control the scope of archival crawling, learn techniques to broaden the scope of capture to include advertising materials at a reasonable cost in additional storage, and understand how already-archived data can be coaxed into replaying web advertisements that are not rendered by default.

Working Together to Capture, Preserve and Provide Access to Digital Artworks

Claire Newing1, Tom Storrar1, Patricia Falcao2, Sarah Haylett2, Jane Kennedy2
1The National Archives, United Kingdom; 2Tate, United Kingdom

It is well known that interactive, multimedia content is difficult to capture and present using traditional web archiving techniques. We will explain how two organisations with a long-standing relationship developed a new way of working together to improve the quality of complex content held in a web archive.

The first organisation is a specialist art institution, which is also a public body. The second is a national memory organisation which hosts a national web archive. The specialist art institution recognised that some interactive, multimedia content was missing from archived websites held in the national web archive. It undertook a project to capture the content using Webrecorder as part of a focused effort to preserve web artworks. Although it had experience of digital preservation, it did not have any experience of hosting web archives. The national memory organisation was capturing the art institution's websites regularly as part of large-scale crawls but did not have resources available to manually capture interactive artworks. It was agreed that it was most appropriate to provide access to the additional content through the national web archive, where it could be presented within the wider context of archived versions of the institution's website.

As it was the first time either institution had embarked on a WARC transfer project, all workflows and documentation were either built from scratch or by adapting resources developed for different purposes. This included: composing an agreement between the two organisations, creating a safe and secure workflow for transferring the WARC content, establishing a system for quality assurance checks and developing a methodology for cataloguing. It was a truly collaborative effort across the two organisations which had a good outcome for both.

We will discuss the process followed to achieve the successful transfer and the various legal, technical and practical considerations involved, from the perspective of both the transferring and receiving organisations. We hope to provide the audience with some knowledge of the factors they would need to consider when undertaking similar projects.

Put it Back! Archived Memes in Context

Valérie Schafer
University of Luxembourg

Memes constitute a significant aspect of online digital cultures (see Shifman, 2014; Milner, 2018; Denisova, 2016). Their role in reactions to events like the Ukrainian war highlights their broad influence. However, preserving memes, especially in their original context, proves challenging (Pailler & Schafer, 2022). Initiatives like SUCHO's Meme Wall and the Library of Congress's archiving of Know Your Meme and Meme Generator for its 2017 collection “Remix, Slang and Memes: A New Collection Documents Web Culture” reflect attempts at preserving these cultural artifacts. Furthermore, institutional web archives hold significant content related to memes and viral online phenomena such as the Harlem Shake, as exemplified by projects I led, Hivi (A history of online virality) and Buzz-F, in collaboration with the BnF Datalab. Nevertheless, the process of archiving memes remains underdeveloped, which also results in challenges related to retrievability, searchability, and contextualization.

This presentation delves first into the unique challenges posed by memes concerning contextualization. It explores their circulation across platforms, notably on social media networks, their diverse and ephemeral nature, and the cognitive context intricacies tied to content and usage. Memes can for example vary drastically in meaning based on where they appear – be it a tweet, a post within an activist Facebook group, or other contexts. Understanding these multilayered intricacies is vital for comprehending the broader significance of memes and for developing suitable archiving processes.

Following a brief presentation of the various contextual layers involved in meme analysis, we will delve into the extent to which meme archiving addresses – or could address – these diverse dimensions. It becomes crucial to identify elements that must be preserved to facilitate proper contextualization, as well as potential new approaches to these challenges. One proposal involves, for example, creating shared ontologies to enrich metadata. Additionally, the idea of specialized meme collections within archives could be explored.

This presentation draws upon the experience gained from two research projects, Hivi and Buzz-F, with the aim of initiating a discussion on context preservation and the challenges posed by ephemeral online phenomena such as memes. It offers insights relevant to the Research & Access theme, touching upon topics such as Unrealized business and technical requirements for web archives research; Creating & providing researchers with datasets from web archives collections; Collaborations between researchers and web archivists.

References:

Denisova, A. (2016). Political memes as tools of dissent and alternative digital activism in the Russian-language Twitter. PhD thesis, Westminster University.

Milner, R. (2018). The world made meme: Public conversations and participatory media. Cambridge, MA: The MIT Press.

Pailler, F., & Schafer, V. (2022). « Never gonna give you up ». Historiciser la viralité numérique. Revue d’histoire culturelle, 5. http://journals.openedition.org/rhc/3314

Shifman, L. (2014). Memes in digital culture. Cambridge, MA: The MIT Press.

SESSION#03: CONTEXTUAL

Averting the “Digital Dark Age”: The Digital Preservation Moment and the Birth of Modern Web Archiving, 1994-1996

Ian Milligan
University of Waterloo, Canada

The Call for Papers for WAC 2024 was announced on the 20th anniversary of the founding of the International Internet Preservation Consortium (IIPC), a conscious nod to the importance of history in understanding the context of contemporary web archiving. One cannot understand web archives without understanding not only the broader context of the time in which they were created, but also the broader historical context of why and how web archiving organizations developed and how particular approaches to web archiving were adopted. Yet histories of web archiving are often quick and somewhat rote: a nod towards the Internet Archive and a few other national libraries launching web archiving projects in 1996, perhaps a glance at the Wayback Machine in 2001, and occasionally the IIPC’s 2003 founding.

In my presentation, I discuss the origin moment of web archiving. Why did web archiving begin at the Internet Archive, the National Library of Canada, the Swedish National Library, the Dutch Royal Library, and the Australian National Library at roughly the same time? I argue that 1994-1996 witnessed a process that built a social and cultural consensus around web archiving.

My presentation argues that the 1990s transformed a debate which had been largely happening within records management and the archival profession throughout the 1960s and 1970s, and propelled it into public consciousness through the conscious framing of a “digital dark age.” This concept made concerns around digital obsolescence seem like a problem not just for the Fortune 500 and governments, but for all of society. Ideas spread outwards from academic venues and fora such as the 1994-1996 Task Force on Digital Archiving and research libraries to have broader social and cultural impact.

Between 1995 and 1998, a series of individuals – including science fiction author Bruce Sterling, Microsoft Chief Technology Officer Nathan Myhrvold, information scholar Margaret Hedstrom, technologist Brewster Kahle, documentarian Terry Sanders, and Long Now Foundation founder Stewart Brand – reshaped the cultural conversation to broaden digital preservation from an academic field to one understood as having wide-ranging implications. This would not just be the preservation of technical or corporate documents, but rather the collective digital memory of our society.

Why does this fit with this conference? Curators and users of web archives need to understand the broader historical context which gave rise to their programs, and this understanding can help them make the case for those programs today. When we speak of a “digital dark age” in 2023, we are drawing on an intellectual conversation stretching back 30 years. My goal is that practitioners and researchers, with a better understanding of their history, will be well positioned to explain the value of their programs today.

The Form Of Websites: Studying The Formal Development Of Websites, The Case Of Professional Danish Football Clubs 1996-2021

Niels Brügger
Aarhus University, Denmark

This presentation discusses how we can analyse the historical development of the form of websites based on the holdings of web archives. In other words: what can we say about a website's form, no matter which concrete words, images, graphics, videos and sounds it conveys?

Focus is on the following topics that can help us understand the formal characteristics of a website:

  • delimiting the website in space and time by striking a balance between having the version with the most web pages and keeping it temporally consistent
  • including the web pages that were not archived, but linked to
  • size (number of web pages), and structure (broad/deep website)
  • word-heaviness (number of words in total and on average per web page)
  • image-heaviness (number of images in total and on average per web page)
  • menu items (number of main-/submenu items, vertical/horizontal)
  • length of web pages

The formal characteristics of websites are important to study historically since they constitute the changing framings of the content and the use.

Inspiration: Theoretically, the presentation is informed by the monograph The Form of News (Barnhurst & Nerone, 2001), in which the authors analyse the various elements that have historically shaped how the news looks, and how these forms changed from 1750 to 2000, including analyses of the role of pictures, the front page, the overall format and design, etc. The presentation investigates what such an analysis would look like if the web is the object of study, keeping the specificities of the archived web in mind (Brügger, 2018). The presentation is also inspired by the seminal work of Byrne on the use of the archived web as a source to study football history (Byrne, 2020).

Context: The presentation takes its point of departure in the ongoing research project "Becoming professional: The web presence of Danish Superliga clubs 1996-2021". Part of this analysis is a detailed study of the formal development of the clubs' official websites.

Data: The study is based on the holdings of the national Danish web archive, Netarkivet. For the study's first years, Netarkivet's material consists of an ingest of relevant Danish websites acquired from the Internet Archive.

Methods: Quantitative methods, using the software Orange, and manual coding.

Contribution: The presentation contributes to the fields of researcher use of web archives and web history by demonstrating how web archives can be used to study the development of the web as such, as well as cultural phenomena such as the communicative use of the web by sports clubs. As a side effect, the presentation also indicates how one can study the completeness of what is archived.

References:

Barnhurst, K.G., & Nerone, J. (2001). The Form of News: A History. New York/London: The Guilford Press.

Brügger, N. (2018). The archived web: Doing history in the digital age. Cambridge, MA: MIT Press.

Byrne, H. (2020). Reviewing football history through the UK Web Archive. Soccer & Society, 21(4), 461-474. DOI: 10.1080/14660970.2020.1751474

Challenges of Putting Web Archives in a Comprehensive Context: the Case of Vdl.lu

Carmen Noguera
University of Luxembourg

This presentation aims to showcase the importance of context when analyzing the archived web, whether for research on a specific topic using web archives as a primary source, for a particular period (for example, the web of the 1990s), or for a diachronic study of a particular website.

The presentation will be based on my experience using web archives for my PhD research on Digital Cultures and their development in Luxembourg from the 1990s to the present day. One of my case studies aims to analyze the origins of e-government and e-citizens in the country. In order to do that, I mapped the communes that had a website in the 1990s and early 2000s, identified their main objectives, and analyzed how they evolved over the years towards greater participation of citizens.

Concretely, this presentation will focus on the analysis of the website of the Ville de Luxembourg (www.vdl.lu) and the additional context needed to conduct this research through web archives. Based on the challenges I faced during the diachronic analysis of this website using different web archives, such as the Internet Archive, Arquivo.pt and the National Library of Luxembourg web archives, I will discuss the needs, strategies, and requests in terms of contextualization, documentation, and metadata, as well as discoverability and incompleteness of data. I will also underline what would be needed in terms of interoperability between web archives.

The need for greater contextualization of the data and more documentation and descriptive metadata (Milligan, 2020; Webster, 2017; Brügger, Schafer, Geeraert, Isbergue & Chambers, 2020; Venlet et al., 2018; Vlassenroot et al., 2019, among others) is a recurrent request of web archive researchers. The case study aims to demonstrate the practical challenges of analyzing a website and the importance of contextualization and documentation, especially when facing information loss and working with several web archives.

References:

Milligan, I. (2020). You Shouldn't Need to Be a Web Historian to Use Web Archives: Lowering Barriers to Access Through Community and Infrastructure. Newark New Jersey USA: WARCnet. https://doi.org/10.1145/2910896.2910913

Webster, P. (2017). Users, technologies, organisations: Towards a cultural history of world web archiving. In N. Brügger (Ed.), Web 25: Histories from 25 years of the World Wide Web (pp. 175–190). New York: Peter Lang.

Brügger, N., Schafer, V., Geeraert, F., Isbergue, N., & Chambers, S. (2020). Exploring the 20-year evolution of a research community: Web-archives as essential sources for historical research. Bladen voor documentatie/Cahiers de la documentation, 2, 62-72. http://hdl.handle.net/10993/43903

Vlassenroot, E., Chambers, S., Di Pretoro, E., Geeraert, F., Haesendonck, G., Michel, A., & Mechant, P. (2019). Web archives as a data resource for digital scholars. International Journal of Digital Humanities, 1(1), 85–111. https://doi.org/10.1007/s42803-019-00007-7

Venlet, J., Farrell, K. S., Kim, T., O'Dell, A. J., & Dooley, J. (2018). Descriptive Metadata for Web Archiving: Literature Review of User Needs. https://doi.org/10.25333/C33P7Z

SESSION#04: DELIVERY & ACCESS

Renascer Project Brings Back Old Websites at Arquivo.pt

Ricardo Basílio, Daniel Gomes
Arquivo.pt, Portugal

Organisations often keep domains of websites that are no longer in use, either to prevent them from being bought by others or simply because they were forgotten. In this presentation we share how Arquivo.pt developed the Renascer project to recover the content of historical websites and make it widely accessible once again. The aim of the Renascer (Reborn) project is to bring back historical websites whose content is no longer available online but whose domain continues to be held by their owners.

“Forgotten” domains can cause cybersecurity problems. For example, in May 2023, the domain hmsportugal.pt of the Harvard Medical School-Portugal project referenced just one default web page hosted on an active server. The original content of the website was inaccessible, despite the fact that the domain was still owned by the website's author. Furthermore, since the domain was still pointing to an active web server, cybersecurity incidents could occur if this server was not being properly maintained. However, hmsportugal.pt could safely reference its historical content because it was preserved by Arquivo.pt (https://arquivo.pt/wayback/20131105224548/http://hmsportugal.pt/).

How are websites reborn? The domain owner just needs to redirect the domain to Arquivo.pt through the Memorial service (https://arquivo.pt/memorial). For example, the mctes.pt domain owned by the Portuguese Government resumed referencing its content, which is preserved by Arquivo.pt, thus making this website's content widely accessible online once again. Notice that all the links from other websites to the web pages of the reborn website are also reactivated.

During 2023, we demonstrated the potential of the Renascer project by applying it internally to the historical websites of our organisation, FCT (Foundation for Science and Technology). The Renascer project identified 11 active domains managed by FCT which were not referencing valid content, and then brought their historical content back to life. Building on this initial project, we are now encouraging other organisations to recover the content of their historical websites, and we have received positive feedback. Bringing historical websites back is appealing for organisations that want to commemorate anniversaries or easily keep track of past activities. More broadly, it is a simple yet powerful service that demonstrates the “real-world” utility of web archives in modern societies.

Preserving the Uncrawlable: Serving the Server

Andrew McDonnell
University of Kentucky, United States of America

Many of the challenges inherent in archiving dynamic web content revolve around the capture and playback of ever-evolving social media sites. However, older dynamic web content that has ceased to evolve continues to elude the tools and conventions most widely available to web archivists. This presentation will share work conducted to preserve an online digital humanities project that defied preservation via Archive-It and other crawler-based web archiving tools. The presentation will, moreover, offer an exploration of alternative options for web archivists struggling to preserve sites dependent on server-side processes for their essential functionality.

Suda On Line is a collaborative digital translation of a 10th-century encyclopedia. The interactive project was initially developed in 1998 and constructed as a website using CGI scripts, a bespoke database suite, and server rules that, taken as a whole, required user input via search fields and clicks to dynamically generate pages. With over 30,000 entries compiled by more than 200 scholars from 20 countries across five continents, the last entry in the Suda was translated on the site 16 years after it launched. As the primary collaborators responsible for the site retired from academic work, they found an entity willing to temporarily host their server, but that arrangement is winding down. It has no apparent long-term home willing to maintain and update the site in perpetuity. As a result, one of the project’s original authors, with the permission of his collaborators, asked their university archives to archive this collective work of scholarship, and the archives accepted the task.

The university’s archives discovered that tools such as Archive-It and Webrecorder were capable of mimicking the look of the Suda interface, but none of its core functionality. The site's dependence on users' interactions with a database via server processes meant that even though 70,000+ files were captured during crawls, the end result could not play back as intended. As a secondary means of preserving the work, the archives were granted a copy of the site’s full Ubuntu server image, packaged as two virtual machines within an OVF file. This presentation will unpack the various paths and cul-de-sacs explored in the process of preserving and providing access to the Suda On Line, including tools such as ReproZip-Web and Oracle VM VirtualBox. It will also reflect on the feasibility and merit of providing offline access to tools, research, and scholarship that in their active lives of creation and purpose were so fundamentally online. In the broader context, is digital preservation of the offline surrogate of this site (and others like it) fundamentally transforming it from tool to artifact in a manner that is irreversible?

Lost and Found in Cyberspace: Reconstructing MultiTorg

Jon Carlstedt Tønnessen
National Library of Norway

In May 1993, a group of computer scientists published the very first website on the Norwegian top-level domain: MultiTorg. Probably among the first hundred websites in the world, MultiTorg pioneered web publishing by offering instant news from the National News Agency. Visitors could also view recent satellite images from outer space, short clips from upcoming video games, and download Paul McCartney’s latest hit “Hope of Deliverance”. In its time, MultiTorg was regarded as ground-breaking, creating “a door from people’s homes into cyberspace."

Since MultiTorg was published at the very dawn of the web, long before anyone thought of archiving the web, it escaped archiving in a standardised way. However, one of the creators made a backup of the server in September 1993. This was recently acquired by the Norwegian Web Archive, and provides a valuable glimpse into the birth of the web. The crucial question is how these remains of the early web can be preserved and examined in a reliable manner.

The presentation will unfold in three parts:

Taking inspiration from the innovative approach of Dutch computer scientists to perform web archaeology, I will present a systematic method to excavate, register, refit, and reconstruct the remains of MultiTorg, based on the server backup.

Delving into MultiTorg’s content, I will present valuable insights into web media in its infancy. We will explore the visions of the early pioneers, the revolutionary simplicity of HTML markup, and understand how MultiTorg tried to conceptualise cyberspace as a “marketplace”.

Finally, I will address some unsolved questions, related to the preservability of old web content that has escaped harvesting. This includes the need for international cooperation in rescuing remains of the lost, pre-archived web that may still be found and preserved.

Towards Multi-Layered Access with Automatic Classification

Jon Carlstedt Tønnessen, Thomas Langvann
National Library of Norway

Providing access remains a challenge for most web archives. In Norway, legislative preparations have suggested a multi-layered model for access, based on the material's copyright status and the degree to which it contains personal data. However, information about these matters is rarely declared in the content we harvest, and never in a uniform or standardised way.

This paper will present an effort to automatically classify web archive records into different categories of access. The objective is to make as much of the material available to as many as possible, satisfying the needs for documentation and research, within the limits of copyright and data protection laws.

First, the different categories for access will be presented briefly:

  • Governmental sites (open access to anyone online)
  • Publications with a responsible editor-in-chief (accessible for anyone in University Libraries)
  • Content without a responsible editor-in-chief, or with personal data not regarded as widely known (accessible for researchers with a relevant purpose)

Second, we will share the process of establishing computational acceptance criteria for the different categories and a confidence score, necessary to perform automatic classification. And third, we will review our experiences with different approaches to automatic classification:

  • Linking data with external registers and authorities
  • Rule-based classification, based on regular expressions and link targets
  • Machine learning, using labelled datasets with certain extracted features to train a model, and then testing the model's ability to predict the category

The presentation will focus on the advantages and disadvantages of the different approaches. Furthermore, we will touch upon dilemmas related to precision level (domain, page or resource), maintenance over time, tracking domain ownership history, and classifying older-generation web content.
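To illustrate what a rule-based first pass over harvested URLs could look like, here is a minimal sketch using regular expressions and an assumed register of editorially controlled publishers; the patterns, domains, and category labels are illustrative, not the library's production rules.

```python
# Sketch of a rule-based first pass for access classification. Patterns, domains,
# and category labels are illustrative, not the library's production rules.
import re
from urllib.parse import urlparse

GOV_PATTERN = re.compile(r"(\.|^)(regjeringen\.no|stortinget\.no)$|\.kommune\.no$")
EDITORIAL_DOMAINS = {"nrk.no", "aftenposten.no"}  # e.g. drawn from an external register

def classify(url: str) -> str:
    host = urlparse(url).hostname or ""
    if GOV_PATTERN.search(host):
        return "governmental: open access to anyone online"
    if any(host == d or host.endswith("." + d) for d in EDITORIAL_DOMAINS):
        return "editorial: accessible in university libraries"
    return "other: researchers with a relevant purpose"

for url in ["https://www.regjeringen.no/no/aktuelt/",
            "https://www.nrk.no/nyheter/",
            "https://blogg.example.no/post/123"]:
    print(url, "->", classify(url))
```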

SESSION#05: COLLABORATIONS

LGBT+ and Religion: Queering Web Archive Research

Jesper Verhoef
Erasmus University Rotterdam, Netherlands

It is often believed that religion is inimical to LGBT+ people in general (e.g. Wilcox, 2006). However, some scholars have rightly questioned the alleged disconnect between religion and non-normative gender and sexuality (e.g. Klinken, 2019; O’Brien, 2014). In fact, religion plays a critical role in the lives of many queer individuals and ‘queer-positive religious movements and spaces have mushroomed globally during the last few decades’ (Jansen, 2022, p. 5; cf. Talvacchia et al., 2014).

Websites are a prime example of these (safe) spaces. However, extant research fails to do them justice. In my talk, I will show how web archives can be used to study queer religious websites. I will share results of the project Mapping the Dutch queer web sphere, which I recently conducted at the National Library (KB) of the Netherlands in conjunction with KB web archivists and a software engineer. As part of its special LGBT+ web collection, the KB has annually harvested sixteen websites catering to Christian queers and a couple geared towards LGBT+ Muslims. By means of hyperlink analyses, I analyzed the networks that these religious queer websites formed, and whether these changed over time (2009-2022; because of data availability, the focus will be on recent years).

I will demonstrate that there was a distinct queer religious web sphere, where religious websites predominantly referred to one another. I will interpret these findings – which I will also render as insightful Gephi visualizations – and briefly discuss intriguing results. For instance, the relation between Catholic and Protestant websites was strong, which suggests that the 'common cause' of being both Christian and LGBT+ trumps intra-religious differences. I will also go into inter-religious differences, i.e. between the links of queer Muslim and Christian websites.
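For readers unfamiliar with this kind of workflow, a minimal sketch of turning extracted outlinks into a directed graph and exporting it for Gephi is shown below; the link data and file name are invented for illustration and are not the project's dataset.

```python
# Sketch: building a hyperlink network from extracted outlinks and exporting it
# for Gephi. The link data and file name are invented for illustration.
import networkx as nx

# site -> outlinked sites, e.g. aggregated per harvest year from archived snapshots
outlinks = {
    "queer-faith.example.nl": ["lgbt-church.example.nl", "news.example.nl"],
    "lgbt-church.example.nl": ["queer-faith.example.nl"],
}

graph = nx.DiGraph()
for source, targets in outlinks.items():
    for target in targets:
        graph.add_edge(source, target)

nx.write_gexf(graph, "queer_religious_web_2022.gexf")  # open this file in Gephi
```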

Crucially, this talk should have an impact on all conference attendees, because it not only highlights – in a hands-on fashion – how collaboration between researchers and web archivists can further our understanding of key societal and historical questions, but also discusses the workflows we developed to create and analyze datasets. These should be useful to future researchers, ultimately leading to more researcher engagement with the invaluable but notoriously underused KB web archive - or any web archive, really.

Enter The Trading Zone: When Web Archivists And Researchers Meet To Explore Transnational Events In Archived Web Collections

Susan Aasman1, Anat Ben-David2, Niels Brügger3
1University of Groningen, Netherlands; 2Open University of Israel; 3Aarhus University, Denmark

The aim of this presentation is to discuss the ways archived web collections can be suitable for transnational studies of past and present web communication. By sharing lessons learned from the Web Archive Studies (WARCnet) Network it will become clear what the opportunities and constraints are for both web archivists and researchers.

From 2020 until 2023, WARCnet was an active network comprising web archiving practitioners and researchers, dedicated to researching web domains and events. The primary goal was to facilitate high-quality research about the history of both national and transnational web domains, as well as to open up ways of studying transnational events across web domains. During these years, the network actively encouraged interdisciplinary research among researchers and engaged with web archivists and IT developers as crucial partners in this emerging cultural heritage field.

From this community grew a tremendously rich output of papers, as well as the forthcoming edited volume The Routledge Companion to Transnational Web Archive Studies. The book covers research by scholars and web archivists on the challenges of analysing entire national web domains from a transnational perspective; reflections on the opportunities of archiving and researching transnational events such as the COVID-19 pandemic; the challenges of using digital methods in web archive studies; and assessments of the politics of web archives as collections.

During the presentation, we will share the main lessons learned from this research network and take our book's output as a starting point for discussing the added value of close collaborations between scholars, archivists, IT developers, and data scientists in this field. We have come to appreciate both the challenges and opportunities of developing what has been called a 'trading zone' (Collins et al., 2007), with interdisciplinary interactions that enable participants to have meaningful collaborations.

Collins, H., Evans, R., & Gorman, M. (2007). Trading zones and interactional expertise. Studies in History and Philosophy of Science Part A, 38(4).

Web Archiving, Open Access, & Multi-Custodialism

Monica Westin1, Jefferson Bailey2
1Internet Archive, United Kingdom; 2Internet Archive, United States of America

Starting in 2018, the Internet Archive pursued a large-scale project to build a complete collection of open-access scholarly outputs that are published on the web, the IA Scholar project (scholar.archive.org). Additionally, the project has worked to improve the discoverability of scholarly works archived through data processing and AI/ML work, partnership development, technical integrations, and related products that make this material available in myriad ways.

While the IA Scholar program originated to solve the challenges posed by a lack of preservation infrastructure for many scholarly publications and related outputs (data, code, etc.), especially from under-resourced, “long-tail” publishers, the project has also illuminated the benefits of multi-custodialism, i.e. that web archives have affordances that enable their preservation and accessibility at a variety of complementary stewardship organizations. The idea of “more copies in more places” is a core tenet of digital preservation; however, in practice this effort usually implies a geo-distributed replication of content across multiple technical locations administered by one institution or through cooperative agreements sharing technical infrastructure, and not necessarily long-term content stewardship by multiple independent, but collaborating, organizations.

This talk will describe how the web archiving work of the IA Scholar project, as well as related efforts on web data sharing at Internet Archive, have advanced a multi-custodial approach to ensuring both preservation and access to essential scholarly knowledge and how these archives can be shared and stewarded by a diverse set of organizations. Initiatives that will be discussed in this talk:

The IA Scholar Selfless Deposit Service: Using IAS and its underlying catalog, the IAS team has developed tools and techniques to automatically determine how many articles were published by authors affiliated with an institution, then compare these lists with the institution's holdings in its own institutional repository (IR). Content in IAS that is not in an institution's IR can then be shared with the institution, helping it fulfill many of its local mandates on data stewardship, open access, and digital preservation.
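The comparison step boils down to a set difference over identifiers; a minimal sketch (assuming both sides can be exported as DOI lists, which is an assumption rather than a description of the actual service) might look like this:

```python
# Sketch of the comparison step: which DOIs in an IA Scholar export for an
# institution's authors are missing from its institutional repository export?
# Assumes both lists are exported beforehand; no live APIs are called.

def missing_from_ir(ias_dois, ir_dois):
    """Return DOIs present in the IA Scholar export but absent from the IR export."""
    def normalize(doi):
        return doi.strip().lower().removeprefix("https://doi.org/")
    return {normalize(d) for d in ias_dois} - {normalize(d) for d in ir_dois}

ias_export = ["https://doi.org/10.1000/example.1", "10.1000/example.2"]
ir_export = ["10.1000/EXAMPLE.2"]
print(missing_from_ir(ias_export, ir_export))  # -> {'10.1000/example.1'}
```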

Preservation Equity Program: The Internet Archive's Archiving & Data Services group has run a variety of sponsored services, especially around web archiving, that ensure the preservation of at-risk content in multiple repositories and help ensure equal access to digital preservation tools for everyone, with a focus on empowering mission-driven organizations. Similarly, in collaboration with CLOCKSS, DOAJ, Keepers Registry and PKP, Internet Archive is a partner in Project JASPER, an initiative to preserve open access journals in multiple preservation systems in response to research showing that smaller online journals are highly ephemeral and lack long-term digital preservation resources.

Web Data Sharing: The same IA team has been working with a number of national libraries and other organizations to provide extractions of early web archive data to partners to help complete their collections and ensure the preservation of this data in multiple repositories.

This presentation will provide an overview of these projects to illustrate and advocate for the larger idea of web archives being uniquely suited for a multi-custodial approach to preservation.

SESSION#06: LEGAL & ETHICAL

Intellectual Property & Privacy Concerns of Web Harvesting in the EU

Anastasia Nefeli Vidaki
Vrije Universiteit Brussel, Belgium

With data being regarded as the “oil of the 21st century”, special attention on a global scale has been drawn to more technologically efficient, fast and reliable methods of accumulating it. At the same time, new disruptive technologies such as, but not limited to, Artificial Intelligence (AI) and machine learning have appeared and gradually come to dominate the data scene, assisting and facilitating even more data collection. Web harvesting, often built upon these technologies, plays a leading role in obtaining digital data.

Web harvesting tools utilise software, in many cases built upon AI, to methodically browse the internet and extract the data carrying the desired information. The process is rather easy to follow: a set of webpages is made available to the web harvester, which then fetches the further pages made accessible via this initial set. Through this procedure, parts of each webpage are stored, and its content, possibly along with the metadata in it, is downloaded and stored as well. The subsequent use of the data obtained can vary from archiving to data analytics and distribution to third parties.
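For readers less familiar with the mechanics, the sketch below shows a deliberately minimal harvester of the kind described: it starts from a seed set, stores fetched content, and follows discovered links. It is illustrative only; any real harvester must also respect robots.txt, rate limits, and the legal constraints this paper discusses.

```python
# Minimal, illustrative harvester: start from seeds, store fetched pages, follow links.
# A real harvester must also respect robots.txt, rate limits, and legal constraints.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def harvest(seeds, max_pages=50):
    queue, seen, store = deque(seeds), set(seeds), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        store[url] = response.text                     # page content is stored
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):     # pages reachable from the seed set
            target = urljoin(url, link["href"])
            if target.startswith("http") and target not in seen:
                seen.add(target)
                queue.append(target)
    return store

pages = harvest(["https://example.org/"])
print(len(pages), "pages harvested")
```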

The paper focuses on the legal issues to which the aforementioned re-use of data gives rise. The most pertinent, at least in the European Union (EU) sphere, are the constraints imposed by intellectual property and data protection legislation. On the one hand, the crawling of websites that include copyright-protected content and its extraction and reproduction without authorization violates EU copyright law and leads to the imposition of serious sanctions. However, the EU legislator has provided for some statutory exceptions, namely the one for private use, the one for temporary copies and, most recently, the one for Text and Data Mining (TDM) purposes. These will be explored theoretically and practically in order to determine whether they provide a solution to the problem or whether they demand deeper interpretation in light of the case law and continuous technological development. The same observations can be made regarding the sui generis database right.

On the other hand, as long as the crawled content might include personal data, the matters of privacy and data protection come into play. With its strict regulatory framework, the EU has sought to combat the unlawful processing of personal data. Therefore, data-intensive technologies like web scrapers should be designed and operated taking into account the principles prescribed in the General Data Protection Regulation (GDPR). Compliance with data minimization, storage limitation, anonymization and the lawfulness of processing might be required of the data controller and processor, along with complex organisational and technical measures. Questions also arise concerning the set of rules for data access and scrutiny imposed by the Digital Services Act (DSA), which has not yet completely entered into force.

Finally, bearing in mind the costs and burdens involved, the paper suggests a balancing between the obligations enshrined in EU law and the aspiration for a more open, technology-driven data policy powered by practices like web harvesting. There is a need for transnational and interdisciplinary debate and cooperation.

DSM to the Rescue? Implications of the new EU Copyright Directive for Social Media Archiving: the Case of the Belgian Transposition and the Cultural Heritage Archives in Flanders

Ellen Van Keer, Rony Vissers
meemoo, Belgium

Main topic: legal context
Keywords: social media archiving, copyright, reproduction, text- and data mining

Social media archiving presents various legal obstacles for cultural heritage institutions (CHIs). The aim of this contribution is to clarify how recent developments in EU copyright legislation can lower the barriers for social media archiving projects in the heritage sector.

Much content on social media is protected by copyright. Rightsholders hold exclusive rights over the use of their works, and users need to obtain their prior permission before using them. However, due to the large scale and wide diversity of social media content, it is not realistic for heritage institutions to get permission from all potentially involved rightsholders before engaging in social media archiving practices. Of course, this is not a completely new problem. Many items in heritage collections are protected by copyright, which generally lasts until 70 years after the author's death - this term is harmonised across the EU.

In order to keep a fair balance with competing fundamental rights such as the right to information and access to culture, copyright systems provide for a number of exceptions allowing certain uses in the public interest without the burden of prior authorization. CHIs fulfil public tasks and are an important category of beneficiaries. While copyright remains a national competence, exceptions have been harmonised at European level. A first milestone in this development was the so-called InfoSoc Directive of 2001 (1), which included a closed list of 20 facultative exceptions (Art. 5 InfoSoc). A significant update came with the so-called DSM Directive in 2019 (2), which has been transposed into (most) national law systems over the last few years. Two provisions in the DSM are particularly relevant here. First, an exception for the preservation of cultural heritage has become mandatory (Art. 6 DSM). It provides a legal solution for digitising and preserving digital cultural content. Secondly, a new exception for text and data mining has been introduced (Art. 3 DSM). This creates a legal framework for large-scale digital data collection and the application of AI.

This contribution will discuss the relevant provisions in more detail and clarify how they apply to social media archiving practices and projects in CHI’s, both in view of capture and preservation as well as access and valorization of social media content. As a case in point we will be looking at the Belgian transposition and its implications for the cultural heritage archives in Flanders, but the legislative framework and archival questions addressed bear relevance to the broader international web archiving community.

(1) https://eur-lex.europa.eu/legal-content/NL/TXT/?uri=CELEX%3A32001L0029

(2) https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L0790&qid=1695379375092

Digital Legal Deposit beyond the web

Vladimir Tybin
National Library of France

Under the law on copyright and related rights in the information society passed in France in 2006, digital legal deposit was introduced at the Bibliothèque nationale de France along with web legal deposit. At the time, the BnF became responsible for collecting all "signs, signals, writings, images, sounds and messages communicated to the public by electronic means in France". In reality, the library had begun archiving the web and building up collections of French websites long before, since our web archive collections have a historical depth dating back to 1996 and to date represent more than 48 billion URLs and 2 petabytes of data. In this way, the heritage mission of collecting everything disseminated on the French web in order to build up a national digital memory has gradually developed and been strengthened to the point where it is now an essential component of the BnF's historical legal deposit mission.

However, it soon became apparent that many digital objects distributed electronically were escaping the automatic harvesting carried out by our robots, either for technical reasons or simply because of the commercial barriers behind which they were hidden. This is the case for digital books and scores on the market; for digital periodicals, journals, magazines and newspapers; for digital maps and plans; for digital photographs distributed by agencies and authors and videos distributed on streaming and VOD platforms; for applications, software and e-games on the market; but also for all born-digital audio production, including music distributed on platforms. A solution had to be found to guarantee continuity from the physical to the digital world, and to avoid any disruption to the BnF's legal deposit collections and any heritage gaps.

We therefore decided to set up a new system for collecting all born-digital documents along the lines of the historical legal deposit system, under which publishers, producers, authors and distributors deposit their digital files along with their metadata to guarantee long-term preservation, recording in our general catalogue and access for consultation in the reading rooms. From digital books to digital sound, and including all other types of digital document, these channels for entry, cataloguing, preservation and access are gradually being put in place and form part of a major strategic challenge for the Bibliothèque nationale de France.

The aim of this presentation is to describe the various projects that have gone into setting up these systems: changes to the legal framework, technical development of specific tools and workflows, organisational work required to integrate these new processes and the implementation of a scientific policy for legal deposit.

SESSION#07: Communities I

Wikipedia Articles Related To Switzerland For Eternity

Barbara Signori
Swiss National Library

To ensure that the Wikipedia source of knowledge is securely preserved and accessible for future generations, the Swiss National Library – a federal memory institution with the legal mandate to collect, preserve and disseminate information with a bearing on Switzerland – is building a digital collection in which it compiles Wikipedia content related to Switzerland, archives it permanently, and makes it freely available online. The Wikipedia collection is listed in the library catalog Helveticat and can be searched and consulted in e-Helvetica Access, the entry portal to the Swiss National Library's digital collections (which also contain books, dissertations, journals, standards and websites). Wikimedia CH – the Swiss chapter of the Wikimedia Foundation, which promotes free knowledge in Switzerland – welcomes and supports this new digital collection. The two organizations have worked closely together to realize it.

The collection was officially launched in June 2023. The first overall crawl includes about 130,000 identified articles. Identifying the relevant articles with a reference to Switzerland is definitely the biggest challenge: what is considered to be related to Switzerland and what is not? Where to draw the line? We analysed whether AI could help us with this; however, the initial effort would have been very large and the quality of the result uncertain. That is why we opted for a simpler solution. The articles are identified with the help of the query tool PetScan and the categories “Switzerland”, including their subcategories. The URLs (the contents) are retrieved via script, downloaded, fed into the long-term archive system and processed. The individual articles are “frozen” and remain unchanged in the archive. The live Wikipedia article, on the other hand, can evolve at any time thanks to the collaboration of volunteer contributors. New and modified articles are planned to be added and archived in a regular cycle (at most once a year). Content is collected in the four national languages, i.e. from Wikipedia in German, French, Italian and Romansh. The collection of content in other languages is not excluded in the long term. However, the representative compilation does not claim to be exhaustive. In addition to the text, we include all images, graphics, audio and video files contained in the Wikipedia articles, the list of contributors, the applicable license terms, etc., including their metadata. We also have internal organizational procedures and measures in place to ensure that reports of potentially infringing content can be processed and removed quickly if necessary.
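
For illustration, a minimal sketch of the identification step is shown below. It walks the “Switzerland” category tree via the MediaWiki API for the German-language Wikipedia and builds article URLs; the library's actual workflow uses PetScan, so the category name, depth and endpoint here are assumptions for the example only.

```python
# Illustrative sketch: list articles in the "Switzerland" category tree (one level
# of subcategories) on the German-language Wikipedia and turn them into URLs.
import requests

API = "https://de.wikipedia.org/w/api.php"

def category_members(category, member_type):
    """Yield titles of pages or subcategories in a category via the MediaWiki API."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,   # "page" or "subcat"
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params["cmcontinue"] = data["continue"]["cmcontinue"]

def collect_article_urls(root="Kategorie:Schweiz", depth=1):
    """Collect article URLs from the root category and its subcategories."""
    categories, seen = [root], set()
    for _ in range(depth + 1):
        next_level = []
        for cat in categories:
            seen.update(category_members(cat, "page"))
            next_level.extend(category_members(cat, "subcat"))
        categories = next_level
    return ["https://de.wikipedia.org/wiki/" + t.replace(" ", "_") for t in sorted(seen)]

if __name__ == "__main__":
    urls = collect_article_urls()
    print(len(urls), "candidate article URLs")
```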

This contribution describes how we, together with representatives from the Wikipedia community, have built a new kind of web collection. The collection is not integrated into Web Archive Switzerland – it is an independent web collection, and the first that we are allowed to make freely available. Conference attendees can benefit from our experience should they plan similar projects.

A Conceptual Model of Decentralized Storage for Community-Based Archives

Zakiya Collier, Jon Voss, Bergis Jules
Shift Collective, United States of America

Community-based archives hold some of the most valuable materials documenting the lives of marginalized people and they mostly exist independently of other traditional academic or government-run cultural heritage institutions. But while these archives continue to collect and preserve these histories, many of them face difficulties growing their operations, keeping their doors open, and enhancing their programming and collections activities because of a lack of funding opportunities and other ongoing resources common in larger institutions.

Beginning in 2023, the presenters began a three-year collaborative research and development project to develop use cases for a decentralized storage solution, and an ethical framework for engaging in this quickly developing technology.

With an eight-person research faculty and technical advisors, the first year of the project was dedicated to exploring the ethical, cultural, and technical needs of small cultural memory organizations that steward the most important historical collections for diverse communities around the world—including but not limited to precarious and hyperlocalized web archived content. In the first year, we focused on exploring affordable and sustainable digital storage and how a decentralized storage network might address particular stated needs for the tens of thousands of these organizations.

In the second year, the team is focused on designing a conceptual model for community-centered, non-extractive, affordable, and accessible long-term storage, using the Historypin platform as the front-end interface. The model suggests ways for larger, better-resourced cultural memory institutions to provide low to no-cost digital storage solutions for small organizations without extracting or removing digital assets, transferring ownership, dictating access terms, or even requiring access to community-owned collections at all.

This presentation will share our research to date and highlight the broader implications for the field.

Recommendations For Content Advisories In The UK Web Archive

Nicola Bingham
British Library, United Kingdom

The UK Web Archive plays a vital role in preserving digital content for societal benefit. However, it faces the challenge of addressing distressing and harmful material encountered by users. The need for content warnings was initially raised in March 2020, leading to ongoing discussions and assessments.

This conference paper outlines the strategic approach, ratified in September 2023, to address users' encounters with harmful or distressing content within the UK Web Archive. The ‘Content Advisory’ project’s primary goals were to mitigate the risks associated with offensive and controversial material in the archive while enhancing the overall user experience and maintaining appropriate content presentation. Rather than censorship, the focus is on contextualization.

The Content Advisory Group's recommendations focus on the use of content warnings and advisories, understanding associated risks, reviewing best practices from other web archives, and engaging with related projects.

The conference presentation will focus on the following areas:

1) Access Routes and Content Advisory Levels.

The UK Web Archive offers diverse access routes, making it challenging to ensure uniform content warnings. The majority of content comes from automated domain crawls, making precise warnings difficult. The presentation will outline the group’s exploration of various advisory placement options, including MARC records, Terms & Conditions pages, the UK Wayback Banner, calendar views, individual website advisories, and collection-level warnings.

2) Recommendations: Next Steps

Given limited resources, the Content Advisory paper recommended a comprehensive and continuous review of collection descriptions, the creation of collection scoping documents, a focus on Equality, Diversity, and Inclusion (EDI) goals, improved user education, and the introduction of a high-level content warning on the home page. It also suggested monitoring automated language-checking developments. This conference presentation will report on the results of a pilot project to review and revise a smaller number of our collection descriptions and will outline how this was implemented.

3) Related Groups and Projects

The Content Advisory Group drew insights from other relevant projects and institutions, emphasizing collaboration and alignment with institutional policies. The conference presentation will outline the work of related groups, projects and roles at the [host institution] including the Content Guidance Steering Group, Race Equality Action Plan, Collections Access Review and Approval Group (CARAG), new Black Studies Lead Curator, Metadata Lead for Equity and Inclusion, and the Legal Deposit User Forum.

Contribution and impact of paper:

This conference paper outlines the strategy to handle harmful content in the UK Web Archive. It presents practical recommendations and highlights the importance of collaboration with related projects and groups. The approach prioritizes user experience and contextualization, providing a valuable framework for managing challenging content in digital archives. The paper's impact lies in its potential to influence policy and practice within the web archive community, ensuring a more inclusive and user-friendly web archive experience.

Archiving The Black Web: Research & Access In Context - An Update

Zakiya Collier1, Bergis Jules2, Makiba Foster3
1Shift Collective, United States of America; 2College of Wooster, United States of America; 3Archiving the Black Web, United States of America

Archiving the Black Web (ATBW) WARC School is a web archiving training program for memory workers affiliated with Black collecting community archives, public libraries, and HBCUs (Historically Black Colleges and Universities). The WARC School is set to launch in fall 2024 with the intention of educating and training current LIS professionals and community-based memory workers to not only become builders but also future users of web archives that focus on Black culture. Placing our training program within the constellation of efforts to understand what exactly is “The Black Web,” we, as web archivists and memory workers, are collaborating with humanists and social scientists in researching and mapping The Black Web from experience and theory to web archive practice. In this talk, ATBW will share information about our strategy to develop web archiving training curriculum within the context of building web archive collections that center Black digital content. In our discussion of the research project, we will share how the scholarship contextualizes Black people and their experiences within the larger world wide web and thereby informs our work in training people to create the web archives related to Black history and culture.

SESSION#08: Tools

The BnL’s Migration From OpenWayback To A Hybrid PyWb-SolrWayback Engine Powered By S3 Storage

László Tóth
National Library of Luxembourg

During 2023, the National Library of Luxembourg undertook the task of migrating its existing web archives, totaling around 300 TB of compressed WARC files served by OpenWayback using a static CDX index, to a high-performance hybrid infrastructure consisting of PyWb and SolrWayback using an OutbackCDX index server and state-of-the-art S3 object storage. The goal of this migration was threefold: to modernize the BnL's offer to users in terms of accessibility, search speed and overall end-user experience; to use the latest web archiving tools and workflows available to date; and to provide a highly efficient and responsive storage solution for our web archives. In this way we improved the three main pillars of our web archives: user experience, software and hardware. Our final solution sits atop four high-performance servers, two of which were initially used for indexing our collections. These machines, having more than 2.5 TB of RAM and 192 cores in total, are geographically distributed, host the Solr cluster, SolrWayback, PyWb and OutbackCDX applications, and are connected to our governmental S3 object storage network. In order to retrieve data efficiently from this storage, we developed custom modules for SolrWayback and PyWb that can stream WARC records directly from S3 storage starting at any given offset and up to any given length, without any additional I/O delay related to stream skipping. Playback is therefore as fast as it would be if the archives were read from local storage. These features, though not available by default within the aforementioned applications, have been developed in an open-source spirit and are made freely available online for anyone to use.
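
As an illustration of this kind of range-based retrieval (not the BnL's actual module), the following sketch resolves an index lookup result (WARC file, offset, length) into a single record streamed directly from S3; the bucket, key, offset and length values are placeholders.

```python
# Read one (gzipped) WARC record from S3 by byte range, as a CDX/OutbackCDX lookup
# result would allow, without downloading the whole WARC file.
import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client("s3")

def fetch_warc_record(bucket, key, offset, length):
    """Stream a single WARC record from S3 using an HTTP Range request."""
    byte_range = f"bytes={offset}-{offset + length - 1}"
    response = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    # The StreamingBody is file-like, so warcio can parse the record in place.
    for record in ArchiveIterator(response["Body"]):
        return record  # the range covers exactly one record

record = fetch_warc_record("web-archive-bucket", "crawls/example.warc.gz",
                           offset=123456, length=7890)
print(record.rec_headers.get_header("WARC-Target-URI"))
```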

WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI

Matteo Cargnelutti, Kristi Mukk, Clare Stanton
Library Innovation Lab, United States of America

This year the team at the Library Innovation Lab (home of Perma.cc) has been exploring how artificial intelligence changes our relationship to knowledge. Partially inspired by colleagues at last year’s IIPC conference, one of our initial questions was: “Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?”

That question led us to develop and release WARC-GPT: an experimental open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI.

WARC-GPT functions as a highly-customizable boilerplate the web archiving community can use to explore the intersection between web archiving and AI. Specifically, WARC-GPT is a RAG pipeline, which allows for the creation of a knowledge base out of a set of WARC files, which is later used to help answer questions asked to a Large Language Model (LLM) of the user’s choosing.
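
As a rough illustration of the RAG approach (not WARC-GPT's actual implementation), the sketch below extracts text from WARC response records, embeds it, and retrieves the chunks closest to a question in order to ground a prompt; the embedding model name and the final LLM call are assumptions.

```python
# Minimal RAG-style sketch over a WARC file: extract text, embed it, retrieve
# the most similar chunks, and build a grounded prompt for an LLM.
import numpy as np
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def extract_chunks(warc_path, chunk_size=1000):
    """Yield plain-text chunks from HTML response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "html" not in ctype:
                continue
            text = BeautifulSoup(record.content_stream().read(), "html.parser").get_text(" ", strip=True)
            for i in range(0, len(text), chunk_size):
                yield text[i:i + chunk_size]

# Build the knowledge base: embed every chunk once.
chunks = list(extract_chunks("collection.warc.gz"))
embeddings = model.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=3):
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = model.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(embeddings @ q)[::-1][:k]
    return [chunks[i] for i in top]

question = "What does this collection say about climate policy?"
context = "\n---\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to a Large Language Model of the user's choosing.
```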

What would it mean for your team's process if it could interact with a chatbot that had insight into what you've captured from the web? Would a chatbot with knowledge of your collection be of use for description work? While still an experimental tool, WARC-GPT is a step towards understanding questions like these. Our team will share our experience so far testing things out, the decisions we've made around tools, and how other organizations can do the same.

The Perma team believes the expansion of our project beyond the technology and service we’ve offered for many years is of interest to the IIPC community. All of our work is still rooted in our original service, focusing on authenticity, fidelity, and provenance - but built to be more expansive.

R You Validating These WARCs? Automating Our Validation- And Policy-Checking Processes With R

Lotte Wijsman, Jacob Takema
National Archives of the Netherlands

As the National Archives of the Netherlands, we receive web archives from one or more producers rather than harvesting them ourselves. As a result, the quality of the web archives can differ. To ensure consistent quality and the long-term preservation of the web archives, we must ask whether an archive is complete, technically sound, and conforms to our guidelines. Since 2021, national government agencies must comply with the guideline on archiving governmental websites (2018) when harvesting websites. Among other requirements, the websites should be harvested daily, and the web archives should conform to the ISO 28500 (WARC) standard, contain full and incremental harvests, and have a maximum size of 1 GB. To ensure conformance to the WARC standard, we validate the WARC files with tools such as JHOVE.

Previously, we have presented our work on validation and the web archiving guideline at the WAC. In 2023, we presented a poster on WARC validation that also covered our future ambitions, because there is still much to improve. The output of WARC validation tools is not always easy to work with, especially when dealing with many files, and the tools do not always check everything we want checked (such as the size of the WARC). Furthermore, we consider conformance checking to be the next step beyond validation. This is why we went looking for an approach that not only fits all of our needs for validation and conformance checking, but also allows us to automate these processes.

To accomplish this work, we have started to use R, a programming language suited for data analysis. Using R, we have worked on building an automated conformance checker, which not only validates mandatory properties of the WARC standard, but also optional yet (for us) important properties (e.g. payload digests and file size), and important aspects for long-term preservation (e.g. embedded file formats). Furthermore, we have implemented an automated conformance check to see if web archives conform to our guideline for archiving Dutch governmental websites (e.g. harvest frequency, full- and incremental harvests).
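
The authors implement their checker in R; purely as a language-neutral illustration of the kinds of checks involved, the following Python sketch flags missing mandatory WARC fields, response records without payload digests, and files exceeding the 1 GB limit. The exact rule set shown is an assumption for the example, not the National Archives' specification.

```python
# Illustrative conformance checks on a WARC file: mandatory header fields,
# presence of payload digests, and the 1 GB file-size limit from the guideline.
import os
from warcio.archiveiterator import ArchiveIterator

MANDATORY = ["WARC-Record-ID", "WARC-Date", "WARC-Type", "Content-Length"]
MAX_SIZE = 1 * 1024 ** 3  # 1 GB

def check_warc(path):
    issues = []
    if os.path.getsize(path) > MAX_SIZE:
        issues.append(f"{path}: exceeds 1 GB size limit")
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            rid = record.rec_headers.get_header("WARC-Record-ID")
            for field in MANDATORY:
                if not record.rec_headers.get_header(field):
                    issues.append(f"{rid}: missing mandatory field {field}")
            if record.rec_type == "response" and not record.rec_headers.get_header("WARC-Payload-Digest"):
                issues.append(f"{rid}: response record without payload digest")
    return issues

for problem in check_warc("harvest-2024-01-01.warc.gz"):
    print(problem)
```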

In our presentation, we will first provide the attendees with some background information on our previous work on the web archiving guideline and WARC-validation. Subsequently, we will share how we came to automated conformance checking and we will give a demo using R to show our prototype conformance checker.

Machine-Assisted Quality Assurance (QA) in Browsertrix Cloud

Tessa Walsh, Ilya Kreymer
Webrecorder, Canada

Manual and software-assisted workflows for quality assurance (QA) of crawls are an important and underdeveloped area of work for web archives (Reyes Ayala, 2014; Taylor, 2016). Presentations at previous Web Archiving Conferences such as last year’s talks by The National Archives (UK) and Library of Congress have focused on institutions’ internal practices and tools to facilitate understanding and assuring the quality of captures of websites and social media (Feissali & Bickford, 2023; Bicho, et al, 2023). A similar conversation was facilitated by the Digital Preservation Coalition in their 2023 online event “Web Archiving: How to Achieve Effective Quality Assurance.” These presentations and discussions show that there is a great deal of interest in and perceived need for tools to assist with performing quality assurance of crawls in the web archiving community.

This talk will discuss machine-assisted quality assurance features built in the open source Browsertrix Cloud project that have the potential to help a wide range of web archivists across different institutions gain an understanding of the quality and content of their crawls. We will discuss the goals of automated QA as an iterative process, first, to help users understand which pages may require user intervention, and then, how those might be fixed automatically. The talk will outline assisted QA features as they exist in Browsertrix Cloud at the time of the presentation, such as indicators of capture completeness, whether failed page captures resulted from issues with the website or crawler, and how/if they could be automatically fixed. The talk will provide examples of the types of issues in crawling that may be discovered and how they are surfaced to the user for possible intervention, discuss lessons learned in collecting user stories for and implementing the QA features, and point to possible next steps for further improvement.

As the presentation will discuss the assistive possibilities of software in aiding traditionally manual processes in web archiving, lessons learned are likely to apply widely to all such assistive uses of technology, including other conference themes such as the use of artificial intelligence and machine learning technologies in web archives.

References:

Bicho, Grace; Lyon, Meghan & Lehman, Amanda. The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress, presentation, May 11, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2143888/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting International Internet Preservation Consortium.

Feissali, Kourosh & Bickford, Jake. Open Auto QA at UK Government Web Archive, presentation, May 11, 2023; (https://digital.library.unt.edu/ark:/67531/metadc2143893/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting International Internet Preservation Consortium.

Reyes Ayala, Brenda; Phillips, Mark Edward & Ko, Lauren. Current Quality Assurance Practices in Web Archiving, paper, August 19, 2014; (https://digital.library.unt.edu/ark:/67531/metadc333026/: accessed September 21, 2023), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu.

Taylor, Nicholas. Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability, presentation, August 4, 2016; (https://nullhandle.org/pdf/2016-08-04_rethinking_web_archiving_quality_assurance.pdf: accessed September 21, 2023).

SESSION#09: Communities II

Building and Promoting A National Web Archive Through Regional Cooperation: A French Experience

Anaïs Crinière-Boizet1, Elise Girold2
1National Library of France; 2National University Library of Strasbourg, France

For the past 20 years, as part of the cooperation with associated regional centres, the Bibliothèque nationale de France (BnF) has been able to call on a network of external contributors throughout France to complement national collections with regional and local websites. This collaboration began with the 2004 regional elections and has been strengthened since then.

This network of 26 printing legal deposit libraries is called upon for every electoral collection with a local dimension (municipal, departmental, regional and legislative elections). These libraries also have the opportunity, thanks to an organisation of regional collections and a collaborative curation tool, to select websites related to the political, economic, cultural and social life of their region. They were able to offer local selections during the Covid-19 pandemic. This organisation allows them to participate in the creation of tomorrow's heritage as part of the shared challenge of preserving regional heritage online. In return, they can promote local content related to their region through « guided tours » in the BnF search application « Archives de l'internet ».

In addition, since 2014, French law has authorised remote access to the BnF's web archives in these partner libraries. This remote access is currently available in 21 of these libraries and allows them to highlight web archives to the general public as well as to students and researchers during events such as the European Heritage Days or workshops and training courses.

This two-voice presentation will include feedback from one of these libraries, the Bibliothèque nationale et universitaire de Strasbourg (Bnu). The Bnu has been involved in the electoral crawls since 2007, has been in charge of a regional collection for the Alsace region since 2013, and has developed many services to support the work of researchers on web archives.

Web Archiving for Preserving the Digital Memory of Brazil

Jonas Ferrigolo Melo1, Moises Rockembach2
1University of Porto, Portugal; 2University of Coimbra, Portugal

The evolution of the web presents challenges in preserving Brazil's internet history and protecting relevant digital content. This study underscores the importance of web archiving for preserving this history, enabling access to historical information, and preventing data loss. The "arquivo.ong.br" project, an essential initiative in this context, exemplifies the significance of these efforts. The growth of digital documents, including text, databases, images, audio, video, and websites, has been recognized since the early 2000s.

In 2010, the National Archives of Brazil initiated the AN Digital program to preserve and provide access to digital archival documents, emphasizing the importance of preserving digital content. In 2021, the National Archives Council (CONARQ) established the Technical Advisory Chamber (CTC) for the Preservation of Websites and Social Media, aiming to preserve dynamic digital documents. Additionally, legislation seeks to criminalize the misappropriation of digital assets from official websites. Early research by Dantas (2014) highlighted the absence of web page collections in Brazilian institutions, leading to the creation of the Buscas.br collection.

Projects like "arquivo.ong.br" are critical for preserving Brazil's digital memory, highlighting the ongoing need for their development and enhancement. Other recent projects like ARQWEB and Graúna also focus on preserving web content. The Arqweb project, initiated in 2022 by the Brazilian Institute of Information in Science and Technology (IBICT), aims to archive partner institution websites and government sites. Similarly, the Graúna Project, developed by the NUPEF Institute, focuses on preserving the memory of the internet by archiving websites related to human rights, the environment, culture, and health.

The "arquivo.ong.br" project, affiliated with the Federal University of Rio Grande do Sul, serves as a important resource for preserving web archives, including presidential election websites and COVID-related content. It has accumulated around 50GB of diverse digital data. The project uses tools like Heritrix for automated web data collection and Pywb for user-friendly access and exploration of preserved files. Pywb allows users to search, browse past website versions, and interactively explore the archive, making it an essential source for researchers, academics, and anyone interested in Brazil's web history.

Digital Diaspora: Preserving the Jewish Internet

Hana Cooper
National Library of Israel

Since its establishment in 1892, the National Library of Israel has been dedicated to the collection and preservation of the historical, cultural and intellectual story of Israel and the Jewish people. This mission has undergone various transformations over more than a century, in tandem with the changes that the Jewish people and the State of Israel have experienced. The advent of the internet as an information store has expanded our understanding of community, nationality and culture at large. In particular, the diasporic quality of Jewish identity is reflected well in the wide-ranging and early use of online spaces for the benefit of individuals and communities.

Over the past decades, the internet has enabled Jews in Israel and the diaspora to preserve and share their cultural heritage, practices, and stories. It is a diverse and dynamic space where stories, memories, values and identities are shared and preserved across languages and borders. From personal blogs and social media accounts to digital archives and educational resources, the Jewish Internet showcases the richness and complexity of a heritage that spans centuries and continents. Since the early 2000s, the National Library of Israel has recognized that the preservation of contemporary Jewish identity hinges on the preservation of this digital heritage. Consequently, a large assortment of methods and tools has been implemented over the last decades, which we are still refining and restructuring to meet the ever-changing digital landscape.

Alongside our in-house efforts, we are actively exploring ways to foster collaborative partnerships with other cultural institutions, academic researchers, and Jewish communal organizations. Some of these collaborations will engage in curatorial work (by region/language and by topic), while others will address legal and ethical concerns. At the IIPC conference we would like to present the progress made so far by the National Library of Israel and raise a number of strategic, ethical and practical dilemmas that we face going forward.

SESSION#10: Digital Preservation

The Potentials and Challenges for Researchers and Web Archives Using the Persistent Web IDentifier (PWID)

Caroline Nyvang, Eld Zierau
Royal Danish Library

In order for researchers to live up to good research practice, we need to be able to make persistent references to content in web archives. In some cases, different types of Persistent Identifiers can be used. However, for references to web archive pages or elements, which need to be resolvable for more than 50 years, the Persistent Web IDentifier (PWID) is often the best choice. Many referencing guidelines and standards recommend that references to web archives be made via an archived URL. This is a challenge not only for closed web archives, but also for web archives that change the addresses of their web archive data. For instance, this happened when the Irish web archive migrated its holdings from an Internet Memory Foundation (IMF) platform (http://collection.europarchive.org/nli/) to an Archive-It web archive service (https://archive-it.org/home/nli). It will also be the case for web archives changing archive URLs due to changes related to the Wayback Machine.

The PWID resolves many of the known issues with common identifiers, as it is based on basic web archive metadata: the web archive, the archival time of the web element, the archived URL of the web element, and the precision or inherited interpretation of the PWID, such as page or part/file. Thus, once the web archive is identified, the archival time and archived URL can be used to find the resource, since this metadata is present in the WARC files. Finally, the information about the interpretation/precision of the resource can be used to choose the manifestation of the page and the mode of access to the resource. This means that resolving a PWID does not rely on a separate registry of the contents of a web archive (which can be huge), since the WARC metadata can be indexed (e.g. in CDX or SOLR) and this index can support the resolving. Furthermore, the design of the PWID has been based on bridge-building between digital humanities researchers, web archivists, persistent identifier experts, Internet experts and others, in order to meet the requirements of being human readable, persistent, technology agnostic, global, algorithmically resolvable and accepted as a URN.
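
For illustration, the sketch below constructs and parses a PWID URN from the four metadata elements just described, assuming the urn:pwid:&lt;archive&gt;:&lt;time&gt;:&lt;precision&gt;:&lt;uri&gt; syntax of the PWID URN proposal; the example values are placeholders.

```python
# Construct and parse a PWID URN from its four metadata elements.
import re
from dataclasses import dataclass

PWID_RE = re.compile(
    r"^urn:pwid:(?P<archive>[^:]+):(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision>[^:]+):(?P<uri>.+)$"
)

@dataclass
class PWID:
    archive: str     # the web archive's domain
    timestamp: str   # archival time in UTC
    precision: str   # interpretation, e.g. "page" or "part"
    uri: str         # the archived URL

    def urn(self):
        return f"urn:pwid:{self.archive}:{self.timestamp}:{self.precision}:{self.uri}"

    @classmethod
    def parse(cls, urn):
        match = PWID_RE.match(urn)
        return cls(match["archive"], match["time"], match["precision"], match["uri"])

pwid = PWID("archive.org", "2016-01-22T10:20:29Z", "page", "http://example.com/")
print(pwid.urn())
# urn:pwid:archive.org:2016-01-22T10:20:29Z:page:http://example.com/
```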

Using the PWID, researchers gain a way to address web elements persistently and sustainably. Web archives can benefit from the PWID too, both in implementing support for researchers and in creating the web archive when there are several manifestations of a web page. For example, the British Library web archive uses the PWID when archiving snapshots of web pages. Furthermore, since a PWID URN is a URI, it can be used as a URI identifier, as is required e.g. for WARC identifiers. The PWID can become even more useful for researchers when it is incorporated in reference tools such as Zotero.

The presenters will discuss their different perspectives, as a researcher within the humanities and as a computer scientist and web archivist. The presentation will cover challenges and experiences from each perspective, as well as future potentials in supporting and expanding the PWID URN definition.

Arquivo.pt CitationSaver: Preserving Citations for Online Documents

Pedro Gomes, Daniel Gomes
Arquivo.pt, Portugal

Scientific documents, whether books or articles, reference web addresses (URLs) to cite documents published online. In the case of scientific articles, the importance of these citations is even greater for maintaining the integrity of an investigation, because they often reference information that is fundamental to reproducing an experiment or analysis. For example, links in a scientific article can cite datasets, software or web news that supported the research and which are not included in the text of the scientific article.

However, documents published online disappear very quickly. This means that the citations contained in a scientific document, which are fundamental to guaranteeing its scientific validity, become invalid.

Arquivo.pt is a research infrastructure that provides tools to preserve and exploit data from the web to meet the needs of scientists and ordinary citizens; our mission is to provide digital infrastructures to support the academic and scientific community. However, until now, Arquivo.pt has focused on collecting data from websites hosted under the .PT domain, which is not enough to guarantee the preservation of relevant content for the academic and scientific community.

In response to the need to preserve the integrity of scientific documents and other documents that cite documents published online, Arquivo.pt has created a new project called CitationSaver. CitationSaver, available at arquivo.pt/citationsaver, automatically extracts the links cited in a document and preserves their content (e.g. web pages cited in a book) so that they can be retrieved later from Arquivo.pt.

This presentation will detail the context that led to the need to create the CitationSaver service and how it works. The CitationSaver service allows users to help select and immediately preserve relevant information published online before it is altered.

In addition, we will demonstrate how, using the APIs provided by the Open Science ecosystem, we can automatically identify scientific documents and data published online to be preserved. For instance, we used the API from RCAAP (Open Access Scientific Repositories in Portugal) to get all scientific publications and used our system to extract more than 10 million URLs.
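
As a simplified illustration of the link-extraction step (not Arquivo.pt's actual code), the sketch below pulls candidate URLs out of a document's text so their content can be submitted for preservation; the regex is an assumption, and real documents would first require text extraction from PDF or HTML.

```python
# Extract candidate cited URLs from a document's text for later preservation.
import re

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_cited_urls(text):
    """Return de-duplicated URLs found in the text, stripping trailing punctuation."""
    urls = (u.rstrip(".,;") for u in URL_RE.findall(text))
    return sorted(set(urls))

sample = ("See the dataset at https://example.org/data/v2, "
          "and the archived news at http://news.example.com/story.")
print(extract_cited_urls(sample))
# ['http://news.example.com/story', 'https://example.org/data/v2']
```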

Integration of Bit Preservation for Web Archives, Using the Open Source BitRepository.org Framework

Rasmus Kristensen, Mathias Jensen, Colin Rosenthal, Eld Zierau
Royal Danish Library

The Royal Danish Library has used the open source BitRepository.org framework as the basis for bit preservation of Danish cultural heritage for more than ten years. Until 2022 this was the case for all digital materials except the web archive materials, whose bit preservation relied on the NetarchiveSuite archival module. However, this module had several disadvantages, since it only supported bit preservation with two online copies and one checksum copy. A third copy therefore had to be a backup copy detached from the active bit preservation, in which copies can be regularly checked and compared (via checksums). In the late 2010s, the library wanted to modernize the bit preservation platform for the Danish web archive, which resulted in an integration of NetarchiveSuite and the BitRepository.org framework. This also made it possible to keep only one online copy of the web archive, to include all three copies in the active bit preservation, to keep numerous checksum copies, and to achieve better independence between the copies, thus reducing the risk of incidents destroying all copies.

This presentation will describe the capabilities of the BitRepository.org framework and how it can support advanced active bit preservation for web archives in general. The main theme will be bit preservation: how BitRepository.org enables the storage of copies on all types of current and future media, and how it is technology agnostic in the sense that software and media technologies can be changed rather easily over time. We will also present how the BitRepository.org framework supports daily bit preservation operations, and how it enables setups with high access possibilities as well as a basis for high operational security at all levels. Furthermore, we will present our experiences with the flexibility of the setup and with the integration with NetarchiveSuite, and explain how the framework now supports S3 and can thus be integrated with many other web collection solutions.
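
As an illustrative sketch of the core idea of active bit preservation (not the BitRepository.org API), the following code recomputes checksums for each pillar's copy of a file and flags any copy that disagrees with the majority; pillar names and paths are placeholders.

```python
# Periodic checksum audit across copies ("pillars") of the same WARC file.
import hashlib
from collections import Counter

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def audit(copies):
    """copies: mapping of pillar name -> local path of that pillar's copy."""
    digests = {pillar: sha256(path) for pillar, path in copies.items()}
    reference, _ = Counter(digests.values()).most_common(1)[0]
    return [pillar for pillar, digest in digests.items() if digest != reference]

bad = audit({
    "online-pillar": "/pillar1/crawl-001.warc.gz",
    "tape-pillar": "/pillar2/crawl-001.warc.gz",
    "offsite-pillar": "/pillar3/crawl-001.warc.gz",
})
print("copies needing repair:", bad)
```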

SESSION#11: Planning

Public Tenders for web archiving services - The BnL’s approach to balancing autonomy and outsourcing for the Luxembourg Web Archive

Ben Els, Yves Maurer
National Library of Luxembourg

Drawing on the experience of writing a tender for web archiving services twice, the BnL will present its overall strategy, process and lessons learned.

In the wake of the 2020 Covid crisis, the BnL was able to increase its efforts in archiving the Luxembourg web by adding an additional domain crawl on a one-off basis. This led to a plan to continue expanding the archiving activities in the following years, which would naturally require an increase in yearly spending. Among different possibilities, the BnL decided to prepare a public tender for web archiving services.

This presentation covers the preparations for the first tender in 2021:

  • the different sections and services
    (domain crawls, self-service archiving tool, daily crawls of news media websites)
  • tender specifications
    (implementation timeframe, provider’s experience)
  • technical requirements
    (crawling capacities, deduplication capacities)
  • weighting of the quality-price ratio

The lessons learned from the first tendering process and contract period were used to prepare a second tender in 2023, which coincided with the recruitment of a software engineer to further strengthen the BnL's autonomy with an in-house crawling infrastructure. For certain high-priority targets, the BnL is now able to operate high-quality crawls with a customized Browsertrix crawler, complementary to the web archiving services included in the tender contract.

We will conclude the presentation with an overview of the different collections and archiving methods combined between the BnL's service provider and in-house crawls, together with an outlook on the future of our technical infrastructure and improvements in the areas of access and research.

Building a Roadmap for Web Archiving Organizational Sustainability in an American Research University Library

Ruth E. Bryan, Emily Collier
University of Kentucky Libraries, United States of America

The presenters, archivists in an academic university Library, launched a web archiving program for a public university in the United States in 2018 with a three-year Archive-it contract. The initial impetus for preserving websites stemmed from the legal mandate for the University Archives (a part of the Library’s Special Collections) to collect and preserve the university’s permanent records. These permanent records are defined in a records retention schedule and are increasingly created and disseminated only online. A secondary justification for initiating a web archiving program was to preserve websites and social media platforms of cultural importance in understanding the past and present of our state broadly, especially including individuals, families, communities, and organizations that are currently underrepresented in existing Special Collections primary sources. The COVID-19 pandemic accelerated the trend for these documents to also be created in primarily online formats.

During the first six years of our web archiving program, we have preserved 2.4 terabytes (72,463,597 documents), established collection development and description policies and procedures, conducted technology research, started collaborating with the university's central web management team, and developed toolkits and workflows for handling different types of online resources and quality assurance – all in a situation of minimal staffing (0.25 to 0.5 full-time equivalent (FTE) temporary employees and 0.1 to 0.2 permanent FTE). We have laid the groundwork for an ongoing web archiving program through robust documentation built in anticipation of potential loss of resources, especially personnel.

In May 2024, we will submit a funding request to Library administrators for a third, three-year contract, which provides us with an opportunity to evaluate and plan for the ongoing organizational sustainability of web archiving in our specific legal, cultural, staffing, partnership, and technical infrastructure context. Our definition of “organizational sustainability” is based on Eschenfelder et al.'s conception of “...the arrangements of people and work practices that keep digital projects and services over time, given ongoing challenges” (2019, p. 183). We will use their nine-dimensional framework for digital cultural heritage organizational sustainability in conjunction with the Socio-Technical Sustainability Roadmap (STSR), an evaluation and planning methodology already used successfully by other Library colleagues to develop guidelines to support collaborative digital projects with university faculty. Our IIPC Web Archiving Conference presentation will report on our resulting Socio-Technical Sustainability Roadmap for web archiving, how we plan to implement it, what we learned along the way, and the new and ongoing collaborations and relationships that underpin it.

This presentation builds on previous work about our web archiving program successes and challenges that we have shared at US conferences in 2022-2023. IIPC conference attendees will be able to use our roadmap as an example of planning for and implementing organizational sustainability in a legally-mandated, relationship-rich, and minimal resource context.

Eschenfelder, K. R., Shankar, K., Williams, R. D., Salo, D., Zhang, M., & Langham, A. (2019). A nine dimensional framework for digital cultural heritage organizational sustainability. Online Information Review, 43(2), 182–196. https://doi.org/10.1108/OIR-11-2017-0318

STSR: https://sites.haa.pitt.edu/sustainabilityroadmap/

Using Collection Development Policy to Address Critical Collection Gaps

Rashi Joshi
Library of Congress, United States of America

In late 2020, the Library embarked on an evaluation of our then 20-year-old web archiving program. The interdepartmental effort spanned several months and focused on addressing key aspects of the program such as resource allocation, acquisitions workflows, and longstanding collection gaps. This presentation will discuss two new policy approaches that were initiated in the spring of 2021 to address critical collection gaps.

Our web archives contain 4 PB of content and provide access to over 87 event and thematic collections on diverse topics. Despite the scale of the archives, an internal collection assessment of available metadata confirmed uneven collecting and persistent gaps for some subjects that the Library covers and for which it has assigned specialists, but for which there is no dedicated web archiving activity. To address these gaps, we implemented two new policy approaches that aim to make the best use of limited crawling and staff resources while centering the Library's mission to provide access to a universal collection of knowledge and creativity. First, we coordinated annual internal collecting priorities for web archiving that mapped to gap areas and published these through a Library-level collections policy committee. Second, we developed a new policy and formal process allowing senior managers to easily initiate interdepartmental collections on high-value topics to complement routine staff-driven proposals.

Since these approaches were implemented in 2021, they have proven to be an effective driving force for targeted submissions of new collection-level proposals. The Library has over 200 subject specialists whose contributions to web archiving vary greatly. Since 2021, subject specialists have submitted proposals that directly respond to annual priority areas for which we had previously not seen proposals in the past 20 years: art & design, mass communications, local history & genealogy, and several non-U.S. geographic regions underrepresented in the archives. Similarly, senior managers initiated several large collections on critical and current topics. For the management-initiated collections, the Collection Development Office acted as an incubator as it does for subject specialist-initiated proposals, but expanded on its role by drafting proposed collection scopes and providing them to managers as potential starting points that they could then revise as needed based on subject specialist feedback and develop into full project plans. We did this for the Coronavirus, Protests Against Racism, the Jan 6th Attack on the Capitol, and most recently Climate Change collections all of which were initiated after the start of the pandemic and represent high value, multiyear collections.

This presentation could be helpful to institutions that have experienced a similar impasse, in which their web archives have critical, longstanding collection gaps but staff and program resources are scarce. Implementing policy approaches such as library-level priority setting, and taking a more active collection development support role via staff positions that are at least partially dedicated to incubating digital content acquisition proposals, may serve as a model for other organizations to help focus limited staff time and resources strategically.

Practice Meets Research: Developing a Training Workshop to Support Use of Web Archives in Arts & Humanities Research

Sara Day Thomson, Anna Talley, Alice Austin
University of Edinburgh, United Kingdom

This short presentation will provide an overview of a training workshop for arts and humanities PhD researchers in the UK developed through collaboration between digital archiving practitioners and academic researchers. This training workshop is organized by a UK HEI and funded by a national body that supports PhD study and relevant training. The curriculum reflects the importance of understanding collection practices to underpin research analysis and methodologies. The training in part responds to the insights of Jessica Ogden and Emily Maemura in their 2021 paper: ‘… our work points to the value of engaging with the curatorial infrastructure that surrounds the creation of web archives … and the importance of curators’ involvement in discussions at these early stages of research’. [1]

Based on the outcomes of similar practitioner-researcher collaborations, such as the BUDDAH project [2] and WARCnet [3], this training workshop primarily aims to raise awareness and encourage researcher use of web archives. The PhD collaborator, whose project makes extensive use of web archives as primary source material, recognised a need for training that would help fellow researchers to navigate the limitations of archived web content and the varied methodologies for both collecting and analysing web archives. At the same time, the digital archive practitioners were wrapping up involvement in an externally-funded project that provided an opportunity to explore the challenges to researcher use of the archived web, particularly in a national web archive context. Both experiences showed that while researchers might have a general interest in the archived web, researchers in arts and humanities fields show hesitancy to engage with such sources due to a perceived lack of “technical skills.”

Taking a cue from previous work in this area and recent experiences, the collaborative researcher-practitioner team developed a training workshop to help reduce this barrier to use. The workshop emphasizes hands-on exercises alongside case studies and introductory overviews, recognising practical experience as the best way to encounter and understand the opportunities and limitations presented by web archives. These practical exercises also provide an opportunity to demonstrate the effects of the tools most commonly used to capture and curate web archives.

This presentation will discuss the development of the training workshop, including the curriculum and outreach, as well as analysis of the outcomes. Though the workshop will not be delivered until November 2023, the collaboration has already helped to gather insights from researchers as well as raise the profile of web archives through a prominent postgraduate research funder in the UK. The team hopes to share lessons learned with WAC 2024 and invite input on how to improve the training and make an even greater impact in future.

[1] Ogden, J. , & Maemura, E. (2021). ‘Go Fish’: Conceptualising the challenges of engaging national web archives for digital research. International Journal of Digital Humanities, 2(1-3). DOI: https://doi.org/10.1007/s42803-021-00032-5

[2] Big UK Domain Data for the Arts and Humanities, AHRC-funded project (2014-2016). https://web.archive.org/web/20161023043903/http://buddah.projects.history.ac.uk/

[3] WARCnet project, Independent Research Fund Denmark-funded project (2020-2023). https://web.archive.org/web/20230901135947/https://cc.au.dk/en/warcnet

SESSION#12: Innovative Harvesting

Decentralized Web Archiving and Replay via InterPlanetary Archival Record Object (IPARO)

Sawood Alam
Internet Archive, United States of America

We propose InterPlanetary Archival Record Object (IPARO), a decentralized version tracking system using the existing primitives of InterPlanetary File System (IPFS) and InterPlanetary Name System (IPNS). While we focus primarily on the web archiving use-case, our proposed approach can be used in other applications that require versioning, such as a wiki or a collaborative code tracking system. Our proposed system does not rely on any centralized servers for archiving or replay, enabling any individual or organization to participate in the web archiving ecosystem and be discovered without the trouble of dealing with unnecessary traffic. The system continues to allow Memento aggregators to play their role from which both large and small archives can benefit and flourish.

An earlier attempt at decentralized web archiving was realized as InterPlanetary Wayback (IPWB), which ingested WARC response records into the IPFS store and indexed their Content Identifiers (CIDs) in CDXJ files for decentralized storage retrieval, in place of the WARC file name, byte offset, and byte size used in traditional archival playback systems. The primary limitation of this system was its centralized index, which was needed locally for the discovery of archived data in IPFS stores. Proposals to make IPNS history-aware required changes to the underlying systems and/or additional infrastructure, and failed to mobilize any implementations or adoption.

IPARO takes a self-contained linking approach to facilitate storage, discovery, and versioning while operating within the existing architecture of IPFS and IPNS. An IPARO is a container for every archival observation that is intended to be looked up and replayed independently. These objects contain an extensible set of headers, the data in a supported archival format (e.g., WARC, WACZ, and HAR), and optional trailers. The purpose of headers is to identify the media type of the data, to establish relationships with other objects, to describe interpretation of the data and any trailers, and to hold other associated metadata. By storing CIDs of one or more prior versions of the same resource in the header we form a linked-list of IPAROs that can be traversed backward in time, allowing discovery of prior versions from the most recent version. The most recent memento (and a version at a known time) can be discovered by querying IPNS for specific URIs. Multiple prior links in each IPARO make the system more resilient for discovery of versions prior to a missing/unretrievable record as well as enable more efficient reconstruction of TimeMaps.
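
The following conceptual sketch models the linking just described: each object carries the CIDs of prior versions in its headers, so the version history can be walked backwards from the most recent CID obtained via IPNS. The in-memory dictionary stands in for IPFS retrieval, and the field names are illustrative rather than a finalized IPARO format.

```python
# Conceptual model of IPARO prior-version links and TimeMap reconstruction.
from dataclasses import dataclass, field

@dataclass
class IPARO:
    target_uri: str
    timestamp: str
    media_type: str                                  # e.g. "application/warc"
    prior_cids: list = field(default_factory=list)   # links to earlier versions
    data: bytes = b""

store = {  # cid -> IPARO, standing in for IPFS block retrieval
    "cid-v1": IPARO("https://example.org/", "2023-01-01T00:00:00Z", "application/warc"),
    "cid-v2": IPARO("https://example.org/", "2023-06-01T00:00:00Z", "application/warc", ["cid-v1"]),
    "cid-v3": IPARO("https://example.org/", "2024-01-01T00:00:00Z", "application/warc", ["cid-v2", "cid-v1"]),
}

def timemap(latest_cid):
    """Walk prior-version links backwards to reconstruct the version history."""
    versions, cid = [], latest_cid
    while cid and cid in store:
        obj = store[cid]
        versions.append((obj.timestamp, cid))
        # Follow the first resolvable prior link; extra links add resilience
        # when an intermediate object is missing or unretrievable.
        cid = next((c for c in obj.prior_cids if c in store), None)
    return versions

print(timemap("cid-v3"))
# [('2024-01-01T00:00:00Z', 'cid-v3'), ('2023-06-01T00:00:00Z', 'cid-v2'), ('2023-01-01T00:00:00Z', 'cid-v1')]
```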

Moreover, IPFS allows for custom block partitions when creating the Merkle tree for underlying storage, which means that slicing IPAROs at strategic places before storage can leverage the built-in deduplication capabilities of IPFS. This can be utilized by identifying resource payloads and headers that change less frequently or have a finite number of permutations and isolating them from parts of the data or metadata that change often.

Furthermore, a trailer can be added containing a suitable nonce to force the generation of CIDs with desired substrings in them. This can be helpful in grouping objects by certain tags, names, types, etc., per the application's needs.

Server-Side Web Archiving with ReproZip-Web

Katherine Boss1, Vicky Rampin1, Rémi Rampin2, Ilya Kreymer3
1New York University Libraries, United States of America; 2New York University, United States of America; 3Webrecorder, United States of America

Complex websites face many archiving challenges. Even with high-fidelity capture tools, many sites remain difficult to crawl due to their use of highly dynamic, non-deterministic network access, for example to provide site-wide search. To fully archive such sites, encapsulating the web server so that it may be recreated in an emulated environment may be the highest-fidelity option. But encapsulating a single web server is often not enough, as most sites load resources from multiple servers or include external embeds from services like MapBox, Google Maps or YouTube. To fully archive sites with dynamic server- and client-side components, we present an integrated tool that provides an overlay of high-fidelity server emulation coupled with a high-fidelity web archive.

ReproZip-Web is an open-source, grant-funded [1] web-archiving tool capable of server-side web archiving. It builds on the capture and client-side replay tools of Webrecorder to capture the front-end of a website, and the reproducibility software ReproZip to encapsulate the backend dynamic web server software and its dependencies. The output is a self-contained, isolated, and preservation-ready bundle, an .rpz file, with all the information needed to replay a website, including the source code, the computational environment (e.g. the operating system, software libraries) and the files used by the app (e.g. data, static files). Its lightweight nature makes it ideal for distribution and preservation.

This presentation will discuss the strengths and limitations of ReproZip-Web, outline ideal use-cases for this tool, and demonstrate how to trace and pack a dynamic site. We will also highlight new features in Webrecorder tools (ArchiveWeb.page and ReplayWeb.page) to allow capture and replay to differentiate and merge content from a live server and WACZ file, allowing for overlaying of preserved server and traditional web archive. Finally, we will discuss the infrastructure needed for memory institutions to provide access to these archived works for the long term.

[1] Institute of Museum and Library Services, “Preserving the Dynamic Web: Building a Production-level Tool to Save Data Journalism and Interactive Scholarship,” NLG-L Recipient, LG-250049-OLS-21, Aug. 2021. http://www.imls.gov/grants/awarded/lg-250049-ols-21.

A Test of Browser-based Crawls of Streaming Services' Interfaces

Andreas Lenander Ægidius
Royal Danish Library

This paper presents a test of browser-based web crawling on a sample of streaming services' websites and web players. We are especially interested in their graphical user interfaces, since the Royal Danish Library collects most of the content by other means. In a legal deposit setting, and for the purposes of this test, we argue that streaming services consist of three main parts: their catalogue, metadata, and graphical user interfaces. We find that collecting all three parts is essential in order to preserve and play back what we could call 'the streaming experience'. The goal of the test is to see if we can capture a representative sample of the contemporary streaming experience, from the initial login to (momentary) playback of the contents.

Currently, the Danish Web archive (Netarkivet) implements browser-based crawl systems to optimize its collection of the Danish Web sphere (Myrvoll et al., n.d.). The test will run on a local instance of Browsertrix (Webrecorder, n.d.). This will let us log in to services that require a local IP address. Our sample includes streaming services for books, music, TV series, and gaming.
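
Purely as an illustration of how logged-in, browser-based capture can be scripted, the sketch below drives the Browsertrix Crawler container directly rather than the Browsertrix web interface mentioned above: a login profile is created interactively and then reused for the crawl. The service URLs are placeholders, and the exact paths and options are assumptions that should be checked against the current Browsertrix Crawler documentation.

  import os
  import subprocess

  LOGIN_URL = "https://streaming.example.dk/login"  # placeholder, not a real target
  START_URL = "https://streaming.example.dk/"
  CRAWLS = f"{os.getcwd()}/crawls"

  # 1. Interactively create a browser profile that stores the logged-in session.
  subprocess.run([
      "docker", "run", "-it", "-p", "9222:9222",
      "-v", f"{CRAWLS}:/crawls/",
      "webrecorder/browsertrix-crawler", "create-login-profile",
      "--url", LOGIN_URL,
  ], check=True)

  # 2. Crawl the service's interface with that profile and write a WACZ for replay.
  subprocess.run([
      "docker", "run",
      "-v", f"{CRAWLS}:/crawls/",
      "webrecorder/browsertrix-crawler", "crawl",
      "--url", START_URL,
      "--profile", "/crawls/profiles/profile.tar.gz",  # default output path of step 1 (assumed)
      "--generateWACZ",
      "--collection", "streaming-ui-test",
  ], check=True)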

In the streaming era, the very features that define it threaten to impede access to important media history and cultural heritage. Streaming services are transnational and sit behind paywalls, while content catalogues and interfaces change constantly (Colbjørnsen et al., 2021). This challenges the collection and preservation of how they present and play back the available content. On a daily basis, Danes stream more TV (47 pct.) than they watch flow TV (37 pct.), and six out of 10 Danes subscribe to Netflix (Kantar-Gallup, 2022). Streaming is a standard for many and no longer a first-mover activity, at least in the Nordic region of Europe (Lüders et al., 2021).

The Danish Web archive collects websites of streaming services as part of its quarterly cross-sectional crawls of the Danish Web sphere (The Royal Danish Library, n.d.). A recent analysis of its collection of web sites and interfaces concluded that the automated collection process provides insufficient documentation of the Danish streaming services (Aegidius and Andersen, in review).

This paper presents findings from a test of browser-based crawls of streaming services’ interfaces. We will discuss the most prominent sources of errors and how we may optimize the collection of national and international streaming services.

Selected References

Aegidius, A. L. & Andersen M. M. T. (in review) Collecting streaming services, Convergence: The International Journal of Research into New Media Technologies

Colbjørnsen, T., Tallerås K., & Øfsti, M. (2021) Contingent availability: a case-based approach to understanding availability in streaming services and cultural policy implications, International Journal of Cultural Policy, 27:7, 936-951, DOI: 10.1080/10286632.2020.1860030

Lüders, M., Sundet, V. S., & Colbjørnsen, T. (2021) Towards streaming as a dominant mode of media use? A user typology approach to music and television streaming. Nordicom Review, 42(1), 35–57. https://doi.org/10.2478/nor-2021-0011

Myrvoll A.K., Jackson A., O'Brien, B., et al. (n.d.) Browser-based crawling system for all. Available at: https://netpreserve.org/projects/browser-based-crawling/ (accessed 26 May 2023)

Crawling Toward Preservation of References in Digital Scholarship: ETDs to URLs to WACZs

Lauren Ko, Mark Phillips
University of North Texas Libraries, United States of America

The University of North Texas has required born-digital Electronic Theses and Dissertations (ETDs) of its students since 1999. During that time, over 9,000 of these documents have been archived in the UNT Digital Library for access and preservation.

Motivated by discussions at the 2023 Web Archiving Conference about the need to better curate works of digital scholarship together with the URL-based references they contain, the UNT Libraries set out to address this problem for newly submitted ETDs. Mindful of the burdens already placed on students submitting works of scholarship in attainment of a degree, we opted to implement a solution that added no additional requirements for authors and that could be repeated with each semester's new ETDs in a mostly automated way.

We began to experiment with identifying URLs in the ETDs that could be archived and made a permanent part of the package added to the digital library. This would allow future users to better understand the context of some of the references in the document. In the first step of our workflow, we extracted URLs from the submitted PDF documents. This required experimentation with different programmatic approaches to converting the documents to plain text or HTML and parsing URLs from the resulting text. Some methods were more successful than others, but all were challenging due to the many ways that a URL can present itself (e.g., split over multiple lines, across pages, in footnotes, etc.). Next, using Browsertrix Crawler, we archived the extracted URLs, saving the results for each ETD in a separate WACZ file. This WACZ file was added to the preservation package of the ETD and submitted to the UNT Digital Library. To view the archived URL content, a user can download the WACZ file and open it in a service like ReplayWeb.page (https://replayweb.page/). The UNT Libraries are experimenting with an integrated viewer for WACZ content in their existing digital library infrastructure and with how to make this option available to users in the near future.
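
As an illustration of the first step of such a workflow (a minimal sketch, not UNT's actual code), the snippet below extracts plain text from a PDF with pdfminer.six and applies a deliberately permissive URL pattern; URLs split across lines or pages, the main difficulty noted above, would still need extra normalization. The resulting per-ETD seed list is what would then be handed to Browsertrix Crawler to produce a WACZ.

  import re
  from pathlib import Path

  from pdfminer.high_level import extract_text  # pip install pdfminer.six

  # Deliberately permissive pattern; trailing punctuation is stripped afterwards.
  URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)

  def extract_urls(pdf_path: str) -> list[str]:
      """Extract candidate reference URLs from an ETD PDF."""
      text = extract_text(pdf_path)
      # NOTE: URLs broken across lines or pages are not rejoined here; handling them
      # is exactly the messy part described in the abstract.
      urls = [u.rstrip(".,;") for u in URL_RE.findall(text)]
      return sorted(set(urls))

  if __name__ == "__main__":
      etd = "example_etd.pdf"  # placeholder filename
      seeds = extract_urls(etd)
      # One seed list per ETD; this file would then be fed to Browsertrix Crawler,
      # and the resulting WACZ packaged alongside the ETD for preservation.
      Path(Path(etd).stem + "-urls.txt").write_text("\n".join(seeds) + "\n")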

In this presentation we expound on our workflow for building these web archives in the context of the ETDs from which the URLs are extracted, as they are packaged for preservation and viewed in our digital library alongside their originating documents. By sharing this work, we hope to continue the discussion on how to best preserve URLs within works of scholarship and offer steps that conference attendees may be interested in implementing at their own institutions.

PANELS

PANEL#01: “Can we capture this?”: Assessing Website Archivability Beyond Trial and Error

Meghan Lyon1, Calum Wrench2, Tom Storrar3, Nicholas Taylor4
1Library of Congress, United States of America; 2MirrorWeb, United Kingdom; 3The National Archives, United Kingdom; 4Los Alamos National Laboratory, United States of America

The purpose of this presentation is to discuss the nature of, implementation of, and attitudes towards "archivability" within the web archiving community, and the intersection of that community with website developers and owners. Archivability can be defined as how capable or suitable a website is of being archived via web crawl, and whether the captured content will be replayable and usable with the assistance of web archive replay software. Research and writing on archivability by Nicholas Taylor is part of our inspiration.1 The European Union Publications Office also has its own definition of archivability,2 and other well-known actors have recently put forth ideas in this space.3

Website design and technologies continue to advance at a rapid pace, though innovations often do not conform to a consistent standard. In many cases the concept of archivability seems far from the forefront of web developers' intentions. This means that web archivists are often left in the web's developmental wake, with their web archiving tools and techniques unable to keep pace well enough to effectively archive modern, complex websites.4,5,6

The appetite for discussion of, and emphasis on, the archivability of websites has increased steadily across the web archiving community over time.7,8 These concerns are not shared solely by web archivists: the topic has more recently gained traction with web developers, catalysed by the demise of Flash and the increasing popularity of complex modern JavaScript frameworks. In a recent presentation, Rich Harris, creator of the popular Svelte JavaScript framework, discussed how the heavy reliance on JavaScript to deliver content on the web has proved increasingly challenging for preservation.9 There is a real threat that much of the web will go unpreserved if measures are not taken, both by developers and archivists, to tackle the issue, with acknowledgement of dissenting opinions, such as the blog post "Making Websites Archivable with JavaScript" by Andy Jackson, which throws a wrench into this theory!10

The initial presentation aims to outline the basic tenets of archivability in the web archiving space, asking: what does it mean for a website to be archivable, and to what extent are we stretching that terminology when websites can be such complex, composite, data-rich entities? We will discuss existing frameworks for archivability, such as The National Archives UK's "How to Make Your Website Archive Compliant"11 and the Library of Congress' "Creating Preservable Websites",12 and present bespoke research into current writing and practice around archivability. The presentation will be centred on results from a survey conducted on behalf of the Library of Congress and will strive to present robust guidelines and recommendations for both web archivists and site owners.

Our second presentation will briefly explain how we believe research on archivability can support quality assurance, curation, education, and outreach. We'll also discuss the concept of archivability "at scale" and how we hope to improve communication with stakeholders using the research presented today. Our third and tentative fourth panellists may also discuss how archivability plays out in their organizations' web archiving programs. Largely, however, we'd like this panel to be a discussion, joining the long-running conversation around archivability,13 and to invite robust Q&A from the audience.

Example questions

How do the legal/ethical permissions of crawling content affect an institution’s ability to recommend or pursue web archiving best practices when it comes to archivability? Do restrictions limit site owner contact, or prevent recommendations or actions being possible?

To what extent do archivability considerations affect nominations for web archiving, in the context of balance of user access (and archive fidelity) versus volume of data collection?

The United States doesn’t have legal deposit, but some of our panellists live in a country with legal deposit laws. Tell us a little more about how that works, particularly around setting guidelines or putting forth frameworks for archivability?

There's a lot of technical information surrounding archivability; how do we boil it down? Is this something the community will have to continue doing on the fly for our curators?

How effectively are we intervening upstream (e.g., with open-source content management system builders, national government web design systems, etc.) to affect the archivability of broad swaths of the web? How might the IIPC community be more effectively organized for that work?

What’s our benchmark on the extent to which the gap is closing or expanding on archivability? It’s not necessarily getting any easier to archive websites, but our tools are also improving.

How is web archiving advancing with web development and how do existing web archiving programs and strategies need to shift to keep up?

How do we or have we gauged the efficacy of our archivability advocacy efforts? To the extent that we have had traction, what framing has tended to be most effective?

The dynamic and/or personalized nature of the web raises the question of whether "archivability" even remains the most appropriate term. Along the lines of data and software management, should we be talking more about "reproducibility"?

Sources

    1. Nicholas Taylor, https://nullhandle.org/web-archivability/index.html
    2. EU Publications Office, Guidelines to Make Archivable Websites, https://op.europa.eu/en/web/web-tools/guidelines-to-make-archivable-websites
    3. John Berlin, Mat Kelly, Michael L. Nelson, Michele C. Weigle, To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages, https://dl.acm.org/doi/10.1145/3589206
    4. Vangelis Banos, Yannis Manolopoulos, Web Content Management Systems Archivability, 2015, https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c24fbd45177db53fd53842b0da024c99ba0bd72e, and https://archiveready.com/
    5. Justin F. Brunelle, Mat Kelly, Michele C. Weigle & Michael L. Nelson, The Impact of JavaScript on Archivability, 2015, https://link.springer.com/article/10.1007/s00799-015-0140-8
    6. Nicola Bingham and Helena Byrne, Archival strategies for contemporary collecting in a world of big data, 2021, https://journals.sagepub.com/doi/10.1177/2053951721990409
    7. Helen Hockx-Yu, How to Make Websites More Archivable? 2012. https://blogs.bl.uk/webarchive/2012/09/how-to-make-websites-more-archivable.html
    8. Columbia University, Guidelines for Preservable Websites, https://library.columbia.edu/collections/web-archives/guidelines.html
    9. Rich Harris on frameworks, the web, and the edge, 2023. https://youtu.be/uXCipjbcQfM?t=406
    10. Andy Jackson, Making Websites Archivable with JavaScript, 2023. https://anjackson.net/2023/03/10/making-websites-archivable-with-javascript/
    11. The National Archives UK, How to make your website archive compliant, https://www.nationalarchives.gov.uk/webarchive/archive-a-website/how-to-make-your-website-compliant/
    12. Library of Congress Web Archiving Program, Creating Preservable Websites, https://www.loc.gov/programs/web-archiving/for-site-owners/creating-preservable-websites/
    13. David Rosenthal, Moonalice plays Palo Alto, 2011. https://blog.dshr.org/2011/08/moonalice-plays-palo-alto.html

PANEL#02: Archiving social media in an age of APIcalypse

Frédéric Clavert1, Anat Ben David2, Beatrice Cannelli3, Pierre-Carl Langlais4, Benjamin Ooghe-Tabanou5, Jerôme Thièvre6
1University of Luxembourg; 2Open University of Israel; 3School of Advanced Study, University of London, United Kingdom; 4OpSci, France; 5Sciences Po médialab, France; 6National Audiovisual Institute, France

During the first part of 2023, within a few months, two platforms put all access to their APIs behind a paywall: Twitter (now X; announced in February, implemented in April for commercial use and in June for researchers) and Reddit (starting in April). Application programming interfaces (APIs) play an important role on the web, as they allow the exchange of information and features. Twitter's free API allowed the building of popular third-party (commercial) applications, for instance, but also data harvesting for many research projects. In short, Twitter and Reddit killed an ecosystem that they had encouraged building and that had contributed to forging their (past?) popularity. It is not the first time that platforms have limited access to their data: in 2015, LinkedIn restricted access to its data; in 2015 and 2016, and again following the Cambridge Analytica scandal, Facebook shut down most of its API functionalities. Investigating the Facebook case in 2019, the researcher Axel Bruns spoke of an APIcalypse. At the same time, Twitter had already restricted access to its data, with the consequence that the Library of Congress stopped archiving the entirety of the Twitter firehose.

The reasons to shut down free access to APIs are numerous: officially to limit toxic political uses of their API in the case of Facebook, to find a new business model (Twitter/X), to avoid large data harvesting for LLM training purposes in the case of Reddit.

In those three cases (Facebook, Twitter, Reddit), academic research and archiving services are at the very least collateral victims. Musk's takeover of Twitter in 2022 brutally ended a policy that had allowed researchers to collect data for their work under reasonable conditions. In 2016, the end of access to Facebook data forced many research projects to stop collecting data. Researchers (and archives) may be more than side victims: Axel Bruns argued that Facebook's "actions in responding to the Cambridge Analytica scandal raise suspicions that they have instrumentalised it to actively frustrate critical, independent, public interest scrutiny by scholars."

In 2023, the end of free access to Twitter and Reddit data has also prompted reactions from researchers, for instance in France (LeMonde.fr, 2023), where a group of researchers and a former minister responsible for digital affairs argued that closing down APIs "opacifies key areas of civic dialogue and prevents them from being analyzed by the academic world", or through texts published by the Coalition for Independent Technology Research (2023a & b).(1)

This APIcalypse will obviously affect web archiving, all the more so because, beyond the APIs themselves, Twitter has in recent months set up many small mechanisms that may prevent web archiving of, for instance, a politician's X page, even without using its API. This round table will gather a moderator and speakers from the worlds of web archiving and of public and private research to investigate the consequences of this APIcalypse for research and web archiving.

Moderator

Frédéric Clavert is an assistant professor at the Centre for Contemporary and Digital History. Interested in the link between social media and collective memory, he led a research project on the echoes of the Centenary of the Great War on Twitter. He initiated this round table.

Speakers

Anat Ben David is Associate Professor of Communication at the Open University of Israel. Her primary research interests are national web studies and digital sovereignty, web history and web archive research, and the politics of online platforms.

Beatrice Cannelli is a PhD researcher in Digital Humanities at the School of Advanced Study, University of London, investigating the challenges of social media archiving.

Pierre-Carl Langlais is Head of research at OpSci, a private firm specialized in opinion studies based on social (web) data using Large Language Models.

Benjamin Ooghe-Tabanou is a research engineer specialized in web mining and social network analysis for social sciences, and he is the manager of the research engineers team at médialab Sciences Po (Paris), which has been working with most social network APIs over the past 12 years.

Jerôme Thièvre has a PhD in Computer Science and is the manager of the Web Archive team at the Institut National de l'Audiovisuel (INA). Within the context of the French web legal deposit, INA has been archiving tweets related to the French audiovisual domain and news since 2014.

References

Bruns, Axel, "After the 'APIcalypse': social media platforms and their fight against critical scholarly research", Information, Communication & Society, vol. 22, no. 11, 2019, pp. 1544-1566, https://www.tandfonline.com/doi/full/10.1080/1369118X.2019.1637447 (accessed 29/08/2023).

"Les conversations sur les médias sociaux sont des expressions démocratiques qui ne sauraient être cachées à la recherche" ["Conversations on social media are democratic expressions that must not be hidden from research"], Le Monde.fr, 2023, https://www.lemonde.fr/idees/article/2023/06/16/les-conversations-sur-les-medias-sociaux-sont-des-expressions-democratiques-qui-ne-sauraient-etre-cachees-a-la-recherche_6177952_3232.html (accessed 28/06/2023).

Coalition for Independent Technology Research, "Restricting Reddit Data Access Threatens Online Safety & Public-Interest Research", 2023, https://independenttechresearch.org/reddit-data-access-letter/ (accessed 11/09/2023).

Coalition for Independent Technology Research, "Letter: Twitter's New API Plans Will Devastate Public Interest Research", 2023, https://independenttechresearch.org/letter-twitters-new-api-plans-will-devastate-public-interest-research/ (accessed 11/09/2023).

(1) We inform our readers that some of the listed speakers, as well as the moderator of this round table, have co-signed one or more of these texts.

PANEL#03: Striking the Balance: Empowering Web Archivists And Researchers In Accessible Web Archives

Caylin Smith1, Leontien Talboom1, Andrea Kocsis2, Mark Bell3, Alice Austin4
1Cambridge University Libraries, United Kingdom; 2Northeastern University London, United Kingdom; 3The National Archives, United Kingdom; 4University of Edinburgh, United Kingdom

In recent years, the question of providing access to web archives has become a prominent and complex issue. This debate has been the subject of in-depth examination by various projects, each contributing valuable insights into the matter. Notable among these initiatives are the Archives Unleashed project, Collections as Data, and the GLAM workbench. These endeavours have shed light on the multifaceted challenges and opportunities associated with making archived web content accessible to researchers and the broader public.

This panel discussion seeks to delve deeper into the practical dynamics that exist between the web archivists responsible for curating and facilitating access to these digital resources and the researchers who use them for their studies. The panel boasts a lineup of individuals with extensive experience in the field, several of whom have been actively involved in both the provision of access to archived web materials and the utilisation of these materials in their research endeavours.

A significant aspect of the panel's expertise lies in their involvement in previous projects that have drawn from web archives, such as the UK Web Archive and the UK Government Web Archive. Additionally, the discussion will explore projects like the Archive of Tomorrow and several pieces of work that explored the accessibility of the UK Government Web Archive, including a Data Study Group, work on using Notebooks as a form of access, and a book chapter on a research dashboard as an alternative way of access.

Throughout the panel, several topics will be explored. These include the practical challenges associated with providing access to web archives, navigating the complex web of legislative frameworks governing web content, handling sensitive materials, and implementing precautions when utilising such data. Furthermore, the discussion will examine the various formats through which access to web archives can be provided, offering a comprehensive view of the evolving strategies and technologies employed in this field.

Examples of questions to be discussed by the panel members:

When providing access to web archives, how much pre-processing should web archives be doing, especially when considering that the use of certain computational methods (such as sorting or algorithms) may lead to unwanted biases in the material?

What are seen as helpful tools or documentation by researchers? What other requirements for tools or documentation still need solutions?

Is keyword search a useful tool when using web archives?

How should researchers be made aware of sensitive material in web archives?

Legislation can make access to web archives difficult. What are examples of ways around these challenges?

By bringing together experts who have grappled with the intricacies of web archiving from both sides - as curators and as researchers - this panel promises to offer invaluable insights into the current state of web archiving and the symbiotic relationship between those who preserve digital history and those who seek to explore it. Ultimately, the discussion aims to contribute to the ongoing dialogue surrounding web archives, their accessibility, and their crucial role in preserving digital heritage.

PANEL#04: Building Inclusive Web Archives Through Community-Oriented Programs

Jefferson Bailey1, Makiba Foster2, Sumitra Duncan3
1Internet Archive, United States of America; 2The College of Wooster, United States of America; 3Frick Art Reference Library, United States of America

The web is now well established as a medium of record for primary sources that are central to the collecting mandate of libraries, archives, and heritage organizations. The web is also a source of many types of primary records that have no equivalent in analog form and represent historically valuable materials that document community life. The ease of publishing and sharing information online has enabled a range of voices to emerge from individuals and communities that may have had few instantiations in the printed documentary record of the past and are largely missing from physical archives. Collecting web-published primary sources is also possible at a vastly increased breadth of scale, and vastly decreased technical and staffing cost, than the collection of print materials. The library and archives profession has seized this opportunity by pursuing web archiving in ever-increasing numbers.

Yet this effort has been dominated by universities and national-scale organizations. Smaller, more locally focused public libraries, local museums, community archives, and similar heritage organizations have, in the past, constituted an extremely small percentage of the national web archiving community, a proportion far out of balance with their overall numbers and not in keeping with the richness of their existing physical archival collections. One need only look at IIPC's membership, almost exclusively composed of national libraries and archives and large research universities, to see that the international web archiving community lacks diversity in institutional size and collecting focus. For instance, many public libraries have active local history collections and have traditionally collected print and analog materials that document their region, yet until recently had not implemented web archiving programs. Similarly, large portions of the web richly document the lives and stories of groups underrepresented in traditional archival collections, yet these communities often fall outside the collecting scope of the larger institutions. Finally, a lack of technical infrastructure or professional knowledge can also impede grassroots or community-driven efforts to preserve the web in order to enrich and diversify the archive.

This panel session will feature three large-scale programs that pursue community archiving by preserving web-published materials. The three programs represent different approaches to a similar goal: preserving community memory and local history and diversifying the archive through collaboration and grassroots-oriented empowerment, by training and enabling those best suited to documenting communities, the community members themselves. The three programs featured in the panel are: 1) Community Webs, a program started in 2017 to give public libraries the services and training to document local history through web archiving; over 200 public and local libraries are now involved; 2) Archiving the Black Web, a program establishing a more equitable and accessible web archiving practice that can more effectively document the Black experience online; and 3) the Collaborative ART Archive (CARTA), a cooperative program of art libraries building collections of archived web-based content related to art history and contemporary art practice. Leads from each program will discuss the programs' origins, accomplishments, and current and future work, and the shared theme of how to diversify both the web archive and the people and organizations building web collections.

Short abstracts for each of the presentations:

Each program will give a brief presentation updating the current status of their work to prompt a discussion that will advocate for, and propose methods to enable, a greater representation of individuals, nations, cultures, and organizations in web archiving.

Community Webs: Community Webs was launched in 2017. Its mission is to advance the capacity of public libraries and other cultural heritage organizations to build archives of web-published primary sources documenting local history and underrepresented voices. The program achieves this mission by providing resources for professional training, technology services, and networking, and by supporting scholarly research use. Over 200 organizations are currently in the program, and they have built hundreds of web archive collections documenting community memory, underrepresented groups, and local history.

Archiving the Black Web: The expansive growth of the web and social media, and the wide use of these platforms by Black people, present significant opportunities for archivists and other memory workers interested in documenting the contemporary Black experience. But while web archiving practice and tools have grown over the past twenty-five years, web archiving remains a cost-prohibitive activity and presents access and resource challenges that prevent large sectors of the archives profession, and especially Black collecting organizations, from fully engaging in the practice. The Archiving the Black Web program is an urgent call to action to address these issues, with the goal of establishing a more equitable and accessible web archiving practice that can more effectively document the Black experience online.

Collaborative ART Archive (CARTA): Internet Archive and the New York Art Resources Consortium (NYARC) have spearheaded this collaborative project aimed at capturing and preserving at-risk web-based art materials. CARTA is a collaborative entity of art libraries building collections of archived web-based content related to art history and contemporary art practice. Through this collaborative approach, the project leverages shared infrastructure, expertise and collecting activities amongst participating organizations, scaling the extent of web-published, born-digital materials preserved and accessible for art scholarship and research. The goals are to promote streamlined access to art reference and research resources, enable new types of scholarly use for art-related materials, and ensure that the art historical record of the 21st century is readily accessible far into the future.

Sample questions for a 30+ minute discussion section:

What motivated a community-driven approach to pursuing these programs?

What characteristics of web archiving practice or institutions led to the need to build these types of programs?

In developing programs with a community focus, what are successful and unsuccessful methods to foster collaboration, expansion, sustainability, and impact?

What can web archives and archivists do to ensure a diverse and inclusive community of practice?

What are unique attributes of the collections emerging from, and participants involved in, these types of programs?

How can IIPC as an organization encourage more community-based work and diversification of the records in the collective web archive?

WORKSHOPS

WORKSHOP#01: Training the Trainers - Helping Web Archiving Professionals Become Confident Trainers

Claire Newing1, Ricardo Basílio2, Lauren Baker3, Kody Willis4
1The National Archives, United Kingdom; 2Arquivo.pt, Portugal; 3Library of Congress, United States of America; 4Internet Archive, United States of America

The 'Training the Trainers' workshop aims to provide participants with concepts, methodologies, and materials to help them create and deliver training courses on web archiving in the context of their organizations and adapted to their communities.

Knowing how to archive the web is becoming an increasingly necessary skill for those who manage information in organizations (those responsible for institutional memory).

More and more people are realizing how important it is for organizations to keep track of the content they publish online. Whether for a large organization like a country's government or for a small association, preserving institutional memory is increasingly valued. An organization's web presence, through its website and social media channels, is part of its digital heritage. News about an organization published by newspapers, radio stations, podcasts, and other sites is also important for institutional history, so this content should be considered part of web archiving.

The problem is that, in practice, web archiving is a little-known activity within the reach of few. It is often thought of as something for IT professionals. In addition, there are concerns about the legal implications: countries' legal frameworks are sometimes unsuited to content published on the web. Access is also limited, as archived content can often only be exposed through catalogues and digital libraries using standards such as Dublin Core or OAI-PMH. As a result, there is little investment in web archiving and an insufficient response to the need for preservation.

In this session, we invite participants to become trainers. We challenge those with knowledge of web archiving to translate it into training activities. Basic web archiving is not that difficult. With a little training, anyone can be a web archivist and a promoter of the use of web archives by researchers.

The workshop is promoted by the IIPC Training Working Group, which has created modules for initial training on web archiving, available for use by the community at https://netpreserve.org/web-archiving/training-materials/. The aim now is to go one step further, offering intermediate training content, exemplary training cases, and tried-and-tested strategies, so that there are ever more confident trainers and training offerings on web archiving.

The first part will introduce the concepts and terminology of web archiving from a training perspective. What are the main ones? Digital heritage, cultural heritage, WARC format, time stamp, collection creation? If you had to choose one, which would you include in a training course?

We will then present two training programs and share our experience: the IIPC TWG initial training program and the ResPaDon project for researchers at the University of Lille.

In the second part, we'll carry out a practical web page recording exercise, focusing on aspects related to training and learning, as well as the tools used. What does it take for anyone to be able to record a page?

For this "hands on" part of the session, we will focus on browser-based recording, using the ArchiveWeb.page tool (available for all) to explore the questions that arise whenever a trainer wants to impart knowledge about web preservation.

In conclusion, we will give participants a set of recommendations to guide the web archiving trainer.

Participants will be able to use their own computer, if they have an Internet connection, or share the exercise with other participants.

At the end of the session, as expected learning outcomes, participants should be able to:

  • choose the most important concepts and terminology to include in a training session
  • be familiar with available training materials and various training experiences on web archiving
  • design a training program
  • include practical web archiving exercises, using available tools
  • demonstrate how small-scale web archiving can be integrated into larger projects, such as national archives or institutional archives.

WORKSHOP#02: Leveraging Parquet Files for Efficient Web Archive Collection Analytics

Sawood Alam1, Mark Phillips2
1Internet Archive, United States of America; 2University of North Texas, United States of America

In this workshop/tutorial we intend to offer participants hands-on experience analyzing a substantial web archive collection. The tutorial will include an introduction to some existing archival collection summarization tools such as CDX Summary and the Archives Unleashed Toolkit, the process of converting CDX(J) files to Parquet files, numerous SQL queries for analyzing those Parquet files for practical and common use cases, and visualization of the generated reports.

Since 2008, the End of Term (EOT) Web Archive has been gathering snapshots of the federal web, consisting of the publicly accessible ".gov" and ".mil" websites. In 2022, the End of Term team began to package these crawls into a public dataset, which they released as part of the Amazon Open Data Partnership program. In total, over 460 TB of WARC data was moved from local repositories at the Internet Archive and the University of North Texas Libraries. From the original WARC content, derivative datasets were created that address common use cases for web archives. These derivatives include WAT, WET, CDX, WARC Metadata Sidecar, and Parquet files. The Parquet files were generated primarily from the CDX files and their ZipNum index, and include many derived columns (such as the domain name or the TLD) that were otherwise not available in the CDX files as separate columns. Furthermore, the Parquet files can be extended to include additional columns (such as soft-404, language, and detected content type) from the WARC Metadata Sidecar files. These files are publicly accessible from an Amazon S3 bucket for research and analysis.
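
As a rough sketch of how such a conversion can be done (not the EOT toolchain itself), the snippet below parses pywb-style CDXJ lines, derives extra columns such as the year and the TLD from the SURT key and timestamp, and writes a Parquet file with pandas/pyarrow; the field names follow common CDXJ conventions and are assumptions rather than the EOT schema.

  import json

  import pandas as pd  # plus pyarrow, which pandas uses as its Parquet engine

  def cdxj_records(path: str):
      """Parse pywb-style CDXJ lines of the form '<surt> <timestamp> <json block>'."""
      with open(path, encoding="utf-8") as fh:
          for line in fh:
              surt, timestamp, block = line.rstrip("\n").split(" ", 2)
              rec = json.loads(block)
              host = surt.split(")", 1)[0]           # e.g. "gov,loc,www"
              yield {
                  "surt": surt,
                  "timestamp": timestamp,
                  "year": int(timestamp[:4]),        # derived column
                  "tld": host.split(",")[0],         # derived column, e.g. "gov"
                  "url": rec.get("url"),
                  "status": rec.get("status"),
                  "mime": rec.get("mime"),
                  "digest": rec.get("digest"),
                  "warc_filename": rec.get("filename"),
              }

  df = pd.DataFrame(list(cdxj_records("example.cdxj")))  # placeholder input file
  df.to_parquet("example.parquet", index=False)           # columnar, compressed output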

The toolchain used to generate the derivative files in the EOT datasets was reused from the Common Crawl project. Moreover, the EOT datasets are organized in a structure similar to that of the Common Crawl dataset, making them a drop-in replacement for researchers who have used the Common Crawl datasets before. We plan to leverage the EOT datasets in the workshop/tutorial for hands-on experience. Furthermore, tutorials created for the workshop will be added to the EOT dataset documentation for future reuse and to serve as a guide for researchers.

The Parquet format is a column-oriented data file format designed for efficient data storage and retrieval. It is used to provide a different way of accessing the data held in the CDX derivatives. The Parquet format is used in many big-data applications and is supported by a wide range of tools. This derivative allows for arbitrary querying of the dataset using standard query languages like SQL and can help users better understand what content is in their web archive collections using tools and query languages they are already familiar with.

CDX files are in a text-based, columnized data format (similar to CSV files) that is sorted lexicographically. They are optimized for archival playback, but not necessarily for data analysis. For example, counting the number of mementos (captures) for a given TLD in a web archive collection requires processing only the rows that start with that TLD, because CDX files are primarily sorted by SURTs (also known as URL keys), which places the TLD at the very beginning of each line and allows binary search to locate the desired lines in a large file. However, counting mementos with a certain HTTP status code (say, "200 OK") would require processing the entire CDX dataset. Similarly, counting the number of captures for a given year would require traversing the whole CDX index.
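
The prefix lookup described above can be sketched with Python's bisect module over an in-memory, lexicographically sorted list of CDX lines; production CDX servers do the same thing against files and ZipNum blocks rather than lists, so this is an illustration of the principle only.

  from bisect import bisect_left, bisect_right

  def lines_with_prefix(sorted_cdx_lines: list[str], prefix: str) -> list[str]:
      """Return the contiguous run of CDX lines whose SURT key starts with `prefix`.

      Works because the lines are sorted lexicographically and the SURT key is the
      first field, so all matches sit in one contiguous slice found in O(log n).
      """
      lo = bisect_left(sorted_cdx_lines, prefix)
      hi = bisect_right(sorted_cdx_lines, prefix + "\xff")  # '\xff' sorts after any ASCII key
      return sorted_cdx_lines[lo:hi]

  cdx = sorted([
      "com,example)/ 20230101120000 {...}",
      "gov,loc)/ 20220615080000 {...}",
      "gov,loc)/index 20221105090000 {...}",
      "mil,army)/ 20230301100000 {...}",
  ])
  print(len(lines_with_prefix(cdx, "gov,")))  # captures under the .gov TLD: 2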

By contrast, Parquet files partition and store the data column-wise, which means that querying one column does not require processing the bytes of the other columns. Moreover, Parquet files store data in a binary format (as opposed to the text-based format used in CDX files) and apply run-length encoding and other compression techniques to optimize storage space and I/O operations. This reduces processing time for most data analytics tasks compared with the corresponding CDX data. Parquet files support SQL-like queries and have an ecosystem of tools, while CDX files are usually analyzed using traditional Unix text-processing CLI tools or scripts that operate on similar principles.
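
To make the contrast concrete, the queries below show the kind of column-oriented analysis the workshop targets, using DuckDB's Python API over a Parquet file; the column names (status, timestamp, tld) follow the toy conversion sketched earlier and are assumptions about any particular dataset's schema.

  import duckdb  # pip install duckdb

  # Captures per HTTP status code: only the `status` column is scanned.
  print(duckdb.sql("""
      SELECT status, COUNT(*) AS captures
      FROM read_parquet('example.parquet')
      GROUP BY status
      ORDER BY captures DESC
  """).df())

  # Captures per year for one TLD; the equivalent CDX job would scan the whole index.
  print(duckdb.sql("""
      SELECT substr(timestamp, 1, 4) AS year, COUNT(*) AS captures
      FROM read_parquet('example.parquet')
      WHERE tld = 'gov'
      GROUP BY year
      ORDER BY year
  """).df())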

In addition to enabling analysis of individual web archive collections using Parquet files, we hope that this workshop will encourage web archives to convert their entire CDX data into Parquet files and expose API endpoints to perform queries against their holdings. Moreover, we anticipate a future CDX API implementation that can operate on Parquet files, replacing CDX files completely, if it proves more efficient in both storage and the lookups needed for archival playback.

WORKSHOP#03: Crafting Appraisal Strategies for the Curation of Web Archives

Melissa Wertheimer
Library of Congress, United States of America

Web archives preserve web-based evidence of events, stories, and the people and communities who create them. Web archiving is also a vital tool to build a diverse and authentic historical record through intentional digital curation. Information professionals determine the “what” and “when” of web archives: collection topics, seed URL lists, crawl durations, resource allocation, metadata, and more. Appraisal documentation - the “why” and “how” of web archives - reveals the intentions and processes behind the digital curation to ensure accountability in the preservation of born-digital cultural heritage.

Melissa Wertheimer will present an expanded 80-minute version of a 2022 National Digital Stewardship Alliance Digital Preservation Conference (“DigiPres”) workshop. The co-authors and co-presenters of the original 2022 version were Meghan Lyon (Library of Congress, United States) and Tori Maches (University of California at San Diego, United States). All three professionals come from traditional archives backgrounds where appraisal documentation, archival values, and appraisal methods are standard practice for repositories. The workshop will facilitate an experimental environment in which participants consider how such archival values and appraisal methods used for analog and hybrid special collections also apply to web archives curation.

The intended audience includes attendees who make curatorial and collection development decisions for their organization’s web archiving initiatives. These web archiving practitioners will roll up their sleeves and craft targeted appraisal strategies in writing for thematic and event-based web archive collections.

Attendees will explore the use of prose, decision trees, and rubrics as forms of appraisal documentation for web archive collections. They will practice the application of archival values such as intrinsic value, evidential value, and interrelatedness as well as appraisal methods such as sampling and technical appraisal to evaluate whether websites are both in scope for collecting and feasible to capture.

Workshop participants are encouraged to bring working materials for hypothetical or realized web archive collections, including seed lists and collection scopes. Workshop leaders will provide Google Drive access to workshop materials, including a sample seed list for a thematic collection, a sample seed list for an event-based collection, sample appraisal documentation in the form of a narrative, and sample appraisal documentation in the form of a rubric.

Attendees will gain a comprehensive overview of American and Canadian archival theory and a list of supporting resources for reference. Participants will also develop an understanding of the differences between collection development policies, collection scopes, and appraisal strategies for web archives. They will also learn to apply existing appraisal theories and archival values to web archives selection and curation, and to evaluate and apply different types of appraisal documentation to meet their needs. The workshop will be web archiving tool agnostic; these concepts are relevant regardless of which tools attendees might use to capture and preserve web content.

Participants will have the best experience with their own laptop or tablet, open minds, a passion for mapping theory to practice, and a willingness to discuss and debate selection criteria, appraisal strategies, and documentation with colleagues. The workshop includes a brief overview presentation followed by time for both individual work and group discussion.

WORKSHOP#04: Run Your Own Full Stack SolrWayback (2024)

Thomas Egense, Victor Harbo Johnston, Anders Klindt Myrvoll
Royal Danish Library

An in-person, updated version of the '21 and '23 WAC workshop Run Your Own Full Stack SolrWayback:

  • https://netpreserve.org/event/wac2021-solrwayback-1/
  • https://netpreserve.org/ga2023/programme/abstracts/#workshop_06

This workshop will:

  • Explain the ecosystem for SolrWayback
    (https://github.com/netarchivesuite/solrwayback)
  • Perform a walkthrough of installing and running the SolrWayback bundle. Participants are expected to follow the installation guide and will be helped whenever stuck.
  • Leave participants with a fully working stack for index, discovery and playback of WARC files
  • End with an open discussion of SolrWayback configuration and features

Prerequisites

  • Participants should have a Linux, Mac or Windows computer with Java installed. To check that Java is installed, type this in a terminal: java -version
  • On Windows computers, an administrator account may be required.
  • Downloading the latest release of the SolrWayback bundle from https://github.com/netarchivesuite/solrwayback/releases beforehand is recommended.
  • Having institutional WARC files available is a plus, but sample files can be downloaded from https://archive.org/download/testWARCfiles
  • A mix of WARC files from different harvests/years will showcase SolrWayback's capabilities best.

Target audience

Web archivists and researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.

Background

SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction and visualisation, data export, and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source.
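
For readers who have not used Solr before, full-text queries against an index like the one underlying SolrWayback can be issued with a plain HTTP request to Solr's standard /select handler, as in the hedged sketch below; the host, port, collection name, and field names are placeholders and depend on how the bundle and its schema are configured.

  import requests  # pip install requests

  SOLR_SELECT = "http://localhost:8983/solr/my-webarchive/select"  # placeholder collection name

  # Ask Solr for captures matching a phrase, returning a few stored fields.
  resp = requests.get(SOLR_SELECT, params={
      "q": 'content:"climate change"',  # field name assumed; schemas differ
      "rows": 10,
      "wt": "json",
  })
  resp.raise_for_status()
  for doc in resp.json()["response"]["docs"]:
      print(doc.get("url"), doc.get("crawl_date"))  # field names assumed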

WORKSHOP#05: Unlocking Access: Navigating Paywalls and Ensuring Quality in Web Crawling (Behind Paywall Websites - Crawl, QA & More)

Anders Klindt Myrvoll1, Thomas Martin Elkjær Smedebøl1, Samuli Sairanen2, Joel Nieminen2, Antares Reich3, László Tóth4
1Royal Danish Library; 2National Library of Finland; 3Austrian National Library; 4National Library of Luxembourg

In a digital age characterized by information abundance, access to online content remains a significant challenge. Many valuable resources are hidden behind paywalls and login screens, making it difficult for researchers, archivists, and data enthusiasts to retrieve, preserve, and analyze this content. This tutorial, led by experts from web archives across Europe, aims to equip participants with the knowledge and tools necessary to tackle these obstacles effectively, providing a comprehensive guide to web crawling, quality assurance, and essential techniques for accessing content, with a focus on content behind paywalls or logins.

In recent years, institutions like the Austrian National Library, National Library of Luxembourg, Royal Danish Library, and National Library of Finland have been addressing the challenges posed by paywalls and restricted access to online content. Each institution has developed unique strategies and expertise in acquiring and preserving valuable online information. This workshop serves as an opportunity to pool this collective knowledge and provide hands-on training to those eager to venture into the world of web crawling.

Content of the Workshop

This tutorial will equip participants with the skills and knowledge required to navigate paywalls, conduct web crawls effectively, ensure data quality, and foster ongoing communication with site owners. The following key components will be covered:

  1. Accessing Paywalled Content:
    • Techniques to bypass paywalls and access restricted websites
    • Negotiating with newspapers and publishers to obtain login credentials
    • Strategies for requesting IP Authentication from site administrators
    • Browser plugins and user agent customization to enhance access
  2. Actually Crawling Content:
    • Exploration of web crawling tools, including Heritrix and Browsertrix
    • Utilizing Browsertrix Cloud and Browsertrix Crawler for efficient and scalable crawling
    • Using Browsertrix Behaviors for harvesting special content, such as videos, podcasts and flipbooks
    • Introduction to other essential tools for web harvesting
  3. Quality Assurance of Content:
    • Deduplication techniques and best practices
    • Implementing dashboards for IP-validation to ensure data integrity
    • Workshop segment on setting up the initial infrastructure and running a proxy at home
  4. Communication with Site Owners
    • Emphasizing the importance of communication with site owners
    • Highlighting the direct correlation between effective communication and access privileges
    • Strategies for maintaining ongoing relationships with content providers

There will be a short tutorial from each institution looking at different subjects from the list above.

Expected Learning Outcomes

Upon completing this tutorial, participants will have gained a robust skill set and deep understanding of the challenges and opportunities presented by paywall-protected websites. Specific learning outcomes include:

  • Proficiency in accessing paywalled content using various techniques
  • Better knowledge of how and when to use web crawling tools such as Heritrix and Browsertrix
  • Skills to ensure data quality through deduplication and visualization of IP-validation
  • Strategies for initiating and sustaining productive communication with site owners
  • The ability to apply these skills to unlock valuable content for research, archiving, and analysis

Target Audience

This tutorial is designed for anyone seeking to access and work with content behind paywalls or login screens. Whether you are a researcher, archivist, librarian, or data enthusiast, this tutorial will provide valuable insights and practical skills to overcome the challenges of restricted online access.

Technical Requirements

Participants are only required to bring a laptop equipped with an internet connection. This laptop will serve as their control interface for NAS, Heritrix, Browsertrix, and other relevant tools during the workshop.

WORKSHOP#06: Browser-Based Crawling For All: Introduction to Quality Assurance with Browsertrix Cloud

Andrew Jackson1, Anders Klindt Myrvoll2, Ilya Kreymer3
1Digital Preservation Coalition, United Kingdom; 2Royal Danish Library; 3Webrecorder, United States of America

Through the IIPC-funded “Browser-based crawling system for all” project (https://netpreserve.org/projects/browser-based-crawling/), members have been working with Webrecorder to support the development of their new crawling tools: Browsertrix Crawler (a crawl system that automates web browsers to crawl the web) and Browsertrix Cloud (a user interface to control multiple Browsertrix Crawler crawls). The IIPC funding is particularly focussed on making sure IIPC members can make use of these tools.

This workshop will be led by Webrecorder, who will invite you to try out an instance of Browsertrix Cloud and explore how it might work for you, and how the latest QA features might help. IIPC members that have been part of the project will be on hand to discuss institutional issues like deployment and integration.

The workshop will start with a brief introduction to high-fidelity web archiving with Browsertrix Cloud. Participants will be given accounts and be able to create their own high-fidelity web archives using the Browsertrix Cloud crawling system during the workshop. Users will be presented with an interactive user interface to create crawls, allowing them to configure crawling options. We will discuss the crawling configuration options and how these affect the result of the crawl. Users will have the opportunity to run a crawl of their own for at least 30 minutes and to see the results.

After a quick break, we will then explore the latest quality assurance features of Browsertrix Cloud. This includes 'patch crawling': using the ArchiveWeb.page browser extension to archive difficult pages and then integrating those results into a Browsertrix Cloud collection.

In the final wrap-up of the workshop, we will discuss challenges, what worked and what didn't, what still needs improvement, etc. We will also outline how participants can provide access to the web archives they created, either using standalone tools or by integrating them into their existing web archive collections. IIPC member institutions that have been testing Browsertrix Cloud thus far will share their experiences in this area.

The format of the workshop will be as follows:

  • Introduction to Browsertrix Cloud
  • Use Cases and Examples by IIPC project partners
  • Hands-On: Setup and Crawling with Browsertrix Cloud (Including Q&A / help while crawls are running)
  • Hands-On: Quality Assurance with Browsertrix Cloud
  • Wrap-Up: Final Q&A / Discuss Access & Integration of Browsertrix Cloud Into Existing Web Archiving Workflows with IIPC project partners

Webrecorder will lead the workshop and the hands-on portions of the workshop. The IIPC project partners will present and answer questions about use-cases at the beginning and around integration into their workflows at the end.

Participants should come prepared with a laptop and a couple of sites that they would like to crawl, especially sites that are generally difficult to crawl by other means and require a 'high fidelity' approach (examples include social media sites, sites behind a paywall, etc.). Ideally, the sites can be crawled in the course of 30 minutes (though crawls can be interrupted if they run for too long).

This workshop is intended for curators and anyone wishing to create and use web archives who are ready to try Browsertrix Cloud on their own with guidance from the Webrecorder team. The workshop does not require any technical expertise beyond basic familiarity with web archiving. The participants' experiences and feedback will help shape not only the remainder of the IIPC project, but also the long-term future of this new crawling toolset.

POSTERS

POSTER SESSION 1:

Many Hands Make Light(er) Work: Collaborative Web Lifecycle Management

Sara Day Thomson, Alice Austin, Stratos Filalithis, Bruce Darby
University of Edinburgh, United Kingdom

This talk will explore the development of a collaborative Web Lifecycle Management programme at a large UK HEI. The HEI first embarked on web archiving as part of an initiative to collect community responses to the Covid-19 pandemic. This initiative provided a valuable opportunity to explore closer working with a national web archive and to demonstrate the value of web archiving to the digital preservation of the institutional record. Further collaboration with the national web archive as part of an externally-funded project supported recruitment of a dedicated Web Archivist. This increased visibility and capacity allowed the institutional Archive to collaborate more meaningfully with the Website and Communications team who support web services for the institution’s many diverse communities. This talk will describe this collaboration and the shared roles and responsibilities for ensuring a sustainable and effective web service for users, from creation to long-term preservation.

The institutional web estate (both at the main URL and thousands of off-domain URLs) encompasses a vast array of valuable resources – from press releases to research project websites. In the absence of established pathways to the Archive for digital records, web archiving provides a strategy to capture some record of almost every activity and function of the institution, all in the context of their presence on the Web. As a result of successful pilot projects, support for the Web Lifecycle programme has grown across the institution, especially from individual school and college administrations and from corporate communications and marketing. The growing capacity in web archiving also better supports the integration of web-based works into the curation of hybrid collections. The art collection curators, in particular, have engaged with web archiving to enhance acquisition of new works and support teaching resources.

This Web Lifecycle programme partners closely with the national web archive, using that infrastructure and guidance as a core strategy to support the preservation of most content in scope. As the capture of the institution's web content falls within the legal collecting remit of the national web archive, this partnership allows the Archive to focus on quality assurance, troubleshooting, metadata creation, and creator relationships. This contribution improves the existing records in the national web archive and also supports institutional requirements without maintaining a local system or contracting a third-party vendor.

This talk will discuss the benefits of this collaborative approach to web archiving in meeting institutional digital preservation requirements. It will also address some of the limitations and challenges, such as licensing and aligning information policies across a complex, devolved web estate, and will present communication and advocacy strategies and how collaborative business cases helped secure buy-in from senior management. Ultimately, this talk will demonstrate how collaboration – across the institution and externally with a national web archive – catalyses progress in preserving one of the largest and most diverse and inclusive records of activity at a major HEI: its web estate.

How To Implement The Long-term Preservation Of The Web? The Journey Of The Publications Office Of The European Union

Corinne Frappart
Publications Office of the European Union, Luxembourg

The web is ephemeral. Its content has a short lifespan and web technologies evolve quickly. Archivists are conscious that capturing the live web is only the very first step in keeping a record of it for future generations. But this capture is complex: it requires specific tools, technical skills and substantial storage space. That is probably why, although crawling methods and the organisation of access to the collections have seen many recent improvements, a lot of work remains to be done to tackle the long-term preservation of the resulting ARC/WARC files.

The Publications Office of the EU takes an interest in safeguarding its web archive files and in planning preservation actions. Being responsible for the preservation of the websites authored by the EU institutions, our aim is to mitigate the risks of obsolescence and loss of file usability, and to maintain a preserved copy of the collection in case of unforeseen issues with the version made available for public access.

At the 2023 IIPC conference, the Publications Office presented a set of observations and conclusions following a review of the state of the art in long-term preservation of web archives. The present paper is the follow-up to that communication, as we now wish to share our experience with the next step of our project.

Putting this project into practice means translating general principles into concrete measures adapted to our system, addressing the following aspects:

  • What are the exact features of our collections (number of crawls, number of files per crawl, compression, type of files, identifiers, etc)?
  • How should our archival information packages (AIP) be defined (ID, content, …)? Which additional content should be preserved besides the ARC/WARC files?
  • Which metadata do we need to extract from the files to enable search and discovery of content?
  • What are the minimal controls to put in place while ingesting the files into our long-term preservation platform?
  • Which preservation actions can be performed after the ingestion of the collections into our long-term preservation platform?

Our approach to addressing these questions will be treated in detail in the presentation, in the hope that it can help the community in its reflection towards better long-term web preservation.

Digital Resources – Slovak Web Archive in Context with the New Legislation

Jana Matúšková, Peter Hausleitner
University Library in Bratislava, Slovak Republic

POSTER

Archiving the web and its ever-changing content is a vital part of preserving intangible cultural heritage for future generations. This task is being undertaken in Slovakia by the University Library in Bratislava. The library’s Digital Resources Deposit workplace has been actively compiling selective, thematic, and extensive collections of Slovak websites within the project Digital Resources – Web harvesting and e-Born content archiving. In addition, it is also archiving original electronic online publications.

This contribution presents the experience and results achieved during the operation of the Digital Resources system, with a particular focus on recent endeavours. It presents the conducted campaigns, additions to electronic series and monographs, new media types covered by the legislation, the directory of news portals provided by the Ministry of Culture of the Slovak Republic, and the current state of its archiving.

The authors pay the main attention to the new Act on Publishers of Publications and on the Register in the Media and Audiovisual Sector (the Publications Act), effective in Slovakia since August 1st 2022. The new Slovak legislation partly takes into account the existence of the digital environment and extends the definition of a periodical publication to include electronic publications in multiple formats. A new feature is the definition of a news web portal. The Act obliges the University Library in Bratislava to archive news web portals. Starting in January 2023, fifteen selected news websites were archived once a week as part of a regular campaign. However, under the Act, the content of these web portals cannot be made available to the public without the publisher’s permission.

Why WARCs Are Complex But Not Messy

Iris Geldermans
National Library of the Netherlands

WARCs are at the heart of our work, but we do not give them enough credit. In presentations and conferences there is often the complaint that the WARC file is difficult to work with because it contains messy data.

In my presentation I wish to explain how a WARC file is built, giving examples of how live parts of a website are constructed and structured inside the file. I wish to make the point that while, yes, a WARC contains a lot of different data, making it complex, it is not a messy source. It is the source material, the website or the web, which is messy. The WARC itself is actually quite structured: it wraps each piece of the website or web, whether it is an image, video, text, response or script, in a consistent bit of metadata, such as the record type, the URI and the date, creating structure within a source that consists of so many different elements. It thereby not only creates structure but also records why something is not there, for example by noting the response that was received.
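As a small, purely illustrative sketch of this structure, the open-source warcio library (an assumption of this example, not a tool named in the abstract) can read a WARC and print the per-record metadata described above; the file name is a placeholder:

  # Read a WARC and show that every record carries its own metadata headers.
  from warcio.archiveiterator import ArchiveIterator

  with open("example.warc.gz", "rb") as stream:
      for record in ArchiveIterator(stream):
          print(
              record.rec_type,                                   # response, request, revisit, metadata, ...
              record.rec_headers.get_header("WARC-Target-URI"),  # which URL the record is about
              record.rec_headers.get_header("WARC-Date"),        # when it was captured
          )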

Furthermore, the ‘messiness’ of a WARC is also determined by how we use Heritrix and the choices we make in the profile settings, not by the WARC format itself. The more we use broad domain crawl settings (with a high ‘trans hops’ setting, for example), the larger the chance that a set of WARC files contains many different websites and duplicates. At our library we use a different approach, where we harvest websites one at a time; deep instead of broad.

I conclude my presentation by theorising that combining both approaches (broad to find domains and deep to harvest those domains one at a time) can lead to more consistent WARCs.

Report on the Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s)

Sharon Healy1, Helena Byrne2
1Independent Researcher, Ireland; 2British Library, United Kingdom

POSTER

This poster illustrates some of the key findings of the WARCnet Special Report ‘Scholarly Use of Web Archives Across Ireland: The Past, Present & Future(s)’, published in August 2023. The purpose of the WARCnet Special Report was to:

  • Examine the causes of the loss of digital heritage and how this relates to Ireland.
  • Offer an overview of the landscape of web archives based across Ireland, and their availability and accessibility as resources for Irish-based research.
  • Provide some insight into the awareness of, and engagement with, web archives in Irish third-level academic institutions.

In this poster we offer an overview of the landscape of web archives based across Ireland. First, we discuss some of the main causes of the loss of Irish digital heritage. In doing so, we explore the relationship between legal deposit legislation and the preservation of national heritage, and observe how web archiving is a necessary activity for the preservation of digital heritage. Then we look at scholarly engagement with web archives, highlight some of the challenges experienced by the web archive user community, and assess the availability and accessibility of web archives based on the island of Ireland for conducting research on Irish-based topics. To end, we briefly examine scholarly awareness of and engagement with web archives in Irish academic institutions.

In doing so, we offer some perspectives which may be useful when it comes to providing support and incentives to assist scholars and educators in the use of the archived web for Irish-based research and teaching. This case study will benefit not only web archive users but also the wider web archiving community, as many of the challenges faced by the Irish web archiving community and Irish-based researchers will not be unique to Ireland.

Blog to Bytes: Exploring the UK Web Archive’s Blog Posts Through Text Analysis

Helena Byrne, Carlos Lelkes-Raugal, Joan Francis
British Library, United Kingdom

POSTER

Blogging has at times been the only source for knowledge sharing within the web archive community. It is only in recent years that a large body of academic publications has become available. However, blogging is still a popular format for practitioners to share the latest updates on the technical as well as the curatorial side of web archiving, and to report back on web archive-related events. In April 2023, the UK Web Archive celebrated ten years of Non-Print Legal Deposit through a blog post that reflected on how that legislation shaped the work we do.

The UK Web Archive blog is an important source for any practitioners and researchers interested in seeing how web archiving in the UK has changed over time. All British Library blogs have been archived by the UK Web Archive and are openly accessible through the UK Web Archive website. In 2018, the requirements on how public sector organisations in the UK publish on the web changed, to ensure that all postings are in line with accessibility guidance. This change to how we publish on the UK Web Archive blog is an opportunity to reflect on what had previously been discussed there. This poster will illustrate what was discussed on the UK Web Archive blog from November 2011 to December 2018.

Podcasts Collection At The Bibliothèque Nationale de France: From Experimentation To The Implementation Of a Functional Harvest

Nola N'Diaye, Clara Wiatrowski
National Library of France

POSTER

In June 2023, the Bibliothèque nationale de France launched its first podcast harvest of almost 200 selections produced by our correspondents. This unprecedented collection was the result of a workshop carried out in close collaboration with the BnF's Department of Sound, Video and Multimedia and the IT Department. During this workshop, we were able to explore the possibilities of harvesting from several platforms that do not require authentication (Anchor, Ausha, Soundcloud, Podcasts France, Podcloud...), but also to sketch the first outlines of a collection policy and the technical configuration of a podcast collection at the Bibliothèque nationale de France.

This presentation aims to explain the transition from experimentation during a workshop to the creation and implementation of a functional harvest.

We propose to go back over the various stages of this process: making contact with certain players in the sector, developing a selection methodology for our colleagues, reflecting on the chosen collection axes, and finally the technical tests inherent in setting up each collection within the institution. We will also take a look at quality control and the results of this first collection, which was full of surprises and constant adjustments. Indeed, launching a harvest on a larger scale than during the workshop brought to light other characteristics and technical difficulties specific to the podcast object. We would like to come back to this post-collection analysis, which was rich in lessons and invaluable for the next collections planned for 2024. Finally, this presentation aims to highlight the responsiveness and quality of the dialogue between the web legal deposit section, fellow librarians and the IT department.

POSTER SESSION 2:

Fixing Broken Links with Arquivo404

Vasco Rato, Daniel Gomes
Arquivo.pt, Portugal

Link rot has been a prevalent problem since the early days of the web. As websites evolve over time, some of their URLs which used to reference valid information become broken. Thus, bookmarks, citations or other links to these URLs come to reference an error page instead: the infamous HTTP error 404 “Not Found”. As a consequence, website users get frustrated when they receive a dead-end error message instead of the page they invested time finding, through search or browsing, and desired to visit.

The Arquivo.pt Arquivo404 service aims to mitigate the damage caused by the “404 Not Found” problem. It is a free, open-source software project that improves “Not Found” error pages on any website by providing a link to a web-archived version of the missing page.

The website owner just needs to insert a single line of code in the page that generates the 404 error message. Arquivo404 uses the Memento protocol to find web-archived versions, allowing any Memento-compliant web archive to also be searched (Cross-Origin Resource Sharing, CORS, must be enabled on the Memento service). When a user tries to access a page that is no longer available on a website, Arquivo404 automatically checks if there is a version of that webpage preserved in Arquivo.pt (or other web archives). If the webpage is web-archived, a link is presented so that the user may visit this version. If it was not archived, the default “Not Found” error page is presented.
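Arquivo404 itself is a small JavaScript snippet; the following Python fragment is only a hedged illustration of the kind of Memento (RFC 7089) lookup it performs behind the scenes. The TimeGate base URL is an assumption made for this sketch, not taken from the Arquivo404 documentation:

  import requests

  # Assumed Memento TimeGate base; real integrations should use the endpoints
  # documented by the web archive(s) being queried.
  TIMEGATE = "https://arquivo.pt/wayback/"

  def find_memento(missing_url, accept_datetime="Thu, 01 Jan 2015 00:00:00 GMT"):
      """Return the URI of an archived copy of missing_url, or None."""
      resp = requests.get(
          TIMEGATE + missing_url,
          headers={"Accept-Datetime": accept_datetime},  # datetime negotiation per RFC 7089
          allow_redirects=False,
          timeout=10,
      )
      # A Memento TimeGate answers with a redirect to the selected memento.
      if resp.status_code in (302, 303, 307):
          return resp.headers.get("Location")
      return None

  print(find_memento("http://www.fct.pt/"))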

The behaviour of Arquivo404 can easily be configured through a set of methods (https://github.com/arquivo/arquivo404#readme), for instance to create a customised message in a different language, to define the date range of the web-archived mementos, or to add web archives to be searched.

A real-world use case for Arquivo404 occurred when our organisation, the FCT: Foundation for Science and Technology, launched a radically new version of its website (www.fct.pt) in 2022 to replace the previous version launched in 2010. This migration generated Not Found errors for users who were looking for information on the old website by following links hosted on external websites and accumulated over 12 years. Arquivo404 was installed on the new website between June and July 2023. FCT.pt became the top external referrer of Arquivo.pt, followed by Wikipedia.

From January to September 2023, the websites which installed Arquivo404 enabled their users to find 1 882 distinct web-archived pages that otherwise would have been laconic error messages. The obtained results show that Arquivo404 is a useful tool for Internet users and raises awareness about the importance of web archiving.

From Theory to Practice: The First Steps in Social Media Archiving

Susanne van den Eijkel1, Zefi Kavvadia2, Lotte Wijsman1
1National Archives of the Netherlands; 2International Institute of Social History, Netherlands

POSTER

Institutions all around the world are interested in creating social media archives. These archives can tell us more about public debates on certain topics (e.g. via hashtag content), they can complement existing collections of cultural heritage institutions, or they can represent online content produced by politicians and government officials. Therefore, the goals and purposes for archiving social media may differ, but the tools being used, the technical challenges faced, the ethical issues that may occur, and the ongoing changes in the platforms persist.

During last year's Web Archiving Conference in Hilversum and a study day in Antwerp with Belgian and Dutch institutions, we gave a workshop on archiving social media to open up the conversation about different goals, purposes and approaches to institutional social media archiving. We felt that, while the approaches and methods used in practice are always subject to change, one's goals and purposes can persist through time. We asked our participants to use the social media archiving typology elements that we designed to create their own sequences describing an archiving actor and their characteristics, high-level purposes, and goals and requirements. The exercise was meant to help focus on the start of social media archiving, instead of the end (in other words, choosing the method or tool should preferably come after this design and planning process). We also asked participants not only to fill this in for their own cases, but also to work with an out-of-the-box scenario which we provided for them. This would allow them to look at other perspectives and consider: what are the differences and similarities, and are there patterns to be discovered?

With this poster we want to invite delegates to have a discussion about social media archiving and how taking a step back and thinking about our goals and purposes can be beneficial. This poster helps us to continue the discussion on how we can find more "good practice" instead of focusing only on the best practice, and connect this with the need to discover and be aware of our contexts as organizations and individuals. The goal of our workshop was to provide more ways to address organizational issues and why these are important. Social media archiving is not only about using the best tools or acquiring technological skills. Sustainable and principled archiving of web content also needs us to take a step back and decide why and how we want to archive, which resources we have and which approach might fit our circumstances best.

This proposal fits best with the conference topic of curation. It focuses especially on strategies for collecting from social media platforms and building collections and exhibits. Next to this conference topic, this proposal is part of the ongoing discussion in the international web archiving community about weighing the pros and cons of archiving social media content.

Some URLs Are Immortal, Most Are Ephemeral

Kritika Garg1, Sawood Alam2, Michele Weigle1, Michael Nelson1, Jake LaFountain2, Mark Graham2, Dietrich Ayala3
1Old Dominion University, United States of America; 2Internet Archive, United States of America; 3Protocol Labs, United States of America

"How long does a web page last?" is often answered with "44 to 100 days", but the web has significantly changed since those numbers were first given in 1996. We examined how webpage lifespans have evolved, considering data with a sample of 27.3 million URLs archived from 1996 to 2021 by the Internet Archive's (IA) Wayback Machine.

We found that only 35% of our 27.3 million URLs were still active as of 2023, indicating that a significant portion of web content becomes inactive over time. Our preliminary analysis also suggests that these numbers are not significantly inflated with soft 404s, parked domains, and other phenomena. We encountered DNS failures for 30% of our dataset’s 7 million unique domains.

Surprisingly, almost half of the URLs initially archived in 1996–2000 were still active in 2023, suggesting the longevity of some early URLs. For example, popular sites like apple.com and nasa.gov continue to be active and will likely continue to exist as long as there is a web. On the other hand, some URLs had such ephemeral lifespans that they defied measurement. Nearly 30% of the URLs had very short lifespans, with either only one archived page or no "200 OK" mementos, suggesting a brief existence or limited archival interest. We also observed that archival interest correlates with URL activity: if a URL remains unarchived for an extended period, the URL has likely become inactive.

The average lifespan of a web page in our dataset is 1.6 years. However, this average conceals the bimodal nature of root URLs, where around 10% persist for less than a year, and nearly 20% thrive for over 20 years, resulting in an average lifespan of approximately 3.9 years. The lifespan of deep links is more ephemeral, with 50% of them becoming inactive within a year, resulting in an average lifespan of 1.3 years.

We examined web page half-life, which is the time it takes for half of the pages to disappear. Root URLs had a longer half-life of nine years compared to one year for deep links. URLs from different decades exhibited varying lifespans, with URLs from the 1990s having a substantial half-life of 15-20 years, while URLs from the early 2000s had a shorter half-life of 6–7 years, and URLs from 2003 to 2021 had a half-life ranging from 6 months to 3 years.
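As a toy illustration of the half-life notion used here (invented numbers, ignoring the right-censoring of URLs that are still active), the half-life of a small, fully observed cohort is simply the median of its lifespans:

  from statistics import median

  # (year first archived, last year the URL still returned 200 OK) per URL; toy data
  toy_lifespans = [(1998, 2023), (2005, 2007), (2010, 2011), (2012, 2023), (2015, 2016)]

  half_life_years = median(last - first for first, last in toy_lifespans)
  print(f"half-life of this toy cohort: about {half_life_years} years")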

Using the IA as a source for sample URLs provides the only realistic, public option to study the evolution of the web at this scale and duration. However, there are well-known classes of pages that are not present in the Wayback Machine, and it should be emphasized that our findings only apply to publicly archivable pages. These findings provide a more nuanced understanding of web page longevity, emphasizing that while some URLs survive a very long time, most have an ephemeral lifespan.

Exploring Web Archives Using Hyphe, A Research Oriented Web Crawler

Sara Aubry1, Benjamin Ooghe-Tabanou2
1National Library of France; 2Sciences Po médialab, France

POSTER

Developed by Sciences Po médialab as open source software (https://github.com/medialab/hyphe), Hyphe was designed to provide researchers and students with a research-oriented crawler to build and enrich corpora of websites. It uses the links between them to map web territories and enables the study of community structures.

A step-by-step iterative process supports Hyphe users in dynamically curating and defining “web entities” in a way that is both granular and flexible by choosing single pages, a subdomain, a combination of websites, etc. The pages residing under these entities are then crawled in order to extract the outgoing links and part of the textual contents. The most cited “web entities” can then be prospected manually in order to enrich the corpus before visualizing it in the form of a network and exporting it for cleaning and analysis in other tools such as Gephi.

As part of the ResPaDon project, Sciences Po médialab and the National Library of France (BnF) organized, ran and evaluated an experiment based on the use of the Hyphe web crawler on web archives: Hyphe has been extended to work with the past web. The “Archives de l'internet”, the BnF's web archive search application, and Hyphe are now able to work with one another. This integration allows the curation of both the live web and the BnF's, INA's and the Internet Archive's web archives.

After the Social Media Archiving Project in Belgium: Contributing to a Better Archiving Context for Small Heritage Institutions

Katrien Weyns
KADOC-KU Leuven, Belgium

"Collaboration is the key to sustainable social media archiving for small and medium-sized heritage institutions in Belgium," a conclusion reached by KADOC and meemoo after three years of researching best practices for social media archiving. Archiving social media is a complex task, and it's often too much for small institutions to handle on their own. Even when they try, the platforms themselves make it difficult to collect data.

Social media archiving is both time-consuming and knowledge-intensive. There are various social media platforms in use, and there's no single tool to archive them all efficiently. Heritage workers find themselves juggling multiple archiving tools. Furthermore, the context in which we archive constantly changes: new platforms emerge, features within existing platforms evolve, archiving tools get blocked, new tools are developed, and legislation is amended. Keeping up with these changes requires a dedicated team, but most small organizations lack even a dedicated web archivist. Thus, collaboration, where knowledge and experiences are shared, becomes essential to enhance social media archiving in Belgium; otherwise, valuable historical information may be lost over time.

In this poster session, we provide an update on the initiatives that followed the research project. We evaluate the impact of the created practitioner network on archiving practices within various contexts where heritage workers operate: public and private archives, heritage cells and libraries. Have the barriers to social media archiving been lowered? Is there a growing interest in archiving social media, leading to more archived content? What challenges remain to be addressed? The poster is structured by the different aspects of social media archiving: selection, data collection, replay and analyses, research and legal framework.

Exploring Polyvocality in Online Narratives of French Colonial Memory Through Web Archives: Methodological and Ethical Considerations

Sophie Gebeil1, Véronique Ginouves2, Christine Mussard3
1Aix-Marseille University, France; 2Centre National de la Recherche Scientifique, AMU, France; 3Aix-Marseille University, France

POSTER

French web archives are crucial for studying the evolution of recent online narratives about polyvocal memories of colonization in the Maghreb and its postcolonial consequences. This poster outlines the methodologies and ethical considerations of two case studies conducted by historians and archivists within the European XXXX program. Firstly, it presents the outcomes of employing computational methods to investigate the recollections of the 1983 March for Equality and Against Racism within the online media landscape, through an interdisciplinary approach in collaboration with the X Lab of the XXX. Semantic network analysis of a corpus constructed in partnership with web archivists and focused on 2013, the year of the March’s thirtieth anniversary, underscores the instrumentalization of this historical event. It reveals the event's overshadowing by far-right discourses and an excessive focus on Muslim populations, effectively pushing the Marchers and their demands into the background.

Secondly, the poster introduces a qualitative and historical examination of blogs archived by XXX, representing the voices of former repatriates from Algeria and formerly colonized Algerians who use the web to share their memories, particularly related to school experiences. The analysis demonstrates how these online publications contribute to the reinterpretation of past realities characterized by separation and segregation. Lastly, the poster addresses the ethical considerations associated with web archives pertaining to the colonial era and the March of 1983. This discussion focuses on the challenges of disseminating research findings in a context marked by strong calls for the decolonization of archives.

Quality Assurance In Webarchiving

Trienka Rohrbach
National Library of the Netherlands

What is web archiving worth when the quality of the harvests is not checked? There are websites that no longer exist, websites that are only partly crawled or that have missing images. So many problems occur during web archiving that the quality of the harvests has to be a subject of investigation. Checking the quality of the harvests has to be part of the whole procedure of harvesting websites, especially because of the way the KB | National Library of the Netherlands harvests websites: as we are not yet allowed to do a domain crawl in the Netherlands, we have a selective collection that is composed by our collection specialists.

After harvesting, we look at the different harvests, large and small, and review them: Is the harvested website still the website we intended to harvest, or has it changed into another website on another subject? Does the website even still exist? We encountered all of these problems and wanted to do something about them, because we want to improve the quality of our web archiving results.
So we developed a quality assurance procedure that starts with a query on the database every 14 days. With the outcome of that query our quality assurance officers can determine which harvests need to be checked and adjust the ones where adjustment is necessary. They divide them into groups: small harvests, large ones and those in between. And for each of those different kinds of harvests we developed a different way of handling them.

The large harvests, for example, are checked against limits on data volume, duration and the number of elements harvested. Another part of this procedure is finding harvests that exclude important content of the website, or noticing that a website is no longer about a windmill but is now selling wooden shoes, which was not what our collection specialist wanted to harvest.

The small harvests are of course also checked against limits on data volume, duration and the number of elements harvested, but we also check whether the URL is still the right one and (as sometimes happens with small websites) whether it has become a sub-site of a larger website.

Then there are websites where we harvest a large amount of by-catch and/or encounter crawler traps. By working with excludes we are able to rule out that by-catch and/or those crawler traps.

On a poster I will show the different ways we improve the quality of our harvest results, through data dumps, workflows, the settings of Heritrix, the Web Curator Tool and other tips and tricks.

‘The least cool club in the world’: Building Capacity to Deal with Challenging Crawls at the UK Government Web Archive’s Regex Club

Jake Bickford
The National Archives, United Kingdom

POSTER

Crawler traps are a familiar issue for the web archiving community, leading to crawls taking longer than necessary, capturing erroneous or useless material, creating an unnecessary environmental and financial burden, and potentially missing in-scope content. But dealing with crawler traps requires specialist knowledge: of the type of web technology that causes them, the behaviour of the crawler, the specifics of the target site, and of the sort of regex patterns that can effectively prevent the crawler from falling into them. This poster will illustrate one archive’s efforts to build a sustainable workflow for the identification and mitigation of these problems.

At the UK Government Web Archive the knowledge required to address these issues has been built up informally over time, and as such is unevenly distributed throughout the team. For the last year we have instituted a project aimed at formalising the way in which we address challenging crawls, and sharing the knowledge required to do so. This has centred on a regular Crawl Log Review workshop (informally known as ‘Regex Club’) where problematic crawls are discussed, their logs analysed, and appropriate reject regex patterns are identified and tested. This has become a forum for knowledge sharing and capacity building. To support this process a new workflow has been developed using simple Python scripts to extract the information from various systems required to assess each crawl, and to measure the results of our interventions. A wiki is used to record our findings and to develop a typology of challenging crawls so that existing patterns can be adapted and re-used where appropriate.
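Purely as a hedged illustration (not the actual scripts used at The National Archives), the following Python fragment shows the kind of crawl-log triage done in such a session: counting the busiest URL sections in a Heritrix crawl.log and dry-running a candidate reject regex. The log path, the column position of the URI and the regex are assumptions of this sketch:

  import re
  from collections import Counter
  from urllib.parse import urlsplit

  CRAWL_LOG = "crawl.log"                                        # assumed Heritrix crawl log
  CANDIDATE_REJECT = re.compile(r".*/calendar/\d{4}/\d{2}/.*")   # example trap pattern

  prefix_counts = Counter()
  would_reject = 0
  with open(CRAWL_LOG, encoding="utf-8", errors="replace") as log:
      for line in log:
          fields = line.split()
          if len(fields) < 4:
              continue
          uri = fields[3]                  # assumed: downloaded URI in the 4th column
          parts = urlsplit(uri)
          # Bucket by host plus the first two path segments to spot runaway sections.
          prefix = parts.netloc + "/" + "/".join(parts.path.split("/")[1:3])
          prefix_counts[prefix] += 1
          if CANDIDATE_REJECT.match(uri):
              would_reject += 1

  for prefix, n in prefix_counts.most_common(20):
      print(f"{n:8d}  {prefix}")
  print("URIs the candidate regex would have rejected:", would_reject)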

This poster will share our workflow, our approach to running the workshops, the tools we use, common issues and solutions, and some examples of the impact of our work. It will provide an opportunity to discuss this important issue with the wider web archiving community, share experiences and good practice, and hopefully inspire some attendees to start a ‘Regex Club’ of their own.

LIGHTNING TALKS

Generative AI In Streamlining Web Archiving Workflows

Lok Hei Lui
University of Toronto, Canada

The recent advancements in generative AI have facilitated its application across diverse sectors, most notably in computer programming. Programming techniques can be used to automate tasks and integrate them with application programming interfaces (APIs), ultimately improving productivity. This also applies to web archiving tasks, especially when a large number of resources needs to be archived. Notably, professionals in the GLAM sector (Galleries, Libraries, Archives, and Museums) may not possess the coding knowledge to implement complex coding strategies for the above methodologies. Generative AI services are helpful in lowering the threshold to accomplish these coding tasks.

In this short presentation, the presenter, a student assistant working in a library, will showcase how to take advantage of generative AI to streamline the workflow and perform mass web archiving on the Archive-It platform. With an advanced-beginner background in Python and leveraging code generation with generative AI, the presenter crafted a solution for a government reports web archiving project in his workplace. The initial challenge was the inability of the Archive-It platform to archive the designated web resources. However, a workaround was developed by integrating parsing libraries, loops, and APIs, and nearly 15,000 PDF reports published by a government statistics bureau were crawled and stored in the Archive-It platform. The solution also includes a CSV file, generated by the scripts, containing the metadata for each archived PDF file.
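The presenter's actual scripts are not reproduced here; the fragment below is only a generic sketch of the kind of task described, collecting PDF links from a listing page and writing a metadata CSV, using the requests and BeautifulSoup libraries. The listing URL is a hypothetical placeholder:

  import csv
  from urllib.parse import urljoin

  import requests
  from bs4 import BeautifulSoup

  LISTING_URL = "https://stats.example.gov/reports/"   # hypothetical listing page

  html = requests.get(LISTING_URL, timeout=30).text
  soup = BeautifulSoup(html, "html.parser")

  rows = []
  for link in soup.find_all("a", href=True):
      if link["href"].lower().endswith(".pdf"):
          rows.append({
              "url": urljoin(LISTING_URL, link["href"]),   # absolute URL of the report
              "title": link.get_text(strip=True),          # link text as a rough title
          })

  with open("pdf_metadata.csv", "w", newline="", encoding="utf-8") as fh:
      writer = csv.DictWriter(fh, fieldnames=["url", "title"])
      writer.writeheader()
      writer.writerows(rows)

The resulting URL list could then be fed to the crawler as seeds and the CSV reused as per-document metadata.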

The presenter will walk through the logical progression leading to this solution's inception. The overarching goal is to furnish a paradigm that empowers individuals with limited or no programming experience. This will equip them with the requisite starting point and foundational understanding to perform automation and API integration with computer programming languages, largely with the help of manipulating generative AI technologies.

Scaling Web Archiving: The Challenges of Deduplication

Alex Dempsey
Internet Archive, United States of America

Scaling our web archiving efforts presents a host of intricate challenges. Amongst these is the pursuit of capturing unique content while filtering out the redundant—a necessary stride towards optimizing storage, enhancing performance, and ensuring cost-efficiency. Deduplication, although just one facet of the scaling matrix, casts a significant influence on our overarching crawling methodology.

At its core, the concept of deduplication is straightforward (see the sketch after this list):

  • Compute a hash of the acquired payload
  • Verify if this hash has been previously recorded
  • In the case of redundancy, record a revisit in the WARC
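A minimal sketch of that loop, assuming an in-memory digest index (real crawlers consult CDX indexes or a shared deduplication service instead):

  import hashlib

  seen = {}  # payload digest -> (original URI, original capture timestamp)

  def dedup(uri, timestamp, payload: bytes):
      """Decide whether to store a full response or a revisit record."""
      digest = "sha1:" + hashlib.sha1(payload).hexdigest()
      if digest in seen:
          orig_uri, orig_ts = seen[digest]
          # Identical payload seen before: write a WARC 'revisit' record
          # pointing at the earlier capture instead of storing the bytes again.
          return ("revisit", digest, orig_uri, orig_ts)
      seen[digest] = (uri, timestamp)
      # New content: write a full WARC 'response' record with the payload.
      return ("response", digest, None, None)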

However, as we scale, multifaceted issues arise.

Efficiency with Volume: Handling terabytes of CDX presents challenges. What strategies can we employ to perform lookups efficiently?

Distributed Cluster Synchronization: Browser-based crawling in particular calls for horizontal scaling with parallel crawlers. In a vast distributed system, ensuring that all crawlers see each other's contributions in a timely manner is crucial. How do we effectively mitigate synchronization delays?

Scope Diversification: Is deduplication global, or confined to different scopes like time, collection, or crawl?

Hash Check Limitations: The hashing model is not foolproof. For instance, a minor discrepancy in a lengthy video stream results in a new hash, although the video remains largely unchanged. Such mishaps can lead to unnecessary storage consumption.

Browser-Driven Crawling Dynamics: Browser-based crawlers, by simulating user interactions, let browsers arbitrate network requests. This can lead to higher fidelity capture. However, this very strength can be a double-edged sword. Intense web page activity can flood revisit logs, bloating CDX indexes and amplifying storage and replay costs.

Harnessing Diverse Toolsets: Different content often requires different capture technology. How does our deduplication strategy work across different tools?

This presentation will provide insights into recent endeavors undertaken to confront these challenges. Leveraging advancements in tools like Brozzler, Heritrix, and more, we'll explore the tangible solutions and enhancements that have been incorporated to overcome the highlighted obstacles.

WARC-ing Legacy Archived Web Sites

Annabel Walz
Archive of Social Democracy (Friedrich-Ebert-Stiftung), Germany

Our institution started web archiving a small curated set of websites in 1999. Since 2018 we have used crawlers that write the websites to WARC files, but in the years before, several web archiving tools were used that saved the archived websites in different forms. We have now started a project to consolidate our web archive by converting the older websites to WARC. The goal is to be able to present the entire content of the web archive to users via a single access point and to have a common ground for future preservation needs and processes.

The proposed 5-minute talk would present the process of developing a workflow and adapting existing tools for converting the older archived web sites to the WARC format.

The legacy archived websites in our archive have been copied/crawled with (at least) four different software products: Teleport Pro, Offline Explorer, HTTrack and Offline Web Archiv. Each of these has a specific output. Generally, the files that the website consisted of are stored in a folder structure, but the specific structure differs, and in some cases the HTML files were modified in a software-specific way during the archiving process.

There are software tools that allow for the conversion from the output of HTTrack to ARC (httrack2arc) or WARC (httrack2warc) and for conversion of directories of web documents to WARC (warcit). But in all cases the tools have to be adapted and combined with other scripts to enable the conversion for the specific folder structures.
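As a simplified, hedged sketch of the core conversion step that tools like warcit automate, the warcio library can wrap files from a legacy folder structure into WARC 'resource' records; the paths and URL prefix are illustrative assumptions, and a real conversion also needs capture dates, content types and the software-specific fix-ups discussed above:

  import os
  from io import BytesIO
  from warcio.warcwriter import WARCWriter

  SOURCE_DIR = "legacy_site/"                 # assumed folder produced by e.g. HTTrack
  URL_PREFIX = "http://www.example.org/"      # assumed original site root

  with open("legacy_site.warc.gz", "wb") as out:
      writer = WARCWriter(out, gzip=True)
      for root, _dirs, files in os.walk(SOURCE_DIR):
          for name in files:
              path = os.path.join(root, name)
              # Reconstruct the original URL from the folder structure.
              url = URL_PREFIX + os.path.relpath(path, SOURCE_DIR).replace(os.sep, "/")
              with open(path, "rb") as fh:
                  record = writer.create_warc_record(
                      url, "resource", payload=BytesIO(fh.read())
                  )
              writer.write_record(record)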

The talk would outline the starting point and the motivation for our conversion project, then address the identification of possible tools, the problems encountered in using them, the adaptation of the existing tools, the additional development of scripts, and the resulting conversion workflow.

The project will still be ongoing in April 2024, so the talk will not be able to present final results, but focus on the process and interim results.

Describing Collections with Datasheets for Datasets

Emily Maemura1, Helena Byrne2
1University of Illinois, United States of America; 2British Library, United Kingdom

Significant work in web archives scholarship has focused on addressing the description and provenance of collections and their data, yet there is no standard approach to writing and publishing documentation. Outside of the cultural heritage domain, similar work is emerging on describing large datasets, and in particular the challenge of documentation is becoming critical for machine learning. Notably, Gebru et al. (2018, 2021) propose developing “Datasheets for Datasets,” a document answering a standard set of questions about a dataset concerning its creation and use, including: Motivation; Composition; Collection Process; Preprocessing / Cleaning / Labeling; Use; Distribution; and, Maintenance.

This lightning talk reports on the findings of a collaborative project exploring whether the Datasheets for Datasets framework could be applied to inactive UK Web Archive collections published as datasets. A series of card-sorting workshops were held in the UK during Spring 2023, with three workshops focused on information professionals and one workshop focused on researchers. An additional workshop was held at the 2023 IIPC Web Archiving Conference in Hilversum, Netherlands, with web archiving experts from around the world. During a two-hour workshop, the participants were asked to prioritize what information to include in dataset documentation based on an abridged version of the datasheets with 31 questions (Microsoft Research, 2022). Working in pairs, participants discussed each question and assigned a priority ranking using the MoSCoW method, sorting the questions into the categories of Must, Should, Could and Won't have.

Initial findings from these workshops illustrate the range of perspectives on how to prioritize documentation information. For instance, there was only absolute consensus on one question: “Dataset name”. Over eighty percent of the participants believed that it is essential to document “Dataset version number or date” and “Describe any applicable intellectual property (IP) licenses, copyright, fees, terms of use, export controls, or other regulatory restrictions that apply to this dataset or individual data points”. After this, there were more mixed results on what was essential to include in a datasheet describing web archive collections, and discussions of how responsibilities for documentation should be assigned. Based on the findings of the workshops, a datasheet template for UK Web Archive collections is being developed and tested, and additional feedback from the lightning talk presentation would aid in its implementation.

References

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for Datasets. ArXiv:1803.09010 [Cs]. http://arxiv.org/abs/1803.09010

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., III, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Microsoft Research (2022). Aether Data Documentation Template. Retrieved from https://www.microsoft.com/en-us/research/uploads/prod/2022/07/aether-datadoc-082522.pdf

Visualizing the Web History of the Pandemic COVID-19 outbreak in México: Last Stage

Carolina Silva Bretón
National Library of México / Bibliographical Research Institute, Mexico

This paper discusses the challenges faced during the project entitled: Proyecto PAPIIT IT400121 Preservación digital de contenidos publicados en portales web y redes sociales. Del acopio a la difusión de colecciones digitales sobre Covid-19 en México.

This project focused on collecting, archiving and making accessible web pages and tweets related to the COVID-19 pandemic outbreak in México. The leading researchers, Perla Rodriguez (from UNAM) and Joel Blanco (from ENCRyM), decided to use the Webrecorder suite of tools to manually collect hashtags and web pages that they had selected. As a result, they assembled a collection of 31 items (WARC and JSON files).

Based on the goals of the PAPIIT Project, this collection had to be accessible and visible through the digital repository of the Bibliographic and Information Institute of the National University of México (IIBI-UNAM). Therefore, my role as a web designer and digital preservationist from the National Library was to add a Webrecorder plugin to the IIBI's platform. However, we realized that this implementation would not allow us to create the look and functionality that the project aimed for. Thus, my role changed, and I became responsible for researching and implementing the most viable technology for replaying and visualizing WARC and JSON files.

This presentation will describe the decision-making process, the issues regarding interface design, the technology that was selected, the ethical and technical problems in its implementation, as well as the pros and cons of using Archipielago.

Towards A Formal Registry Of Web Archives For Persistent And Sustainable Identification

Eld Zierau1, Jon Tonnessen2, Anders Myrvoll1
1Royal Danish Library; 2National Library of Norway

The aim of this workshop is to come up with a plan and a common understanding for the establishment of a formal registry of web archives, based on community input.

The purpose of such a registry is to make it possible to identify the web archives from which web materials have been used in research, either as part of collections or as specific references to web pages. No matter how a web element is referenced, it needs to be traceable in the future where it was originally archived and where it can be found at that point in the future. Looking at archive URLs, we already see that web archives shift to new Wayback Machines, whereby the URL changes; there are also examples of web archives where the resources have been moved to a new domain, such as the Irish web archive.

Another benefit would be that Persistent Web IDentifiers (PWIDs) could be made automatically resolvable and constructable, but there are most likely many other cases where a formal registry could point to generic patterns for services, e.g. CDX summaries and special service calls.

There have been attempts to make informal registries of web archives, for example within IIPC (e.g. https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives and https://netpreserve.org/about-us/members/). The challenge is that such registries offer no unique way of identifying the actual archive, and they have no formal history track that would make it possible to use them for old references to a web archive. This is bound to be a challenge if we have a 50-year horizon.

The workshop will present what the contents of such a registry could look like, which will be an elaborated version of a table with the following columns (an illustrative example entry follows the list):

  • Web archive identifier
    which must be the persistent identifier for the archive, usually a number such as an ISNI (International Standard Name Identifier, an ISO standard), which does not carry any human-readable description of the archive
  • Start date of the table entry’s active period
    This allows readable identifiers or other information to change over time, while making visible in which period the information was active
  • Optional end date of the table entry’s active period
    Complements the start date when there is more than one period
  • Web archive readable identifier
    which must be unique compared to any other web archive readable identifier, e.g. netarkivet.dk for the Danish web archive, which is actually likely to change at some stage
  • Pattern for web pages in the archive
    This will only be relevant for web archives that have archive URLs based on archival time and archived URLs, like https://web.archive.org/web/<numeric archival date>/<archived URL> for the Internet Archive
  • “Other things”
    such as a link to the IIPC description, e.g. https://netpreserve.org/about-us/members/internet-archive/
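Purely as an illustration of what a single row of such a registry might contain, the entry below reuses the Internet Archive URL pattern and IIPC link quoted above, while the identifier and dates are invented placeholders:

  # Hypothetical registry entry; the identifier and dates are placeholders.
  registry_entry = {
      "web_archive_identifier": "0000 0000 0000 0000",  # ISNI-style number (placeholder)
      "active_period_start": "1996-10-01",
      "active_period_end": None,                         # entry still active
      "readable_identifier": "archive.org",
      "page_url_pattern": "https://web.archive.org/web/<numeric archival date>/<archived URL>",
      "other": "https://netpreserve.org/about-us/members/internet-archive/",
  }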

There are many subjects to discuss concerning such a registry:

  1. What else could the registry be relevant for?
  2. What else could be relevant to have in such a registry?
  3. How to verify a new or modified registrant?
  4. Should it be managed by, for instance, IANA (the Internet Assigned Numbers Authority)?
  5. How is a web archive identifier assigned?
  6. Who is responsible for checking a web archive readable identifier?
  7. Should there be possibilities to have more than one Web archive readable identifier?
  8. Should there be possibilities to have overlapping time periods?

The workshop will discuss these subjects in brainstorms and in group work setups.

It is probably too ambitious to get a formal registry as a first step, and in order to make it workable, it needs to be inclusive of all web archives; and in order to make tools like an automatic PWID resolver work, it should cover as many web archives as possible. This raises the next set of questions:

  1. Could it start by being managed within IIPC context? Other contexts?
  2. How to make it possible that all web archives can be registered?
  3. What about cases like the Internet Archive, where materials can be referenced either in archive.org or in archive-it.org?
  4. Would it be ok to start by an informal registration with known web archives and use their domain (e.g. netarkivet.dk) as the Web archive readable identifier?
  5. Other ideas?

These questions will be addressed in plenary discussions.

To sum up, the program will look something like:

  • Presentation of goal, purpose and ideas for look of a formal registry
  • Brainstorm session with discussion about questions 1 and 2 above
  • Group work about questions 3-8 followed by feedback and discussion
  • Plenary discussion of the second set of questions (1-5 above)

Digital Storytelling: Creating An Exhibition Of Web-born And Mobile Narratives

Ian Cooke, Giulia Carla Rossi
British Library, United Kingdom

The exhibition Digital Storytelling: Innovations Beyond the Page ran at the British Library from June to October 2023. The exhibition focused on how new technologies provide opportunities for writers to tell stories in new ways, and how these new stories emphasise the role of the reader within the narrative. Among the 11 works on display there were examples of writing for the web, which had been included in the UK Web Archive. The exhibition drew on research conducted at the British Library on collecting ‘emerging formats’, and was also intended to support the Library’s understanding of access and collection management of innovative digital publications.

This presentation introduces the exhibition, describes some of the choices and challenges in developing a physical installation of online works, and reflects on early learning from the exhibition. Although archived web copies of works weren’t used in the exhibition, this topic is relevant to the wider conference theme of ‘Web Archives in Context’, through an assessment of the requirements of the physical space for display and interaction, and also through the development of interpretive work to support the display of web-born publications. This included the creation of ‘playthrough’ films to demonstrate user interaction with the publications, which are intended to form part of the collecting of contextual information to support access to innovative digital publications over the long term.

Partners