Full-text Search for Web Archives

Andrew Jackson1, Anders Klindt Myrvoll2, Toke Eskildsen2, Thomas Egense2, Ben O'Brien3

1UK Web Archive, British Library, 2The Danish Web Archive, Netarkivet at The Royal Danish Library, 3National Library of New Zealand

This session will focus on what full-text search can bring to web archives, and the challenges involved. We will present the perspectives of three different institutions, based on their experiences with tools like the Web Archive Discovery toolkit and SolrWayback. The Q&A will then be opened up to hear from a wider range of institutions, covering a broader range of tools and experiences.

#1: The State of Full-Text Search at the UK Web Archive

Andrew Jackson

The UK Web Archive has been developing the Web Archive Discovery toolkit and has experimented with a number of different user interfaces for the resulting Solr indexes, including SolrWayback. This presentation will outline our experience so far, and the challenges that remain when trying to provide full-text search for very large collections.

#2: SolrWayback at the Royal Danish Library - Key Findings, Experiences and Future Aspects

Anders Klindt Myrvoll, Toke Eskildsen, Thomas Egense

The Royal Danish Library has been developing SolrWayback for search, discovery and playback of our web archive, using UKWA's Web Archive Discovery toolkit to build the Solr index. Users include internal web archive curators and technicians as well as internal and external researchers using SolrWayback to get the most out of our web archive. This presentation will highlight key findings and experiences with SolrWayback since its release in 2021 and look at some future aspects of this ecosystem, such as indexing and playback of Twitter content.

#3: Searching for a Full-Text Pilot

Ben O'Brien

This presentation will show the first steps of our journey to providing full-text search access to the NZ Web Archive. The National Library of New Zealand has a comprehensive history of web archiving and has long desired to provide enhanced levels of access to the archive, particularly for researchers. While our web archiving programme includes curated content and domain crawls, the size of the .nz domain space is relatively small compared to other countries, which we hope will reduce some of the initial financial and technical barriers to implementing full-text search. In developing our full-text search pilot, we are working closely with Victoria University of Wellington to evaluate the pilot's feasibility in providing a successful user experience for searching web archive material. The Library has included in the pilot a harvested copy of a well-used electronic text collection previously hosted by the University. This offers a great opportunity to test our implementation and collaborate with users who have a dedicated research focus. The journey so far has included an options analysis of the tools used within the web archiving community for full-text search, as well as of the benefits and considerations of using public and private cloud infrastructure. This has led us to explore the use of the Web Archive Discovery toolkit and front-end interfaces such as Warclight and SolrWayback. Our intention is to implement the same lightweight software stack within AWS and our organisational private cloud provider and test the feasibility of both for the pilot, while also exploring how each environment and solution would scale to deliver our entire web archive in the future.

BESOCIAL: Social Media Archiving at KBR in Belgium

Fien Messens1, Peter Mechant2, Lise-Anne Denis3, Eva Rolin4, Pieter Heyvaert5, Patrick Watrin4, Julie M. Birkholz1,6 and Friedel Geeraert1

1KBR (Royal Library of Belgium), 2MICT, Ghent University, 3CRIDS, University of Namur, 4CENTAL, UCLouvain, 5IDLAB, Ghent University, 6GhentCDH, Ghent University

The purpose of this Q&A session is to provide insight into the social media archiving activities at KBR, the Royal Library of Belgium. Together with the Universities of Namur, Ghent and Louvain, KBR is piloting social media archiving in the context of the BESOCIAL research project. The presentations included in this session will focus on the selection policy, the access platform that is being developed, the legal analysis of the European copyright legislation and a concrete case study based on archived social media around the Gorman-Rijneveld translation controversy.

#1: What to Select and How to Harvest? The Operational Side of Social Media Archiving

Fien Messens & Pieter Heyvaert

The amount of digital data we produce every day is truly mind-boggling, and content produced on social media fuels this data deluge. However, social media content is very ephemeral and is often not archived by universities or heritage institutions, creating serious challenges for (digital) scholars who want to use archived social media content as a data resource. BESOCIAL, a cross-university collaborative and interdisciplinary project coordinated by KBR, tries to close this gap by developing a strategy for archiving and preserving social media in Belgium. Such a research project also raises key questions: how do you create a corpus, and how do you make it as transparent and representative as possible? At the IIPC Web Archiving Conference we would like to discuss the selection strategy of the BESOCIAL project, the harvest process, and how these steps are made operational within the national library of Belgium. Within the BESOCIAL project, a combination of top-down and bottom-up approaches to the selection of relevant social media content was implemented. The top-down approach outlines the selection criteria that define the scope of the social media collection related to Belgian cultural heritage. The bottom-up approach consisted of a crowdsourcing campaign asking members of the Belgian public to recommend social media handles and hashtags to include in the collection. Once the relevant content was selected, we used existing tools to harvest it periodically: Social Feed Manager for Twitter and Instaloader for Instagram. At the conference we would like to discuss the advantages and disadvantages of the different tools and what data we have harvested so far. For example, Instagram does not offer an API that we can use to harvest the data, which complicates the whole process. The data storage space needed per social media platform depends heavily on the type of content on these platforms.

#2: Archiving Belgian Social Media: How to Obtain a Representative Corpus and How to Represent It Via an Interface?

Eva Rolin

For the BESOCIAL project, we have set up a platform for archiving social network content, which raises a number of technical questions, including the following: how do we obtain a representative corpus of Belgian users, and what type of access should we offer for this type of collection? The first question is crucial since our goal is to archive only Belgian content, and it is far from trivial since users' locations are very often missing or made up. To address it, we have set up an algorithm that allows us to extend a set of seed accounts and filter them according to the users' origin, regardless of their stated location. As for the second question, the access interface is based on a development led by the MIIL (Media Innovation & Intelligibility Lab), with whom we are now partners. This interface allows navigation through the archived content using filters and simple or complex searches. It also allows the data to be visualised using graphs.
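The seed-expansion step can be pictured as a breadth-first traversal over follow relations in which only accounts accepted by an origin classifier are kept and expanded further. The sketch below is a toy illustration of that idea, not the BESOCIAL code; the follow graph, the account names, and the `is_relevant` predicate are invented stand-ins.

```python
from collections import deque

def expand_seeds(seeds, follows, is_relevant, max_depth=2):
    """Grow a seed set along the 'follows' graph, keeping and
    expanding only accounts accepted by the relevance predicate."""
    kept = set()
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        account, depth = frontier.popleft()
        if not is_relevant(account):
            continue  # filtered out: do not keep it, do not expand it
        kept.add(account)
        if depth < max_depth:
            for nxt in follows.get(account, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return kept

# Toy data: who follows whom, plus a stand-in origin classifier.
follows = {"@kbr": ["@ghent_hist", "@paris_cafe"],
           "@ghent_hist": ["@brussels_art"]}
belgian = {"@kbr", "@ghent_hist", "@brussels_art"}
result = expand_seeds(["@kbr"], follows, lambda a: a in belgian)
# → {"@kbr", "@ghent_hist", "@brussels_art"}; "@paris_cafe" is pruned.
```

Pruning at the filtered account (rather than only at output time) is what keeps the expansion from drifting toward non-Belgian regions of the graph.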

#3: The European Copyright Law as an Obstacle To Social Media Archiving

Lise-Anne Denis

Within the BESOCIAL project, the legal aspects of social media archiving (SMA) were analysed in order to ensure compliance with both European and national law. Among these legal aspects, we studied copyright law, on which we will focus in this presentation as it currently represents an obstacle to social media archiving. Social media content, including pictures and texts, is very often protected by copyright in European law, as copyright only requires originality in order to be automatically granted to the author. Copyright entails that the authors (or whoever owns the copyright) have to give their authorisation for the reproduction of the content (such as its collection for the archives) or its communication to the public. Although some exceptions exist and cover cases where the author's authorisation is not needed, including for heritage preservation, a principle in copyright law imposes a strict interpretation of these exceptions. This means that they have to match a situation perfectly, word for word, in order to apply. That is where the issue lies when it comes to archiving born-digital content: no exception has been adapted to take such archiving activities into account, and they cannot be applied in this case. This is a major obstacle for SMA, as it requires obtaining the authors' authorisations before being able to archive protected social media content. Such a requirement is obviously unrealistic if a mass collection of content is envisaged. Even though the European directives regulating copyright sometimes specifically prohibit the application of the exceptions to digital content, this has not prevented some EU Member States from extending some exceptions to the archiving of digital content in their national law. It would however be preferable for this matter to be handled at the European level, in order to formally take into account social media archiving as a growing and important practice.
In this presentation, we will present how the current European copyright legal regime applies to SMA and show how it should evolve in order to be better adapted to the reality of archiving practices.

#4: Key Actors, Events and Discourses in the Gorman-Rijneveld Translation Controversy on Twitter

Peter Mechant

On 20 January 2021, Amanda Gorman performed her now famous poem 'The Hill We Climb' (2021) at the inauguration of Joe Biden. Almost a month later, publisher Meulenhoff announced that the Dutch writer Marieke Lucas Rijneveld had been selected to translate Gorman's poem. This decision gave rise to criticism by activists and quickly became the focal point of a deluge of columns in mainstream media and heated discussions on Twitter. In this session we investigate to what extent a Twitter dataset about the Gorman-Rijneveld translation controversy can be 're-used' to determine and outline the main actors, events and different discourses in this translation controversy. The Gorman-Rijneveld controversy is then used to demonstrate how a social media data collection can support digital humanities scholars in researching a societal event. To identify key actors and groups, a network was generated and network analysis was conducted using the NetworkX Python module. We visualised the network using Gephi. In order to grasp key moments and to understand the chronology of events, we created Tableau visualisations. To identify different discourses we used topic modeling by means of the DARIAH-DE Topics Explorer. This enabled us to visualise topic-document distributions as a heatmap, identifying different emphases in discourse among the groups in the network. We unearthed a dynamic of events in which an activist publishes an opinion piece with a provocative title, which generates outrage. Several key actors then tweet about the events, disseminating a certain framing of them, namely 'Rijneveld as the victim of reverse racism'. Next, key actors are heavily retweeted by people in their own group, which demonstrates the in-group mentality on Twitter. Discourses about the event ranged from outrage about alleged racism, to disappointment and a more nuanced debate about the choice of a suitable translator, to criticism by activists and people of color.
From a broader perspective, the translation controversy also demonstrates how Twitter serves as a secondary gatekeeping channel, reinforcing and echoing voices in mainstream media.
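The key-actor step of such a workflow can be shown in miniature: build a directed retweet network and rank accounts by how often they are retweeted (their in-degree). This stdlib-only sketch stands in for the actual NetworkX/Gephi analysis; the account names and retweet records are invented.

```python
from collections import Counter

# Each record: (retweeter, original_author) — invented toy data.
retweets = [
    ("@fan1", "@columnist"), ("@fan2", "@columnist"),
    ("@fan3", "@columnist"), ("@fan1", "@activist"),
]

# In-degree in the retweet network approximates how widely
# an account's framing of events is amplified.
in_degree = Counter(author for _, author in retweets)
key_actors = [account for account, _ in in_degree.most_common(2)]
# → ["@columnist", "@activist"]
```

In NetworkX the same ranking would come from `DiGraph.in_degree` over edges pointing from retweeter to original author; grouping retweeters by which key actor they amplify is what surfaces the in-group structure described above.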

Teaching the Whys and Hows of Creating and Using Web Archives

Lauren Baker1, Claire Newing2, Maria Ryan3, Tim Ribaric4, Ingeborg Rudomino5, Karolina Holub5, Zhiwu Xie6, Kirsty Fife7

1Library of Congress, 2The National Archives (UK), 3National Library of Ireland, 4Brock University, 5National and University Library in Zagreb, 6Virginia Tech Libraries, 7Manchester Metropolitan University

To document our times and understand our past requires skills to create and use web archives. Through training programs, we aim for more people to know about web archives, know how to preserve the web, and know how to use web archives. This panel will explore how to translate the many reasons why we web archive into the practicalities of teaching web archiving. Presenters will share their experiences to engage students, archives and library professionals, and community groups and will address the conference themes: futures past – exploring the possibilities for web archiving training to build greater web archive literacy; outreach – demonstrating how to share web archiving skills with learners in various contexts; and research – how to approach teaching computational analysis of web archives. This session offers various perspectives on teaching web archiving in order to inspire newcomers to get started and to encourage experienced practitioners to share their knowledge.

#1: Leveraging Computational Notebooks to Teach Web Archives to a Crowd of Non-Programmers

Tim Ribaric

Web archives will eventually need to be one of the primary sources that people draw on when performing research in our digital age. Information is created and destroyed on the web as a matter of course, and this brief lifespan can only be captured via the process of web archiving. We've seen many tools developed over the years that allow us to capture these ephemeral expressions; the Wayback Machine and WARC files quickly come to mind. However, a new challenge emerges: how do we now use these new data sources? How can we make meaningful investigations into these large collections of data? Enter computational notebooks hosted on the Google Colab environment. These powerful tools provide an easy-to-use environment where both code and analysis can be presented in a single, accessible web page. The consequence is that researchers are able to perform sophisticated analysis without an extensive knowledge of writing code. The challenge, however, is how to encourage learners to engage with the material and to experiment if they are unfamiliar or uncomfortable with programming. This session will investigate this challenge.
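A typical notebook cell for such an audience stays small enough to read end to end: for example, counting which hosts dominate a derivative CSV of archived URLs. This is a hypothetical illustration of the kind of cell a non-programmer might run, not material from the actual workshops; the CSV content and column names are invented.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit

# Stand-in for a derivative CSV exported from a web archive collection;
# in a real notebook this would be loaded from a file or URL.
derivative_csv = io.StringIO(
    "url,timestamp\n"
    "https://example.org/a,2020\n"
    "https://example.org/b,2021\n"
    "https://news.example.com/x,2021\n"
)

# One readable line of analysis: tally captures per hostname.
hosts = Counter(urlsplit(row["url"]).hostname
                for row in csv.DictReader(derivative_csv))
top_host = hosts.most_common(1)[0]
# → ("example.org", 2)
```

Because the cell's output appears directly beneath it in the notebook, learners see cause and effect immediately, which lowers the barrier to experimenting with their own changes.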

#2: Training Activities in the Croatian Web Archive

Karolina Holub & Ingeborg Rudomino

The National and University Library in Zagreb (NSK) began archiving the Croatian web in 2004 when, in collaboration with the University of Zagreb University Computing Centre (SRCE), the Croatian Web Archive (HAW) was established. From the very beginning of web archiving in Croatia, HAW's team has been part of the National Centre for Continuing Professional Development of Librarians (CSSU), a programme of lifelong learning and continuous professional development for librarians and information professionals. The courses are intended for librarians from all types of libraries, assistant librarians, students of library and information science, and the wider GLAM community. The introductory web archiving course covers several general topics, such as the tools for archiving the Croatian web, the purpose of web archiving, an introduction to and description of the current workflow (acquisition, cataloguing, archiving and enabling access), and guidance on searching the archived content. The course uses various teaching methods, such as lectures, demonstrations and exercises, for target groups interested in web resources and web archiving in general. Through the courses, participants acquire knowledge of the basics of cataloguing web resources, identification of web resources, methods of web archiving, selection of open-source tools for archiving, opportunities to search and reference these types of resources, creating thematic collections, etc. Recent activities are aimed at the wider community: short trainings for public librarians to help them create their own local history web collections. Alongside that, training of NSK subject specialists has also begun in order to include them in web archiving activities.

#3: Continuing Education to Advance Web Archiving (CEDWARC)

Zhiwu Xie

Supported by the IMLS LB21 program from 2018 to 2021, the CEDWARC project developed a continuing education curriculum to teach library and archive professionals advanced web archiving and analysis techniques through an in-person workshop and a self-paced online workshop. The goals of this project are to 1) train library and archive professionals to effectively use innovative web archiving tools to answer research questions and, as a result, 2) enable new web archiving services based on these tools. The curriculum consists of six modules. Building on the Web Archiving Fundamentals module, participants are taken on a problem-solving tour through five additional modules: Storytelling, Social Feed Manager, ArchiveSpark, Archives Unleashed, and Events Archiving. Facilitators of the workshop include web archiving experts and tool developers from Virginia Tech, Los Alamos National Lab, Old Dominion University, George Washington University, the Internet Archive, and the University of Waterloo. The project started with a series of planning meetings, including an in-person meeting at the 2018 ACM/IEEE Joint Conference on Digital Libraries (JCDL) in Fort Worth, TX, where the structure and content of the curriculum were finalized. The curriculum went through two rounds of dry runs: one during JCDL 2019 in the UIUC library, attended by 22 library staff and iSchool students, and a second in the fall semester of 2019 as part of Virginia Tech's Computer Science course CS6604 "Digital Libraries", with 26 students participating. The in-person workshop was offered on Oct 28, 2019 at the George Washington University library with 39 attendees, 13 of whom were supported by travel grants. The feedback indicated that following the command-line instructions of the lab session in real time can be challenging. This prompted us to move all modules to a pre-recorded format for learners to study at their own pace; the time limit on these modules was also eliminated.
The online portion of the workshop was offered in Oct 2021 and had 111 registrants from around the world, many of whom sampled at least several modules. The training program videos and slides are now available online at

#4: Supporting Grassroots Communities in Developing Web Archiving Skills

Kirsty Fife

This presentation will explore the experience of disseminating skills and knowledge about web archiving via a series of online workshops delivered during 2020. Workshops were targeted at activist and grassroots organisations and operated on a sliding scale basis, aiming to create alternatives to financially inaccessible professional training schemes.

Video/Stream Archiving

Andreas Lenander Aegidius1, Anders Klindt Myrvoll2, Sawood Alam3, Bill O'Connor3, Corentin Barreau3, Kenji Nagahashi3, Vangelis Banos3, Karim Ratib3, Owen Lampe3, Mark Graham3

1Department for Digital Cultural Heritage, Digital Kulturarv at The Royal Danish Library, 2The Danish Web Archive, Netarkivet at The Royal Danish Library, 3Wayback Machine, Internet Archive

This Q&A session presents two talks on archiving videos and streaming data. The first talk is based on a survey of web archiving of streaming services in the Danish Web Archive. It reviews the prevalence of archived elements and metadata from prominent streaming services, and argues for an urgent need to optimize stream archiving. The second talk is a case study of video archiving practices at the Wayback Machine of the Internet Archive. It describes the architectural pipeline of the video archiving and playback process and the software used. Finally, it acknowledges the need for standardization and interoperability in the video archiving and playback space across web archives.

#1: Collection of Streaming Content

Andreas Lenander Aegidius & Anders Klindt Myrvoll

Today, the media companies that rely on streaming services are constantly publishing new versions of their software, thereby altering the user's experience. This is part of a highly competitive market, which is evolving as quickly as, if not faster than, the web and app ecologies that support it. The Internet Archive, as well as national libraries in countries with mandatory deposit legislation, document the cultural industries online and receive or collect their published content. However, no national web archive, nor the Internet Archive, holds a copy of the Netflix catalogue for research purposes, nor of the websites and many apps that frame users' access points to the Netflix catalogue across different devices. This paper presents findings from a survey of web archiving of streaming services, in the form of published content, metadata, and user interfaces as collected in the Danish web archive, 2005-2021. I apply a broad yet flexible definition of streaming (Spilker & Colbjørnsen, 2021; Herbert et al., 2019), which states that streaming has various dimensions and is best understood comparatively. Utilizing an open explorative approach proposed by Fage-Butler et al. (2022), I map the evolution of the term 'streaming' (incl. semantic variants: stream*) in the Danish Web Archive. First, this will show when and how often the term 'streaming' occurred on the Danish web. Second, it will show which actors (websites) on the Danish web used the term 'streaming' the most, and how their use has evolved. Third, I review to what extent 50 prominent streaming services in Denmark have been collected, measuring the number of elements archived per streaming service: websites, web players, the published media content, its metadata, and device-specific apps. The existing collection and documentation of streaming services and their elements are lacking.
These findings point to an urgent need to optimize how we document and collect streaming and its elements in order to research the impact of streaming across national and international web domains. This paper further unlocks the archived web as a source with its own particular features and as empirical data (cf. Brügger et al., 2020).
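Mapping the evolution of 'streaming' and its variants (stream*) amounts to a per-year term count over extracted text. A minimal sketch of that counting step, assuming the archive's text has already been extracted into (year, text) pairs; the example pages are invented:

```python
import re
from collections import Counter

# stream* : the word 'stream' plus any suffix, case-insensitive,
# so Danish compounds like 'streamingtjenester' also match.
STREAM = re.compile(r"\bstream\w*", re.IGNORECASE)

# Invented stand-in for (year, extracted text) records from the archive.
pages = [
    (2006, "radio via streaming"),
    (2015, "Streaming og streamingtjenester vokser"),
    (2015, "ingen stream i dag"),
]

hits_per_year = Counter()
for year, text in pages:
    hits_per_year[year] += len(STREAM.findall(text))
# → {2006: 1, 2015: 3}
```

Grouping the same matches by source website instead of by year gives the second view described above: which actors used the term most, and how their use evolved.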

#2: Video Archiving and Playback in the Wayback Machine

Sawood Alam, Bill O'Connor, Corentin Barreau, Kenji Nagahashi, Vangelis Banos, Karim Ratib, Owen Lampe, Mark Graham

At the Internet Archive (IA) we collect static and dynamic lists of seeds from various sources (such as Save Page Now, the Wikipedia EventStream, and Cloudflare) for archiving. Some of these seeds include web pages with videos on them. Those URLs are curated based on certain criteria to identify potential videos that should be archived or excluded. Candidate video page URLs for archiving are placed in a queue (currently using Kafka) to be consumed by a separate process. We maintain a persistent database of videos we have already archived, which is used both for status tracking and as a seen-check system to avoid duplicate downloads of large media files that usually do not change. We use youtube-dl (or one of its forks) to download videos and their metadata. We archive the container HTML page, associated video metadata, any transcriptions, thumbnails, and at least one of the many video files with different resolutions and formats. These pieces are stored in separate WARC records (some with "response" type and others as "metadata"). Some popular video streaming services do not have static links to embedded video files, which makes it difficult to identify and serve the video files corresponding to their container HTML pages on archival replay. To glue the related pieces together for replay we are currently using a key-value store, but we are exploring ways to do without this additional index. We use a custom video player and perform the necessary rewriting in the container HTML page for a more reliable video playback experience. We create a daily summary of the metadata of videos that we have archived and load it into a custom-built Video Archiving Insights dashboard to identify any issues or biases, which serves as a feedback loop for quality assurance and for enhancing our curation criteria and archiving strategies. We are always looking for ways to improve a system that works at scale, as well as means to interoperate.
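The seen-check described above can be pictured as a persistent store keyed by video URL, where attempting to enqueue a URL a second time is a no-op. The sketch below is an illustrative stand-in using stdlib sqlite3, not the Wayback Machine's actual tracking database; the table layout and URL are invented.

```python
import sqlite3

class SeenCheck:
    """Persistent record of video URLs already queued or archived."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, status TEXT)")

    def should_archive(self, url):
        """Return True (and record the URL) only the first time it is seen."""
        try:
            with self.db:  # commits on success, rolls back on error
                self.db.execute(
                    "INSERT INTO seen (url, status) VALUES (?, 'queued')", (url,))
            return True
        except sqlite3.IntegrityError:
            return False  # primary-key conflict: already queued or archived

check = SeenCheck()
first = check.should_archive("https://example.com/watch?v=abc")   # → True
second = check.should_archive("https://example.com/watch?v=abc")  # → False
```

Making the check and the recording a single atomic insert (rather than a lookup followed by an insert) is what keeps concurrent consumers of the queue from both downloading the same large media file.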

Design, Build, Use: Building a Computational Research Platform for Web Archives

Ian Milligan1, Jefferson Bailey2, Nick Ruest3, Samantha Fritz1, Valérie Schafer4 and Frédéric Clavert4

1University of Waterloo, 2Internet Archive, 3York University, 4C2DH, University of Luxembourg

Since 2017, the Archives Unleashed Project has developed web archive data analysis tools and platforms. In 2020, Archives Unleashed joined forces with the Internet Archive to develop the Archives Research Compute Hub (ARCH), an accessible entry point for interacting with and conducting analysis on web archive collections. ARCH transforms collections into interpretable derivative datasets that can fit into existing scholarly workflows. In the past, scholarship with web archives has been limited, in part because the collections are large and complex, and because existing tools for working with large sets of data do not map well to web archives or are difficult to use for non-technical users. Also, few technical projects have been intentionally designed with an approach that includes useful feedback and engagement between developers, archivists, and researchers. Our project accordingly sponsors researchers to use our tools and provide feedback that informs an iterative development process, and scholars are encouraged to publish scholarship emerging from this process in order to inspire confidence in others to use web and other large digital archives in their research. Our panel, "Design, Build, Use", takes attendees through the process underpinning the ARCH interface: from its inception, through its construction, to its use by a sponsored researcher.

#1: “Design: A Practical History of Supporting Computational Research”

Jefferson Bailey

This presentation will provide background on past and current efforts within the Internet Archive (IA) to support computational research use of its 40+ petabyte web archive, approaching this topic through a description of conceptual, programmatic, and technical challenges and of how IA, often working in collaboration with users and other technical partners, has pursued service and tool development in this area. The talk will outline different service models in use within the community for supporting computational research services for historical data, program evolutions, support structures, and engineering work, and will detail these areas through discussion of numerous specific projects. Points of discussion include: researcher support scenarios; program design and sustainability; data limitations, affordances, and complexities; extraction, derivation, and access methods; infrastructure and tools; and the interplay of computational research services with collection development and enhancement. The presentation will provide a history of supporting computational research services, showing how this history has informed various program designs, how it specifically informed the development of ARCH, and what it means for the strategic development of ARCH in the future.

#2: “Build: The Archives Research Compute Hub from Idea to Platform”

Ian Milligan, Nick Ruest, Samantha Fritz

This presentation introduces the Archives Research Compute Hub (ARCH) interface in detail and explains the building process, including user experience testing, technical decisions, and continual improvement. ARCH grew out of the earlier "Archives Unleashed Cloud", a standalone system that demonstrated how a web browser interface could power the underlying Apache Spark-based Archives Unleashed Toolkit. Inspired by this foundational work, ARCH will become an interface integrated into the Internet Archive's Archive-It service. What is ARCH? ARCH allows users to delve into the rich data within web archival collections for further research. Users can generate and download over a dozen datasets within the interface, including domain frequency statistics, hyperlink network graphs, extracted full text, and metadata about binary objects within a collection. ARCH also provides several in-browser visualizations that offer a glimpse into collection content. Because it is located in the Internet Archive data center, ARCH has quick access to the petabytes of content collected there. The design process for ARCH has involved a variety of interconnected stages, from sketching a wireframe, to connecting back-end processes with a user interface design, to conducting multi-staged user testing to continually assess user sentiment and impact alongside functionality and interface improvements. In our presentation, we walk the audience through the process of building ARCH: from conducting a needs assessment and sketching and testing mock interfaces, to the integration of the platform into a production environment. We will also share lessons learned and cautionary notes for others seeking to build analytics platforms.
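Conceptually, the hyperlink-network derivative mentioned above aggregates page-to-page links into weighted domain-to-domain edges. A toy stdlib sketch of that aggregation (the real derivatives are produced at scale by the Spark-based Archives Unleashed Toolkit; the link pairs here are invented):

```python
from collections import Counter
from urllib.parse import urlsplit

# Invented (source page, link target) pairs extracted from a collection.
links = [
    ("https://a.org/p1", "https://b.org/x"),
    ("https://a.org/p2", "https://b.org/y"),
    ("https://b.org/x", "https://a.org/p1"),
]

def domain(url):
    return urlsplit(url).hostname

# Domain-level hyperlink graph: (src_domain, dst_domain) -> edge weight.
edges = Counter((domain(src), domain(dst)) for src, dst in links)
# → {("a.org", "b.org"): 2, ("b.org", "a.org"): 1}
```

The resulting weighted edge list is small enough to download and open in a desktop tool such as Gephi, which is the point of derivative datasets: they fit into existing scholarly workflows even when the source collection does not.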

#3: “Use”: When Personas Become Real Users…

Valérie Schafer & Frédéric Clavert

Building upon our experience with ARCH, our study of the IIPC Novel Coronavirus collection, and the first months of research we conducted as a cohort team in the Archives Unleashed Project, we will provide feedback related to users' needs and achievements. In his paper "You Shouldn't Need to Be a Web Historian to Use Web Archives: Lowering Barriers to Access Through Community and Infrastructure" (WARCnet paper, Aarhus, 2020), Ian Milligan distinguished three personas: a computational humanist, a digital humanist, and a conventional historian. As a heterogeneous team, mirroring in some ways the personas Milligan distinguished, we will underline the successes and failures we experienced, the technical layers and levels we unfolded, our experience of collective work, which also needs to take interdisciplinarity and heterogeneity (of technical skills, interests, availability, digital literacy) into account, the value of mentorship, and our iterative process with data and research questions. Finally, we will briefly discuss the many pros and few cons of lowering barriers to access to web archives (e.g. how do we make access as easy as possible without hiding the complexity of web archives?).

Serving Researchers With Public Web Archive Datasets in the Cloud

Mark Phillips1, Sawood Alam2, Sebastian Nagel3, Benjamin Lee4 and Trevor Owens5

1University of North Texas Libraries, 2Internet Archive, 3Common Crawl, 4University of Washington, Computer Science and Engineering, 5Library of Congress

This Q&A Session presents three projects that have leveraged cloud-based storage and compute resources for working with web archives. The presenters in this session are at different points in the process from active production using cloud storage and compute for crawling, processing, and hosting of data (Common Crawl), to active work moving data into the cloud for large-scale research access (End of Term), and finally researchers directly interacting with extracted or derivative datasets from web archives that leverage hosted compute for data pipelines. This session will present this series of activities with the main goal of serving researcher needs with datasets hosted in the cloud.

#1: Hosting the End of Term Web Archive Data in the Cloud

Mark Phillips & Sawood Alam

The End of Term (EOT) Web Archive is a collaboration among member institutions across the United States who have come together every four years since 2008 to complete a large-scale crawl of the .gov domain, documenting the transition in the Executive Branch of the Federal Government. In years when a presidential transition did not occur, these crawls served as systematic crawls of the .gov domain, forming what has become a longitudinal dataset. In 2021 the EOT team began working to provide easier computational access to the web archive by hosting a copy of the WARC files and derivative WAT, WET, and CDXJ files in the Amazon S3 storage service as part of Amazon’s Open Data Sponsorship Program. In addition to these formats common in the web archiving community, the EOT team modeled its work on the structure and layout of the Common Crawl datasets, including their use of the columnar storage format Parquet to represent CDX data in a way that enables access with query languages like SQL. At the completion of this effort we expect to have moved over 500TB of primary WARC content and derivative formats into the cloud. By adopting formats, structures, and methods developed by Common Crawl, we hope to leverage an existing community of researchers, along with their tools and approaches, to further reuse of these datasets. This presentation will discuss the decision to host the web content in AWS and the layout used to organize the crawl data into 2008, 2012, 2016, and 2020 datasets, further grouped by the original crawling institution. Additionally, we will discuss the tools used for this work and show some examples of the Parquet format. Our hope is that this work can serve as a model for other web archives interested in hosting their data in the cloud for greater access and reuse.
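
To give a sense of what the columnar CDX layout enables, here is a small sketch. The column names and values below are illustrative (loosely modeled on Common Crawl's columnar index), and SQLite stands in for the Parquet-backed SQL engines (e.g. Amazon Athena or DuckDB) one would actually query against the hosted data:

```python
import sqlite3

# Synthetic CDX-like rows; in practice these columns live in Parquet files
# queried with engines such as Amazon Athena or DuckDB. All column names
# and values here are illustrative, not the EOT dataset's actual schema.
rows = [
    ("gov,nasa)/", "https://www.nasa.gov/", "20081115010203", "text/html", 200, "EOT-2008-001.warc.gz"),
    ("gov,usda)/", "https://www.usda.gov/", "20081116020304", "text/html", 200, "EOT-2008-002.warc.gz"),
    ("gov,nasa)/robots.txt", "https://www.nasa.gov/robots.txt", "20201115010500", "text/plain", 200, "EOT-2020-001.warc.gz"),
    ("gov,nasa)/", "https://www.nasa.gov/", "20201115010203", "text/html", 200, "EOT-2020-001.warc.gz"),
]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cdx (
    url_surtkey TEXT, url TEXT, fetch_time TEXT,
    mime TEXT, status INTEGER, warc_filename TEXT)""")
con.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?, ?, ?)", rows)

# Example: count successful HTML captures per crawl year
query = """SELECT substr(fetch_time, 1, 4) AS year, COUNT(*) AS captures
           FROM cdx
           WHERE status = 200 AND mime = 'text/html'
           GROUP BY year ORDER BY year"""
for year, captures in con.execute(query):
    print(year, captures)
```

The same aggregation, pointed at the real Parquet files in S3, is the kind of query the columnar format makes cheap at dataset scale.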

#2: Common Crawl – Experiences From 10 Years in the Cloud

Sebastian Nagel

For ten years the Common Crawl dataset has been hosted as part of Amazon Web Services’ Open Data Sponsorships program. The AWS cloud is not just a hosting platform: using cloud services and computing resources, it is a matter of minutes or hours for users of web archives and derivative data to launch their processing workflows. This is significant for our mission to enable research, education, and innovation based on web data. This presentation gives an overview of how the Common Crawl web data is used in and outside the cloud. Starting with a short outline of data collection, provided data formats, and supporting resources, we'll look at how users actually work with the data: covered research topics; derived datasets, tutorials, code, and examples produced by the community; the popularity of datasets and formats; processing tools and platforms; and the challenges of managing a user community that includes academia and commercial companies, data scientists and programmers with different levels of experience.

#3: PDF Work with LOC Web Archives

Benjamin Lee & Trevor Owens

Official government publications are key sources for understanding the history of societies. Web publishing has fundamentally changed the scale and processes by which governments produce and disseminate information. Notably, web archiving programs have captured massive troves of government publications. For example, hundreds of millions of unique U.S. Government documents posted to the web in PDF form have been archived by libraries to date. Yet, these PDFs remain largely underutilized and understudied in part due to the challenges surrounding the development of scalable pipelines for searching and analyzing them. This presentation describes an ongoing research effort to utilize a Library of Congress dataset of 1,000 government PDFs in order to offer initial approaches for searching and analyzing these PDFs at scale. In addition to demonstrating the utility of PDF metadata, this work offers computationally-efficient machine learning approaches to search and discovery that utilize the textual and visual features of the PDFs as well. In this presentation, we will detail how these methods can be operationalized at scale in order to support systems for navigating millions of PDFs. We will conclude by describing how novel methods from interactive machine learning such as open faceted search can be utilized in order to empower end-users to dynamically navigate web archives according to topics and concepts that interest them most.

Advancing Quality Assurance for Web Archives: Putting Theory Into Practice

Grace Thomas1, Meghan Lyon1 and Brenda Reyes Ayala2

1Library of Congress, 2University of Alberta

Web archivists managing web archive collections at any scale know that quality assurance is an enormous, inescapable task. It is also ripe for innovation, automation, and new theories to guide the practice forward. This panel proposes to share such advances in evaluating and performing quality assurance for web archives. Brenda Reyes Ayala, University of Alberta, will share a novel approach for assessing quality of web archive captures and Grace Thomas and Meghan Lyon, Library of Congress, will detail how the approach comes to life in daily work to maximize the impact of quality assurance while saving staff time. The panelists hope both sessions will spark fresh ideas for the community in assessing the effectiveness of their own quality assurance workflows.

#1: A Grounded Theory of Information Quality for Web Archives: Dimensions and Applications

Brenda Reyes Ayala

In this work, I present a theory of Information Quality (IQ) for web archives created using the Grounded Theory (GT) Methodology. This theory was created by analyzing support tickets submitted by clients of the Internet Archive's Archive-It (AIT), the popular subscription-based web archiving service that helps organizations build and manage their own web archives. The resulting theory consists of three dimensions of quality: correspondence, relevance, and archivability. The dimension of correspondence, defined as the degree of similarity or resemblance between the original website and the archived website, is the most important facet of quality in web archives. Correspondence has three sub-dimensions: visual correspondence, interactional correspondence, and completeness. The second dimension, relevance, is defined as the pertinence of the contents of an archived website to the original website and consists of two sub-dimensions: topic relevance and functional relevance. The last dimension of IQ is archivability, which is the degree to which the intrinsic properties of a website make it easier or more difficult to archive. Archivability differs from correspondence and relevance because it cannot be directly measured until the website is actually archived. Any proposed archivability measurement that is taken before the website is archived is a probability measure, that is, an estimate of the likelihood that a website will be well preserved. This theory is human-centered, grounded in how users and creators of web archives perceive their quality, independent of the technology currently in use to create web archives. The presentation will also suggest ways that institutions involved in web archiving can operationalise these dimensions in order to measure the quality of their archived websites and move towards automated or semi-automated Quality Assurance (QA) processes.

#2: Building a Sustainable Quality Assurance Lifecycle at the Library of Congress

Grace Thomas & Meghan Lyon

Over the past two years, the Library of Congress Web Archiving Team (WAT) has moved steadily toward a semi-automated model of quality assurance (QA) for web archives. The WAT reviews data from our crawl vendor, MirrorWeb, at each step of the archive process, including pre- and post-crawl. Although the WAT is fortunate to be an eight-FTE team, only four members perform daily QA work, and there are 15,000+ seeds in crawl at any given time. Automation was a key component of the next-generation QA workflow for the WAT, as was thinking critically about the WAT’s role in QA alongside the recommending librarians in other Library units who build and maintain the web archive collections. In 2020, the WAT started developing automated procedures by looking at the number of bytes associated with a seed and the number of hops the crawler followed from a seed throughout an entire crawl. These two data points, paired with the Heritrix seeds-report and collection management data from our curatorial tool, Digiboard, gave us a surprisingly detailed view into the crawls that we had previously lacked. After further workflow development, this has become the core post-crawl QA process the team performs, which we hope to improve further by utilizing visual analysis tools. Having a base procedure in use for technical QA was critical, but it still didn’t give the recommending librarians a streamlined way to review their captures for quality, according to their definitions. The WAT turned to Brenda Reyes Ayala’s Grounded Theory framework for quality measurement of web archives in order to divide up QA tasks. Of the framework’s three dimensions of quality, we found that recommending librarians could best evaluate captures based on correspondence, while the WAT’s technical processes covered relevance and archivability.
This session intends to share details of WAT’s comprehensive QA lifecycle, specifically the workflows we perform daily and ones under development. The pre-recorded talk may include a combination of screenshots, workflow diagrams, or live demonstrations.
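
As a rough illustration of the kind of semi-automated triage described above, the sketch below flags seeds whose byte counts or hop counts look anomalous. The thresholds, field names, and flagging logic are entirely hypothetical, not the WAT's actual procedure:

```python
# Hypothetical post-crawl QA triage: flag seeds whose crawl footprint looks
# anomalous so a human can review them. Thresholds and field names are
# invented for illustration, not the Library of Congress team's real values.
def triage(seed_reports, min_bytes=10_000, max_hops=50):
    """Return (seed, reasons) pairs worth a closer manual look."""
    flagged = []
    for report in seed_reports:
        reasons = []
        if report["bytes"] < min_bytes:
            reasons.append("suspiciously small capture")
        if report["hops"] > max_hops:
            reasons.append("crawler may have strayed off-site")
        if reasons:
            flagged.append((report["seed"], reasons))
    return flagged

reports = [
    {"seed": "https://example.gov/", "bytes": 4_200, "hops": 3},
    {"seed": "https://example.org/", "bytes": 9_800_000, "hops": 12},
]
print(triage(reports))
```

The value of this shape is that the cheap pass narrows 15,000+ seeds down to a short review list, leaving human judgment for the captures that actually need it.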

Researching Web Archives: Access & Tools

Chase Dooley1, Jaime Mears2, Youssef Eldakar3, Ben O’Brien4, Olga Holownia5, Dorothée Benhamou-Suesser6, Jennifer Morival7

1Web Archives Team, Library of Congress, 2LC Labs, Library of Congress, 3Bibliotheca Alexandrina, 4National Library of New Zealand, 5IIPC/CLIR, 6BnF, 7University of Lille

This session will focus on the efforts of several institutions in providing access to web archives and derivative data in ways that encourage researcher engagement with archived web material. Across the IIPC community, institutions have varying levels of expertise and resources available to facilitate research use of web archives. One of the goals of the IIPC Research Working Group (RWG) has been gathering and sharing information about research-related activities among the IIPC members, as well as supporting use of the IIPC collaborative collections. We will present activities from different ends of this spectrum: from the Library of Congress, which collects vast amounts of data and has dedicated personnel to do exciting things with it, to a collaboration between the IIPC CDG and Bibliotheca Alexandrina that brings the data to the infrastructure and expertise.

#1: Republishing IIPC Collections Through Alternative Interfaces for Researcher Access

Youssef Eldakar

Through the Content Development Group, the IIPC has for several years been harvesting thematic web archive collections, curated by participating member institutions as a collaborative effort. These collections, which span topics such as online news, World War One Commemoration, the Olympic Games, and the Covid-19 pandemic, are made available through a conventional wayback interface via the Archive-It service. Recognizing the value of the collections as research datasets, earlier in 2021, the IIPC and Bibliotheca Alexandrina started working together to experiment with the idea of republishing the collections through alternative access interfaces, hosted at Bibliotheca Alexandrina, that offer added functionality beyond conventional web archive playback to accommodate research use cases. These alternative interfaces are, namely, SolrWayback, which offers a variety of search functionalities to supplement basic web archive retrieval, and LinkGate, which offers a temporal graph visualization interface into a web archive. (Both SolrWayback and LinkGate are developed by IIPC member institutions.) In this presentation, we report on what has been accomplished so far, go through a light technical overview of integrating the collections and republishing through SolrWayback and LinkGate, and show examples from the republished collections to demonstrate the additional functionality offered by the alternative access paradigms.

#2: (In dex)terity and Innovation: Experimenting with Web Archive Datasets at the Library of Congress

Chase Dooley & Jaime Mears

Over the past several years, the Library of Congress has been steadily publishing derivative datasets generated from our web archives indexes for public use, including format samples from dot gov domains, memes, and web comics. The latest (and biggest) release was all of the United States Election Web Archive Index Datasets from 2000-2018. In this talk, Chase Dooley of the Web Archiving Team and Jaime Mears of LC Labs will discuss what’s been made available and how researchers can explore hundreds of GBs of internet history.

#3: Expanding the uses of web archives by research communities: the ResPaDon project

Dorothée Benhamou-Suesser & Jennifer Morival

The ResPaDon project aims to expand the use of web archives by research communities. It is a two-year project supported by national funding, undertaken by the BnF (National Library of France) and the University of Lille, in partnership with Sciences Po and Campus Condorcet. It brings together researchers and librarians to promote and facilitate broader academic use of web archives by reducing the technical and methodological barriers researchers may encounter. Underlying this approach, a usage analysis led by GERiiCO, a research team in information science, should allow for a better understanding of the needs of researchers with varying profiles. Moreover, two experiments will be carried out during the project. The first aims to adapt a well-known piece of software developed by the Medialab of Sciences Po, the web crawler and corpus curation tool Hyphe, to the BnF Wayback Machine. A “datasprint” held at the beginning of April 2022 brought together different research communities, data engineers and archivists to explore new ways of comparing the live web and the archived web. The second experiment involves the development of an access point to the BnF web archives collections at the University of Lille libraries, which will provide secure access to web archives, as well as TDM tools to explore the 2022 presidential election corpus. The plan is to conduct tests of the access point with researchers, in order to identify the services and skills required to support researchers in their use of web archives.

Electronic Literature and Digital Art: Approaches to Documentation and Collecting

Ian Cooke1, Giulia Carla Rossi1, Tegan Pyke2, Natalie Kane3, Stephen McConnachie4, Bostjan Spetic5, Borut Kumperscak5

1British Library, 2Cardiff Metropolitan University, 3Victoria & Albert Museum, 4British Film Institute, 5Computer Museum in Ljubljana

This panel will present different approaches to collecting and documenting complex digital objects on the web. From defining quality assurance criteria and collecting contextual information for electronic literature to a decision model to assess acquisition of born-digital and hybrid objects, this session will explore the challenges faced by cultural heritage institutions in collecting, preserving and providing access to complex digital objects.

#1: New Media Writing Prize Collection

Giulia Carla Rossi & Tegan Pyke

The New Media Writing Prize Collection is a special collection in the UK Web Archive, created using web archiving tools to capture instances of the online interactive works that were shortlisted or won the Prize since its launch. It is part of the work conducted by the six UK Legal Deposit Libraries on emerging formats, which includes curating collections of web-based interactive narratives in the UK Web Archive. This presentation will focus on the quality assurance criteria adopted to assess the quality of the captures within this specific collection, which expand on technical criteria to include considerations on the narrative and literary quality of digital interactive publications. It will also touch on documentation for emerging formats, in particular collecting and creating contextual information around complex digital publications.

#2: Preserving and Sharing Born Digital Report

Natalie Kane & Stephen McConnachie

This presentation will introduce a decision model for preserving and sharing born-digital objects in the context of cultural heritage institutions. The decision model, authored by McConnachie and Tom Ensom, is the result of the AHRC-funded project Preserving and Sharing Born Digital Objects Across the National Collection, led by Kane as part of the Towards a National Collection programme. It aims to aid organisations in the decision-making process when assessing an object entering a collection, traversing technical constraints such as digital data, software and web content, as well as data protection requirements, collections policy, and intellectual property rights. This presentation will also show where curatorial decision making and intervention are key in these processes, a central outcome and recommendation of the Preserving and Sharing Born Digital report, published in January 2022.

#3 Proposal of a new microformat for deep archiving of the web

Bostjan Spetic & Borut Kumperscak

Some web projects are too important to be archived only through crawler snapshots. They deserve deep archiving, which should include the website code, a database dump, static files and any other dependencies required to re-create a live user experience. There are many technical and operational challenges to achieving this systematically, but in this presentation we will focus on the lack of standardization that makes it hard to follow basic museum processes when working with such objects. We will argue that rather than cataloging individual assets, or the package as such, we should instead register the fully functional replica, equipped with a virtual object label in a proposed standardized format. We will suggest that these interactive web objects should be understood similarly to old musical instruments, where preservation implies regular use to ensure they stay in working condition. We will use actual examples that our museum worked on in the past year to showcase a simple and functional microformat, ‘museums.txt’, for cross-referencing ‘live’ replicas with the catalog entry and all related museum procedures.
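
The abstract leaves the microformat's fields unspecified; purely as a hypothetical illustration of the cross-referencing idea (a machine-readable "object label" sitting alongside a running replica), an entry might look something like the following. Every field name and URL here is invented for illustration, not the authors' actual format:

```
# museums.txt -- hypothetical sketch, not the proposed format itself
replica: https://replicas.example.org/project-x/
catalog-entry: https://catalog.example.org/objects/1234
acquired: 2021-06-01
last-verified: 2022-03-15
contact: curator@example.org
```

The point of such a file, on this reading, is that the replica and the catalog record each point at the other, so standard museum procedures can be applied to a living web object.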

Research use of the National Web Archives

Jason Webber1, Liam Markey2, Sara Abdollahi3, Márton Nemeth4, Gyula Kalcsó5

1British Library, 2University of Liverpool, 3L3S Research Center, 4National Széchényi Library, Hungary, 5Digital Humanities Center, National Széchényi Library, Hungary

#1: Mediating Militarism: Chronicling 100 Years of British ‘Military Victimhood’ from Print to Digital, 1918-2018

Liam Markey

In the UK, the armistice/remembrance period has been a feature of national life since the end of the First World War. Public attitudes to this annual event have not remained static; in fact, there have been many shifts over the last century. In this presentation Liam will demonstrate how he has examined UK newspapers with different political affiliations, along with the UK Web Archive, and found evidence of these changing public attitudes. This work highlights how researchers can utilize multiple resources to answer a specific research question.

#2: Building Event Collections from Web Archives

Sara Abdollahi

Sara’s research with the UK Web Archive aims to test whether the target websites within an event-based collection can be ranked. If a user is looking at a time-based event collection, how might they know which are the key resources to look at? Are some more relevant than others? Sara’s project uses knowledge graphs to create a formula that ranks the websites within a collection. Some event collections within the UK Web Archive have hundreds or even thousands of targets and, if successful, a ranking system would be a boon to researchers.

#3: Data Extraction and Visualization of Harvested WARC Files of Thematic Collection on Ukrainian War at the National Széchényi Library

Márton Nemeth & Gyula Kalcsó

The National Széchényi Library has been collecting news related to the Russian-Ukrainian war from 75 Hungarian-language news portals (from Hungary and the neighboring countries) since 21 February 2022. Since the beginning of March, another thematic collection has been established, covering websites and social media sites about Hungarians in the Transcarpathian region of Ukraine; it currently contains more than 1,000 seed URLs. Harvests run 1-2 times per week. The harvested content is being analyzed as big data in a project recently formed by the web archiving team and the newly established digital humanities research centre within the national library. The data that can be extracted from harvested WARC files can form an important basis for various data visualisations. In the current project, the goal was to create an animated word cloud that shows the change in word frequency during the war. To do this, the Digital Humanities Centre first extracts NLP-processable text from the WARC files in two steps: a Python module called warcio is used to extract the HTML parts of the WARC files, and then a module called justext is run on the HTML files to strip the so-called boilerplate from the texts. The resulting texts are then run through the emtsv (e-magyar) modules (tokenisation, lemmatisation, morphological analysis, disambiguation). The resulting output is processed with Linux commands (filtering out unnecessary POS-tagged lemmas and other unnecessary elements, then aggregating and sorting the data according to the different POS tags) to obtain frequency lists. Word clouds can then be created from the data of harvests carried out at various times, which can be animated to visualise the changes in the vocabulary of Hungarian news portals during the war.
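
The shape of the extract-HTML-then-count pipeline can be sketched with the standard library alone. The real workflow uses warcio to read WARC records and justext for boilerplate removal, so the crude tag-stripping below is only a stand-in, applied here to an inline HTML snippet rather than a WARC file:

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Stdlib-only stand-in for the described pipeline: pull visible text out of
# HTML (skipping script/style, a very crude substitute for justext), then
# build the word-frequency list a word cloud would be drawn from.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_words(html):
    parser = TextExtractor()
    parser.feed(html)
    return re.findall(r"\w+", " ".join(parser.chunks).lower())

html = "<html><style>p{}</style><body><p>War news: war escalates.</p></body></html>"
freq = Counter(html_to_words(html))
print(freq.most_common(3))
```

In the actual project, the extracted text would additionally pass through the emtsv modules for lemmatisation before counting, so that Hungarian inflected forms collapse onto one lemma.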

Pulling Together: Building Collaborative Web Collections

Alex Thurman1, Nicola Bingham2, Miranda Siler3, Lori Donovan4, Madeline Carruthers4, Sumitra Duncan5, Alice Austin6, Eilidh MacGlone7 and Leontien Talboom8

1Columbia University Libraries, 2British Library, 3Ivy Plus Libraries Confederation, 4Internet Archive, 5Frick Art Reference Library, 6Edinburgh University, 7National Library of Scotland, 8Cambridge University Library

This session brings together speakers from several leading GLAM organizations to discuss how institutions with diverse collecting remits and institutional frameworks are working together to build collaborative web archive collections leveraging shared infrastructure and enhanced by broader expertise and greater diversity of perspectives. Speakers will address both the benefits and challenges of collaborative working and will describe in detail the work of four new and established projects with very different collecting scopes.

Projects & Presenters

Archive of Tomorrow (AoT) / Alice Austin (Edinburgh University), Eilidh MacGlone (National Library of Scotland), Leontien Talboom (Cambridge University Library)

Collaborative ART Archive (CARTA) / Madeline Carruthers (Internet Archive), Lori Donovan (Internet Archive), Sumitra Duncan (NYARC/Frick Art Reference Library)

Ivy Plus Libraries Confederation (IPLC) Web Collection Program / Miranda Siler (IPLC)

International Internet Preservation Consortium Collaborative Collections (IIPC) / Nicola Bingham (British Library) & Alex Thurman (Columbia University Libraries)

#1: Archive of Tomorrow (AoT)

Eilidh MacGlone & Alice Austin

The Archive of Tomorrow project, funded by the Wellcome Trust, will explore and preserve online information and misinformation about health and the Covid-19 pandemic. The project aims include the formation of a research-ready ‘Public Health Discourse’ collection of around 10,000 websites within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources. Led by the National Library of Scotland, the project is collaborative in nature, with partners at Cambridge University Library, Edinburgh University Library and the Bodleian Libraries, Oxford, and we hope to continue this collaborative approach by inviting potential researchers and other key stakeholders to contribute to the development of the project’s collecting framework and the resulting collection. This talk will discuss the benefits and challenges of such collaborative working, with a focus on how discussions have been conducted regarding the development of the collecting framework.

#2: Collaborative ART Archive (CARTA)

Lori Donovan, Madeline Carruthers, Sumitra Duncan

Collaborative ART Archive (CARTA), a project managed by the Internet Archive and the New York Art Resources Consortium (NYARC), with participation from a growing cohort of dozens of art and museum libraries in the US and internationally, enables members to collectively capture, preserve, and provide access to at-risk web-based art materials. The amount of art-related web content and its ephemerality has increased as many galleries, arts organizations, and artists have been impacted by the pandemic. This has highlighted the importance of a model that leverages shared infrastructure, expertise and collecting activities amongst art libraries to scale the extent of web-published, born-digital materials preserved and accessible for art scholarship and research. Presenters will discuss the various goals of CARTA members and the impacts of this work on professional practice for the art and museum library field, web archivists, and end users, including art and digital humanities researchers.

#3: Collaborative Collection Development: Challenges and Opportunities

Miranda Siler

The Ivy Plus Libraries Confederation (IPLC) Web Collecting Program is a collaborative collection development effort to build curated, thematic collections of freely available, but at-risk, web content in order to support research at participating Libraries and beyond. Participating Libraries are: Brown University, the University of Chicago, Columbia University, Cornell University, Dartmouth College, Duke University, Harvard University, Johns Hopkins University, Massachusetts Institute of Technology, the University of Pennsylvania, Princeton University, Stanford University, and Yale University, although collections occasionally bring in outside experts from other institutions. Seeds for the IPLC collections are chosen by subject-specialist librarians from participating institutions working together across institutional lines. The Confederation currently hosts 30 public collections with content in over 100 languages, covering subjects such as art, science, and politics. While this collaborative work has been highly successful at the collection level, the program as a whole can at times feel fractured due to the specificity of the subjects covered. Selector groups rarely know what is being collected by other groups, which can lead to metadata quirks and data budget concerns, among other things. This talk will focus on one of the current IPLC goals to bring these thematically diverse collections into conversation with one another, in a way that is not only interesting and useful for potential researchers, but also asks curators to look at their own individual collections with fresh eyes.

#4: IIPC Collaborative Collections

Nicola Bingham & Alex Thurman

The International Internet Preservation Consortium (IIPC) is a network of web archiving experts representing member organizations from over 35 countries, including national, university and regional libraries and archives. Founded in 2003, the IIPC in 2015 started its Content Development Working Group (CDWG) to develop a formal collaborative collections initiative charged with building public web archive collections on topics of high interest to IIPC members, broader than any one member's responsibility or mandate, and with added research value due to the broader perspective realized by their multinational and multilingual scope. The initiative has since produced rich collections on events such as the commemoration of the centennial of World War One, seven Olympic Games, and the COVID-19 pandemic, as well as limited-duration thematic collections on Artificial Intelligence, Climate Change, and Online News, and long-term open-ended collections of the websites of Intergovernmental Organizations and National Olympics/Paralympics Committees. The current co-chairs of the CDWG will present an overview of the IIPC's collaborative collections initiative with attention to the program's background and rationale, workflow examples and lessons learned from the realization of its collections, challenges posed by data budgets and dependence on all-volunteer labor, and goals for engaging more CDWG participants in the harvesting and QA tasks and diversifying the harvesting tools used beyond Archive-It.

Rapid Response Collecting: What Are the New Workflows and Challenges?

Tom J. Smyth1, Melissa Wertheimer2

1Library and Archives Canada, 2Library of Congress

These sessions discuss the implementation and elaboration of “rapid response” web archival curation methodologies, including new workflows, challenges, use cases, and the creation of program efficiencies, in order to better respond to the need to document unplanned and unforeseen events of national historic significance.

#1: Managing Evolving Scopes and Elaborating our Events-Based Program Methodologies

Tom J. Smyth

Since 2013, the Web and Social Media Preservation Program team at Library and Archives Canada has developed web archival curation methodologies for reacting to and documenting what we formerly called “events-based collections”. However, it quickly became apparent that these collections could absorb considerable unplanned and unforeseen resources, and we needed new ways to create efficiencies. COVID-19 has unfortunately proven us right about the need to create efficiencies for documenting events, and has also definitively demonstrated that web archiving truly is an immediate action for information rescue and preservation of critical born-digital documentary heritage, which will become a primary source for future research. This talk will delve into Library and Archives Canada’s (LAC) “rapid response” methodologies in web archival curation, and outline how we created efficiencies by moving from a reactive to a proactive approach. It will also report on the current state of our national COVID-19 web archive.

#2: When Time and Resources are of the Essence: Archival Appraisal and the Library of Congress Coronavirus Web Archive

Melissa Wertheimer

When a major event unfolds and your job is to document it through web archives, how do you begin? Will you be able to answer why web archiving is the best method? Melissa Wertheimer, Music Reference Specialist at the Library of Congress, will share her answers to these questions in a case study about the Library of Congress Coronavirus Web Archive. To address the need for this collection in a resource-constrained environment, Melissa created an appraisal rubric that all curatorial team members applied to their selections for this monumental collection. Melissa will discuss how and why she applied existing theories and methods of archival appraisal to create the rubric, its impact on team workflows, and its ongoing use to curate a collection that was essential to create and maintain.

Saving Ukrainian Cultural Heritage Online

Quinn Dombrowski1, Anna Kijas2, Sebastian Majstorovic3, Ilya Kreymer4, Kim Martin5, Erica Peaslee6 and Dena Strong7

1Stanford University, 2Tufts University, 3Austrian Centre for Digital Humanities and Cultural Heritage, 4Webrecorder, 5University of Guelph, 6Centurion Solutions, LLC., 7University of Illinois

This Q&A panel will focus on three presentations from volunteers with Saving Ukrainian Cultural Heritage Online (SUCHO), an impromptu organization founded on March 1st to attempt to archive as many at-risk Ukrainian cultural heritage websites as possible during the war. In the course of three weeks, SUCHO has amassed over 1,300 volunteers and has archived upwards of 3,000 websites, using a combination of technical approaches, including creating high-fidelity archive files using Webrecorder, sending URLs to the Wayback Machine, and writing scrapers for specific platforms (Omeka, DSpace) and individual sites. These presentations will offer three different perspectives on emergency, volunteer-driven web archiving.

#1: A Crash Course in Web Archiving & Community

Quinn Dombrowski, Anna Kijas, Sebastian Majstorovic

SUCHO began as a tweet about a data rescue event for Ukrainian music collections (modeled after events held after Trump’s election in 2016), and four days later it was an international collaboration between 400 strangers. This talk will provide context for the rest of the panel by covering how the project started, and tracing its development through its crucial early weeks. The three co-organizers of SUCHO will reflect on some of the key decisions made by organizers, such as choice of tooling (prioritizing Webrecorder to support highly distributed archiving for anyone vs. systems like Archive-It that are tied to accounts and external infrastructure), choice of scope (cultural heritage sites, defined very broadly), use of Slack and Google Docs, and the creation of tutorials and synchronous onboarding sessions in the early days. The organizers will talk about the balance between supporting volunteers’ own initiative on one hand, and not letting the project become entirely unwieldy on the other. They will also discuss the role of personality and work style: their similarity to one another made it easy to work together, even coming together as (near-)strangers, while perspectives from other volunteers with different work styles were important in shaping the project into something more comprehensible and tractable for more people. Finally, the group will reflect on the role that institutions like universities and consortia play in a project like SUCHO, compared to the role that concerned individuals can play.

#2: From Raspberry Pi to Cloud: Distributed Web Archiving with Webrecorder

Ilya Kreymer

One of the key goals of the Webrecorder project has been to enable anyone to create web archives on their own. The SUCHO project has helped this goal become a reality, by bringing together a group of over 1,000 volunteers, each contributing their part to the shared effort to archive Ukraine’s cultural heritage. As it turned out, almost all key Webrecorder tools ended up playing a role in the SUCHO workflow to support this effort. This presentation will cover how different Webrecorder tools provided different ways for volunteers of varying skills to get involved. From running Browsertrix Crawler on anything from a Raspberry Pi to high-CPU cloud machines, to sharing crawler configs and writing documentation, Browsertrix Crawler was a first foray into web crawling for many technical users. Volunteers less comfortable with the command line were initially able to contribute by using the ArchiveWeb.page extension, and later, by using the brand new Browsertrix Cloud instance. ReplayWeb.page allowed for quick QA, and the WACZ format allowed all web archives to be stored on static storage as open data and easily replicated. The presentation will cover how each of these Webrecorder tools facilitated different use cases, highlighting Browsertrix Crawler, the new Browsertrix Cloud system, and the WACZ format, and how they can better support community web archiving efforts. At the end, we will also include a very quick demo of the new Browsertrix Cloud browser-based crawling system, also supported in part by the IIPC Tools Group “Browser-Based Crawling System for all” project.

#3: SUCHO Community Coordination

Kim Martin, Erica Peaslee, Dena Strong


The challenges of coordination in large organizations are well-known – and that's assuming your large organization is a long-running one with formal hiring processes and training and a community of teammates who already know each other. Now imagine what it's like when you have 1300 volunteers who came together at the drop of a Tweet across 20 time zones in less than a week! Managing that information pipeline became Dena's more-than-full-time job for 2 weeks, where we were inventing processes in the morning, posting documentation within an hour, holding video trainings the following day, and revising and re-communicating all of it regularly across 15 channels as we invented ways to make the workflow smoother.


A highly distributed archiving project can be challenging enough. Using that model to race against time during an active war required looking outside the bounds of the group spreadsheets and Slack channels to bring in information and open-source intelligence, to try to get ahead of any on-the-ground factors that could bring sites down. To document and preserve cultural heritage practices with a dynamic, global team, Erica implemented practices and prioritisation matrices used in emergency management, modified for this unique project.


When collecting thousands of links and images from the web, organization has to be top of mind. When SUCHO started, volunteers were scraping the web and uploading items individually to a SUCHO collection on the Internet Archive. After about a week, the need for metadata that would help future users locate (and piece back together) cultural heritage collections became clear. Alex Wingate took the lead on a metadata template that got the ball rolling, then we started a small but mighty metadata team to provide detailed descriptions for the first 14,783 items in the SUCHO collection. I’ll talk briefly about leading this effort, focussing on concerns around language, provenance, and collaborative decision-making.


Guided Tours Into Web Archives

Anaïs Crinière-Boizet, Isabelle Degrange, Vladimir Tybin

Bibliothèque nationale de France

Since 2008, BnF has developed a new way to promote its web archives to the public: guided tours. Nineteen thematic guided tours have been added to our Wayback Machine, “les Archives de l’internet”, on various topics such as elections on the internet, e-government services, science, art, travel, cooking and, more recently, the Covid-19 pandemic and the French region of Lorraine. A guided tour is a thematic, edited selection of archived pages from our collections, organised into several themes (approximately ten). Each archived page is accompanied by a short documentary description. These guided tours are conceived in cooperation with our network of contributors, who participate in the selection of content to be crawled. There are two kinds of contributors: internal (at BnF) and external (from 26 libraries in the regions and overseas territories, as well as researchers). After this editorial work, the guided tour is implemented in our Wayback Machine during the release of a new version. Because our web archives are subject to French legislation, they can only be accessed at BnF and in 21 libraries in the regions, and the same applies to our guided tours. That is why we published text-only PDF versions of these tours, without any images, on the BnF website. For the Covid-19 guided tour in 2021, we also decided to publish a slideshow of screenshots from our collections, for which we asked the website owners' authorization, to illustrate the text versions of the guided tours.

Exhibiting Web Memories From Arquivo.pt With Free Tools

Ricardo Basílio, Daniel Gomes

Arquivo.pt

Arquivo.pt has 28 million websites accessible and searchable by text, image and URL. However, this huge amount of information and easy access do not by themselves guarantee the community's interest. How can we show the community the potential, and even the charm, of historical web content? Arquivo.pt was created in 2007 and its mission is to collect content from the Portuguese web. Content may be searched by text, image and URL, and is fully accessible. It is necessary to capture the attention of target communities so that they use the service, whether for research or for the valorisation of the historical memory of institutions. To meet this challenge, Arquivo.pt has been creating online exhibitions by theme or by type of organisation, such as press, radio, municipalities, R&D units, schools and museums. Each exhibition is followed by a dissemination campaign and attracts new collaborations. For example, the exhibition "Memory of Art Festivals and Events" was made using the free version of a website-building tool and presents a set of websites preserved by Arquivo.pt. It is the result of collaboration with the Calouste Gulbenkian Foundation Art Library, a leading institution on Art in Portugal, and ROSSIO, a Digital Humanities infrastructure led by the Faculty of Social Sciences of the New University of Lisbon. This initiative generated interaction with the artists' community and contributed to improving the preservation of art websites and art events. This presentation shows how the online exhibitions were developed. It explains how low-cost tools, accessible to any person or organisation, were used to present preserved historical web content. It shows the impact of the exhibitions on the dissemination of Arquivo.pt, the involvement of communities, and the improvement of the quality of preserved web sets.

Web Archiving the Olympic & Paralympic Games

Helena Byrne

British Library

The International Internet Preservation Consortium (IIPC) first started to develop collaborative collections in 2010. The very first collection was on the 2010 Winter Olympics hosted in Vancouver, Canada. 2022 marks ten years since the IIPC started to collect on the Paralympic Games alongside Olympic events. So far the IIPC has curated web archive collections on seven Olympic Games and six Paralympic Games, the most recent being Beijing 2022; the collection period for this latest edition of the Games ran from 20th January to 20th March 2022. The structure of collaborative collections within the IIPC has changed over time. The early Olympic/Paralympic collections from 2010 and 2012 were overseen by the IIPC Access Working Group; it wasn’t until late 2014 that the IIPC Content Development Group (CDG) was formed. Since 2015, the CDG has subscribed to Archive-It to do its crawling and make all the collections publicly available. Winter Games are smaller than the summer editions, and this is reflected in the target numbers in the collections: the 2010 Winter Olympics collection has almost 200 URLs, while the recent 2022 Winter Olympics and Paralympics collection has over 800. In contrast, the combined total for the 2012 Olympic and Paralympic Games collection was nearly 3,000 URLs, and the 2020 Olympic and Paralympic Games collection (held in 2021) was over 1,000 URLs. This poster will reflect on what has previously been collected in the IIPC Olympic/Paralympic collections.

Arquivo.pt as Open Data Provider

Daniel Gomes, Ricardo Basilio

The Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information stipulates that “To facilitate re-use, public sector bodies should, where possible and appropriate, make documents, including those published on websites, available through an open and machine-readable format and together with their metadata, at the best level of precision and granularity, in a format that ensures interoperability”. The Portuguese Law No. 68/2021 of 2021-08-26 approves the general principles on open data and transposes the European Directive. Arquivo.pt is a public service whose mission is to preserve documents published on Internet sites, enabling their long-term open access, and it provides interoperable electronic interfaces (APIs) for their automatic processing. The Administrative Modernization Agency, IP (AMA) is the public institute that carries out duties in the areas of administrative modernization and simplification and electronic administration. Arquivo.pt has been collaborating with AMA with the aim of improving the preservation of Public Administration websites. AMA recognized Arquivo.pt as a public service and open data provider and awarded it its certification seal on the Open Data Portal. In 2021, Arquivo.pt provided open access to over 10 billion files (721 TB) from 27 million websites. Besides the original web artefacts preserved at Arquivo.pt, the service has generated 27 open datasets derived from its activities, which are now available in open access so that they can be reused. The open data preserved by Arquivo.pt can be explored through the search interface, automatically through APIs, or by reusing derived datasets. Any citizen can access the open data resulting from these historical archives and, for example, search for official information published on the websites of successive governments.

The Past Web: A Book to Support Web Archiving

Daniel Gomes

Since the publication of “Web Archiving” in 2006, no book had been published that reflects the state of the art in web preservation and the research conducted on web archives. The main goal of the new book “The Past Web: Exploring Web Archives” was to create an up-to-date resource to educate more people in the field of web preservation and to make web archives known to researchers and academics. As such, the book is primarily aimed at the academic and scientific communities, and presents the most innovative methods for exploring information from the past preserved by web archives. It is also a valuable tool to train and integrate new web archivists. Daniel Gomes, manager of Arquivo.pt, led the book’s editorial team, which also included the field specialists Elena Demidova, Jane Winters and Thomas Risse. In total, the book resulted from the contributions of 40 authors from around the world who are experts in web archiving. The book is divided into 6 parts, covering various resources for exploring pages archived from the Internet since the 1990s. Readers can also learn how to preserve our collective memory in the Digital Era, which strategies to use when selecting online content, and what impact web archives have on preserving historical information. There is an urgent need to include web archives in teaching plans, and this knowledge brings a great competitive advantage, especially for students of the Humanities and Social Sciences. The book aims to support professors in their mission to transmit the innovative and adequate knowledge required for the digital literacy needed to train professionals for the 21st century. An innovative detail of this book is that all its cited links have been preserved by Arquivo.pt in order to ensure that the references remain valid over time.

Archiving COVID-19 Memory Websites: "COVID-19 Images and Stories" and Other Sites

Tyng-Ruey Chuang, Chia-Hsun Ally Wang

Academia Sinica

Since the start of the pandemic in early 2020, numerous COVID-19 memory websites have sprung up around the globe. Many are collaborative in nature: members of the public are encouraged to contribute their witness images and stories about COVID-19 to the sites. A sample of these COVID-19 memory websites includes COVID-19 memories, hosted in Luxembourg; Corona Memory, hosted in Switzerland; and COVID-19 Images and Stories, hosted in Taiwan, among others. Archiving these websites for current and future research, as well as for long-lasting community memory, however, presents interesting issues. On the one hand, we want to preserve the essential materials from the contributors (their original images, for example) more than the look-and-feel of the site. On the other hand, we also wish to keep background information about the community (including contributor contacts, for example), but this may not be possible without working closely with the website operator. We will report on the why and how of archiving COVID-19 Images and Stories [1], a website we also operate. We will discuss overlapping issues across collaborative systems for community memories and curatorial systems for digital collections (Omeka S [2], for example). Last but not least, we will show how research data repositories (depositar [3][4], for example) can facilitate long-term preservation and accessibility of web archives (e.g. in realizing the FAIR data principles [5]).


[1] Various contributors. (2020– ). COVID-19 影像與敘述 (COVID-19 Images and Stories).

[2] Digital Scholar project team. (2008– ). Omeka S.

[3] The depositar team. (2018– ). What is depositar?

[4] Tyng-Ruey Chuang, Cheng-Jen Lee, Chia-Hsun Wang, Yu-Huang Wang. (2021). Experience in Moving Toward An Open Repository For All. Open Repositories 2021 presentation.

[5] FORCE11. (2020). The FAIR Data Principles.

The Evolving Treatment of Wayback Machine Evidence by U.S. Federal Courts

Nicholas Taylor

How have courts regarded evidence from web archives? What can their outlier interpretations of evidence from web archives tell us about their evolving perspective? What are the consistent gaps in their understanding that we could help to address through outreach and education? Once in a while, a noteworthy ruling involving evidence from web archives or a piece of legal commentary about web archives will circulate with interest among the web archiving community via social media. However, these occasional, circumscribed updates offer an incomplete picture of overall trends and the treatment of evidence from web archives. The community would benefit from a more systematic review, to better apprehend how courts understand web archives and how that understanding has changed over time. Based on an analysis of more than two hundred U.S. federal court cases whose rulings reference the Internet Archive Wayback Machine (IAWM), I will summarize the evolving treatment of evidence from web archives in that jurisdiction. This will include a review of strategies for the authentication and admission of evidence from IAWM, their relative success, forum-specific observations, and highlights of a few particularly noteworthy cases. Session attendees will hopefully come away with a better understanding of what it takes to have evidence from web archives accepted in U.S. federal courts.

Archiving Source Code in Scholarly Content: One in Five Articles References GitHub

Emily Escamilla1, Talya Cooper2, Vicky Rampin2, Martin Klein3, Michele C. Weigle1, Michael L. Nelson1

1Old Dominion University, 2New York University, 3Los Alamos National Laboratory

The definition of scholarly content has expanded to include the data and source code that contribute to a publication. Major archiving efforts to preserve scholarly content in PDF form (LOCKSS, CLOCKSS, and Portico) are well underway, but no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the contemporary look and feel of the GHPs, including issue threads, pull requests, and wikis while maintaining their original URLs. For academic projects where reproducibility matters, this ephemera adds important context. To understand and quantify the scope of this problem, we analyzed the use of GHP URIs in the arXiv corpus from April 2007 to December 2021 including the number and frequency of GHP URIs. We found that GitHub, GitLab, SourceForge, and Bitbucket were collectively linked to 206 times in 2008 and 74,227 times in 2021. In total, there were 217,106 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 1.56 million publications in the arXiv corpus. In 2021, one out of five publications included a URI to GitHub. The complexity of GHPs like GitHub is not amenable to conventional Web archiving techniques. Therefore, the growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive the holdings of GHPs to preserve research code and its scholarly ephemera.
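The URI extraction underlying such a corpus study can be sketched as a simple pattern match. The regular expression and helper below are illustrative assumptions, not the authors' actual pipeline:

```python
import re
from collections import Counter

# Hypothetical sketch: count links to Git Hosting Platforms (GHPs)
# across a corpus of plain-text documents.
GHP_PATTERN = re.compile(
    r'https?://(?:www\.)?'
    r'(github\.com|gitlab\.com|sourceforge\.net|bitbucket\.org)'
    r'/[\w.-]+(?:/[\w.-]+)?',
    re.IGNORECASE,
)

def count_ghp_uris(texts):
    """Return a Counter of GHP hostnames found across the given texts."""
    counts = Counter()
    for text in texts:
        for match in GHP_PATTERN.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts
```

A real study would additionally deduplicate repeated links within a paper and bucket the counts by publication year to produce the trend figures described above.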

WARC Collection Summarization

Sawood Alam

Internet Archive

Items in the Internet Archive’s Petabox collections of various media types like image, video, audio, book, etc. have rich metadata, representative thumbnails, and interactive hero elements. However, web collections, primarily containing WARC files and their corresponding CDX files, often look opaque. We created an open-source CLI tool called "CDX Summary" [1] to process sorted CDX files and generate reports. These summary reports give insights on various dimensions of CDX records/captures, such as the total number of mementos, the number of unique original resources, the distribution of media types and their HTTP status codes, path and query segment counts, temporal spread, and capture frequencies of top TLDs, hosts, and URIs. We also implemented a uniform sampling algorithm to select a given number of random memento URIs (i.e., URI-Ms) with 200 OK HTML responses that can be utilized for quality assurance purposes or as a representative sample of the collection of WARC files. Our tool can generate both comprehensive and brief reports in JSON format as well as a human-readable textual representation. We ran our tool on a selected set of public web collections in Petabox, stored the resulting JSON files in their corresponding collections, and made them publicly accessible (with the hope that they might be useful for researchers). Furthermore, we implemented a custom Web Component that can load CDX Summary report JSON files and render them as interactive HTML. Finally, we integrated this Web Component into the collection/item views of the main site of the Internet Archive, so that patrons can access rich and interactive information when they visit a web collection/item in Petabox. We also found our tool useful for crawl operators, as it helped us identify numerous issues in some of our crawls that would have otherwise gone unnoticed.
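The kind of aggregation such a summary performs can be sketched as follows. The field order assumed here is the common `urlkey timestamp original mimetype statuscode …` CDX layout; this function is an illustration, not the actual CDX Summary implementation, which reads the field order from the file's header line:

```python
from collections import Counter

def summarize_cdx(lines):
    """Summarize CDX records: capture count, unique original URLs,
    and MIME-type / HTTP-status distributions."""
    mimes, statuses = Counter(), Counter()
    originals = set()
    total = 0
    for line in lines:
        if not line.strip() or line.startswith(' CDX'):
            continue  # skip blank lines and the header line
        fields = line.split()
        if len(fields) < 5:
            continue  # malformed record
        total += 1
        originals.add(fields[2])   # original URL
        mimes[fields[3]] += 1      # MIME type
        statuses[fields[4]] += 1   # HTTP status code
    return {
        "captures": total,
        "unique_urls": len(originals),
        "mime_types": mimes,
        "status_codes": statuses,
    }
```

On a sorted CDX file this can be streamed line by line, which is why even petabyte-scale web collections can be summarized cheaply relative to reading the WARCs themselves.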


Comparing Access Patterns of Robots and Humans in Web Archives

Himarsha R. Jayanetti1, Kritika Garg1, Sawood Alam2, Michael L. Nelson1, Michele C. Weigle1

1Old Dominion University, 2Wayback Machine, Internet Archive

In 2013, AlNoamany et al. of our research group (WS-DL, ODU) studied the access patterns of humans and robots in the Internet Archive (IA) using the Wayback Machine’s anonymized server access logs from 2012. We extend this work by comparing these previous results to an analysis of 2019 server access logs from the IA's Wayback Machine and the Portuguese Web Archive (Arquivo.pt). This comparison is based on robot vs. human requests, session data (session length, session duration, and inter-request time), user access patterns, and temporal analysis. We used a variety of heuristics to classify sessions as robot or human, including browsing speed, loading of images, requests for robots.txt, and User-Agent strings. AlNoamany et al. determined that in the 2012 IA access logs, humans were outnumbered by robots 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. The four web archive user access patterns established in 2013 are single-page access, access to the same page at multiple archive times, access to distinct web archive pages at about the same archive time, and access to a list of archived pages for a certain URL (TimeMaps). We are investigating whether similar user access patterns still persist, whether any new patterns have emerged, and how user access patterns have evolved over time.
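A rough sketch of such session-classification heuristics is below. The thresholds, field names, and User-Agent hints are assumptions for illustration, not the exact rules used in the study:

```python
# Substrings that commonly identify self-declared crawlers.
ROBOT_UA_HINTS = ("bot", "crawler", "spider", "curl", "wget")

def classify_session(session):
    """Label a log session as 'robot' or 'human'.

    session: dict with 'user_agent' (str), 'requested_robots_txt' (bool),
    'image_requests' (int), 'total_requests' (int), and
    'mean_inter_request_secs' (float).
    """
    ua = session["user_agent"].lower()
    if any(hint in ua for hint in ROBOT_UA_HINTS):
        return "robot"
    if session["requested_robots_txt"]:
        return "robot"  # well-behaved crawlers fetch robots.txt first
    # Browsers fetch embedded images; a long session with none suggests
    # automation, as does sustained sub-second request pacing.
    if session["total_requests"] > 10 and session["image_requests"] == 0:
        return "robot"
    if session["total_requests"] > 10 and session["mean_inter_request_secs"] < 1.0:
        return "robot"
    return "human"
```

In practice such heuristics are applied in combination, since any single signal (e.g. a spoofed User-Agent) can be misleading on its own.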

Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives

Kritika Garg1, Himarsha R. Jayanetti1, Sawood Alam2, Michele C. Weigle1, Michael L. Nelson1

1Old Dominion University, 2Wayback Machine, Internet Archive

We discovered that some replayed web pages cause recurrent requests that lead to unnecessary traffic for the web archive. We looked at the network traffic on numerous archived web pages and found, for example, an archived page that made 945 requests per minute on average. These requests are not visible to the user, so if a user leaves such an archived page running in the background, they would be unaware that their browser continues to generate traffic to the web archive. We found that web pages that require regular updates (for example, radio, sports, etc.) and contain an image carousel, widget, etc., are more likely to make recurrent requests. If the resources requested by the web page are not archived, some web archives may patch the archive by requesting the resources from the live web. If the requested resources are not available on the live web either, they cannot be archived, and the responses remain HTTP 404. Some archived pages continue to poll the server as frequently as they did on the live web, while others poll even more frequently when their requests return 404, creating a high amount of unnecessary traffic. On a large scale, web pages like these could potentially cause security issues such as denial-of-service attacks. Significant computational, network, and storage resources are required for web archives to archive and then successfully replay pages as they were on the live web, and these resources should not be spent on unnecessary HTTP traffic. Our proposed solution is to optimize archival replay using HTTP Cache-Control response headers. We implemented a simplified scenario in which we cache HTTP 404 responses to avoid these recurring requests.
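The proposed mitigation can be illustrated with two small caching layers: a browser-facing `Cache-Control` header on 404 replay responses, and a server-side negative cache of index lookups. The names and values below are illustrative assumptions, not the archive's actual configuration:

```python
import functools

def headers_for_missing_resource(max_age=3600):
    """Response headers for an HTTP 404 during replay. Cache-Control
    tells the browser to remember the failure for `max_age` seconds,
    so a page that polls a missing resource stops re-requesting it."""
    return {"Cache-Control": f"public, max-age={max_age}"}

# Stand-in for the archive's CDX index; a real system would query it.
ARCHIVED = set()

@functools.lru_cache(maxsize=8192)
def is_archived(url):
    """Server-side negative cache: remember which URLs are known to be
    absent from the index, so repeated polls cost one cache lookup.
    (A sketch only -- lru_cache over a changing index would need
    invalidation when new captures arrive.)"""
    return url in ARCHIVED
```

The browser-side header is the more powerful lever, since it stops the recurrent requests from ever leaving the user's machine.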

Access Functionalities and Different Automatically Generated or Manually Created Metadata: Worldwide Experiences

Sanaz Baghestani1, Saeed Rezaei Sharifabadi2, Fariborz Khosravi3

1Encyclopaedia Islamica Foundation, 2Alzahra University, Iran, 3National Library and Archives of Iran

The ‘access section’ is an important part of a web archiving policy when establishing a web archive. According to Harrod's librarians' glossary of terms used in librarianship, documentation, and the book crafts, and reference book, the word ‘access’ in information retrieval means: 1) permission and opportunity to use a document; 2) the approach to any means of storing information, e.g. index, bibliography, catalogue, computer terminal. In order to propose a framework for access to the National Web Archives of Iran, we have taken into account both meanings of ‘access’, surveyed them in national web archives around the world, and analyzed the results to gain an overview of worldwide experience. Considering that methods for creating metadata and descriptions are not one-size-fits-all, for the second meaning of ‘access’ we chose national web archives that are IIPC members (more than 30) and explored their search user interfaces (for those web archives that offer free and remote access to the metadata or to the archive itself). We also searched some keywords or browsed a topic, and then examined the results. In this poster, we share and visualize the data gathered and analyzed for the second meaning of access, and present a comprehensive sketch of worldwide experience with web archive access functionalities and the different kinds of metadata and description (both automatically generated and manually created).



Arquivo404

Vasco Rato, Daniel Gomes

Arquivo.pt

Arquivo.pt is a research infrastructure that provides tools to explore historical web data. It has been developed since 2007 to incrementally improve access to its collections and respond to the needs of scientists as well as common citizens. Web pages are constantly changing, often leading to URLs becoming obsolete. When this happens, users’ bookmarks, paper citations and other references to the old URL become broken, pointing to a 404 page instead. The new Arquivo404 service aims to tackle this issue. It is a free and open-source project that improves the 404 page of any website by providing a link to an archived version of the missing page, if one exists. Arquivo404 uses the Memento protocol to check for archived versions, allowing any Memento-compliant web archive to be used in the search. This presentation will show use cases of the Arquivo404 service, detail the technologies it uses and provide some insight into the configurations it allows, namely the addition of other web archives to the search. We believe that Arquivo404 is an enhancement to any website, benefiting both its owner and its users at virtually no cost, while at the same time conveying the importance of web archiving to a greater public.
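The Memento lookup that such a service relies on can be sketched in Python (Arquivo404 itself is a JavaScript snippet embedded in websites; this is a language-neutral illustration, not the project's actual code). A Memento TimeMap (RFC 7089) is a comma-separated list of links; the minimal parser below pulls out the most recent memento URI:

```python
import re

# Each TimeMap entry is "<uri>; attr=...; attr=..."; match the URI and
# the attribute text that follows it (up to the next "<").
LINK_RE = re.compile(r'<([^>]+)>([^<]*)')

def latest_memento(timemap_text):
    """Return the URI of the last memento listed in a TimeMap in link
    format, or None if the archive holds no capture of the page.
    A minimal parser for illustration; a robust client would use a
    full RFC 8288 link parser."""
    mementos = [
        uri for uri, attrs in LINK_RE.findall(timemap_text)
        if 'rel="' in attrs and 'memento' in attrs  # skips original/self/timegate links
    ]
    return mementos[-1] if mementos else None
```

A 404 handler would fetch the TimeMap for the missing URL from each configured archive, and render the returned memento URI as a "view an archived version" link.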


SavePageNow

Pedro Gomes, Daniel Gomes

Arquivo.pt

Arquivo.pt is a research infrastructure that provides tools to explore historical web data. It has been developed since 2007 to incrementally improve access to its collections and respond to the needs of scientists as well as common citizens. The new SavePageNow service allows a page to be archived at the exact moment when the user browses it. This way, SavePageNow allows anyone to save a web page to be preserved by Arquivo.pt: it is only necessary to enter the web page’s address and browse through its contents. For example, a blog publication marking the 30th anniversary of the Internet in Portugal was saved with SavePageNow and preserved at Arquivo.pt. Anyone using SavePageNow is thus helping to ensure that content published on the Internet is not lost. With this new service, users can archive web pages as they are displayed, for use as trusted citations in the future. To capture several pages during one session, users just need to browse through the pages so that all the visited content is archived. The archived content is then automatically integrated into Arquivo.pt's collections. This presentation will detail the technology used to develop SavePageNow, its user interface and workflow, and how the API can be used to save pages automatically. We believe that the SavePageNow service enables users to select and immediately preserve relevant information published online, in high quality, before it changes, for instance on social networks.

Archiving Cryptocurrencies

Pedro Gomes, Daniel Gomes

Since 2008, the cryptocurrency market has revolutionised the world by innovating and expanding into other areas (e.g., finance and art). However, with this rapid expansion, many projects are created every day, giving rise to a wide and varied range of websites, technologies and scams. Markets follow financing stages, and it is during an initial stage of euphoria that most projects are created. We believe that as the cryptocurrency market stabilises, projects and their websites are disappearing because funding diminishes or runs out. Arquivo.pt is a research infrastructure that preserves historical web data mainly related to Portugal, but it also includes thematic collections of international interest. Arquivo.pt initiated a new web archive collection that preserves web content documenting cryptocurrency activities. This presentation will discuss the information sources selected to create this collection, the information processing involved, and how Arquivo.pt can be used to explore the archived documents. This work will also produce a new open dataset with information documenting each cryptocurrency project, including its original URL and a link to the corresponding web-archived version in Arquivo.pt. We believe that this new dataset, together with the preserved web content, has the potential to originate innovative scientific contributions in several areas such as Economics or the Digital Humanities.

Arquivo.pt Training Initiative

Daniel Gomes, Ricardo Basilio

Café with Arquivo.pt is a training initiative open to the community, held online since the beginning of the general lockdown period due to the Covid-19 pandemic in March 2020. As suggested by the name “Café”, these webinars adopt an informal style in which participants can interact with the speakers and the Arquivo.pt team. The aim is to provide innovative training and strengthen ties with the community. Café with Arquivo.pt began as a response to the impossibility of face-to-face training. The initiative has been successful, with 23 sessions held so far. The experience gained now allows online training to be considered a complement for reaching specific audiences or geographically distant ones. Even after the pandemic, training through webinars will continue. Arquivo.pt is a public service accessible to any citizen through the Web. Its training offer aims to reach everyone, from users with basic knowledge of the Web to expert users. This presentation shows how the training open to the community through webinars operates, the topics covered, the number of participants and the degree of satisfaction (e.g., in 2021 there were 538 participants and an 84% satisfaction rate). Two cases are presented where training was done in collaboration with external organisations. The first was a cycle of webinars dedicated to art events, in collaboration with the Art Library of the Fundação Calouste Gulbenkian, a leading institution in the domain of Art, aimed at the artists’ community. The second was a training session in collaboration with the City Council of Lisbon, aimed at citizens with basic skills in using the Web.

Web Archival in the Era of JavaScript

Ayush Goel1, Jingyuan Zhu1, Ravi Netravali2, Harsha V. Madhyastha1

1University of Michigan, 2Princeton University

The widespread presence of JavaScript on the web slows down a web archive's rate of crawling pages and consumes more storage. In addition, since JavaScript execution can differ across loads of a page, archived pages often render incorrectly for users. We identify two principles to address these problems: 1) when a user loads an archived page, the JavaScript on the page must execute as it would on the type of client with which the page was crawled; and 2) on archived pages, much of the JavaScript code can be safely discarded without impacting page fidelity, because it either implements functionality that will not work or will never be executed in any user's load of the page. By applying these principles to a large corpus of archived pages, we eliminated virtually all JavaScript-induced errors when serving these pages, while discarding over 80% of all JavaScript bytes and improving crawling throughput by roughly 40%.

“Enter Here”: Hyperspace and the Web

James Kessenides

Yale University Library

The early days of the web are full of sites containing the simple direction, “Enter Here.” Whether more blunt or more subtle in design, whether situated on a site more popular or academic, commercial or educational, the instruction sprang from the sheer novelty of the web as space -- a virtual landscape of new perceptual experiences requiring some guidance. Today, it may seem quaint, a relic of a time before infinite scroll, continuous updating, and so many other current hallmarks of the web. But “Enter Here” deserves a second look. Revisiting it through and with examples from web archives as its primary source base, this presentation will consider three things: first, the web as “hyperspace,” or as, to borrow from Fredric Jameson, a challenging new perceptual environment in which the sense of volume is limitless and disorientation results; second, the introduction of "Enter Here" to deal with the web as hyperspace, the assorted valences of “Enter Here,” and the timing of when we might be able to say “Enter Here” receded from web design; and, third, the changed attitude toward the hyperspace of the web over time, as implied by the gradual obsolescence of “Enter Here.” This will primarily be an exercise in raising a set of questions rather than providing answers, and it will consider the challenges of using web archives to understand changing perceptions of the web in the absence of web users themselves -- itself a broader question in the use of web archives. Ultimately, the goal will be to model one way of approaching web archives to understand our past, present, and future experiences of the web as space.

A Review of Third-Party Software for Archiving Discord

Kirk Mudle

New York University

This project is an evaluation of third-party software tools for archiving Discord, the VoIP instant messaging and digital distribution platform. Discord has become one of the most widely used communication services and community platforms for both video game developers and gaming communities. The ability to preserve accurate records of Discord servers is crucial for understanding the development, social, and cultural history of many contemporary video games. Due to the platform’s unique properties and invite-only access system, Discord cannot be indexed by conventional search engines, archival tools, or web crawlers. In response, several independent software developers have created tools for saving content and metadata from Discord servers. I tested three of these tools to (1) evaluate their appropriateness for use in a web-archiving workflow, and (2) clarify the technical challenges an archive will face when ingesting, storing, and providing access to archived versions of Discord servers. My poster compares the basic functions, technical properties, use or installation restrictions, and advanced functionalities of each tool. This analysis revealed three technical challenges not fully addressed by any of the evaluated tools: the handling of linked media files and attachments within servers, the lack of access to user activity data, and the dynamic nature of Discord itself. By evaluating the functions and limitations of existing Discord archiving software, this poster lays the groundwork for further research into the curatorial, legal, and ethical challenges related to the preservation and future interpretation of archived versions of Discord servers.

Web Archiving as Entertainment

Travis Reid, Michael L. Nelson, Michele C. Weigle

Old Dominion University

We want to make web archiving entertaining so that it can be enjoyed like a spectator sport. Currently we are working on applying gaming concepts to the web archiving process and on integrating video games with web archiving. We are creating web archiving live streams and gaming-focused live streams that can be uploaded to video game live streaming platforms like Twitch, Facebook Gaming, and YouTube. Live streaming the crawling and replay of web archives removes some of the mystery and makes the process transparent to third parties. The gaming-focused live streams will have gameplay that is influenced by the web archiving and replay performance from the web archiving live stream. So far, we have applied the gaming concept of speedruns to web archiving and integrated a few video games with an automated web archiving live stream. We recorded a demo that starts with a web archiving speedrun in which we gave a set of seed URIs to Brozzler and Browsertrix Crawler to see which crawler would finish archiving the set first. Then we used Selenium to apply the crawler performance results (speed) to character traits in the Gun Mayhem 2 More Mayhem video game. A viewer could then watch the in-game characters battle to determine the top crawler.

The Triad COVID-19 Collection

Jessica Dame

University of North Carolina at Greensboro

The Martha Blakeney Hodges Special Collections and University Archives at the University of North Carolina at Greensboro is web archiving COVID-19 content in the Piedmont Triad. The Triad COVID-19 Collection aims to capture how the Triad community is using and experiencing the web through the global pandemic. Web archiving began in May 2020 and as of April 2022 includes 152 unique pieces of web content. The scope of the collection includes websites, born-digital documents, and videos created by county government, regional hospitals, K-12 schools, universities, non-profit organizations, community landmarks, and community initiatives. Content captured includes information about the spread of infection, regional containment efforts, modified services and closures, and mask projects. Using a rapid response method to archive web content, the Triad COVID-19 Collection includes unique pieces of crawled web content that will inform future research and histories of the Piedmont Triad.