LinkGate for Web Archive Visualization

Youssef Eldakar1, Mohammed Elfarargy1, Ben O'Brien2

1Bibliotheca Alexandrina, 2National Library of New Zealand

In all domains of science, visualization is essential for deriving meaning from data. In web archiving, data is linked data that may be visualized as a graph with web resources as nodes and outlinks as edges. As an IIPC collaborative project, the Bibliotheca Alexandrina and the National Library of New Zealand are working together to develop the core functionality of a scalable link visualization environment and document research use cases within the domain of web archiving for future development.

While tools such as Gephi exist for visualizing linked data, they cannot operate on data that exceeds the typical capacity of a standalone computing device. The link visualization environment being developed operates on data kept in a remote data store, enabling it to scale up to the magnitude of a web archive with tens of billions of web resources. With the data store as a separate component accessed through the network, the data may be expanded at scale on server or cloud infrastructure and may also be updated live as the web archive grows. As such, the environment may be thought of as Google Maps for web archives.

The project is broken down into three components: Link Service ('link-serv'), where linked data is stored; Link Indexer ('link-indexer'), for extracting outlink information from the web archive and inserting it into the Link Service; and Link Visualizer ('link-viz'), for rendering and navigating linked data retrieved through the Link Service.
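
As an illustrative sketch of link-indexer's role, the following uses only the Python standard library to extract outlink edges from a page (the URLs and HTML are hypothetical; the actual tool operates on web archive data and sends results to link-serv over the network):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class OutlinkExtractor(HTMLParser):
    """Collect absolute outlink URLs from the anchor tags of an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.outlinks.append(urljoin(self.base_url, value))

def extract_outlinks(base_url, html):
    parser = OutlinkExtractor(base_url)
    parser.feed(html)
    return parser.outlinks

# Each (page, outlink) pair becomes a directed edge in the link graph.
edges = extract_outlinks(
    "https://example.org/",
    '<a href="/about">About</a> <a href="https://iipc.github.io/">IIPC</a>',
)
# → ["https://example.org/about", "https://iipc.github.io/"]
```

In this model, each page is a node and each extracted (page, outlink) pair is an edge inserted into the Link Service.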

Within the project, the Bibliotheca Alexandrina leads design and development, and the National Library of New Zealand facilitates collecting feedback from IIPC institutions and researchers through the IIPC Research Working Group. Source code and the inventory of use cases will be published on GitHub under the GNU General Public License, version 3. An instance of the visualization environment will be deployed on Bibliotheca Alexandrina's infrastructure for demonstration purposes.

In this presentation, four months into development, we offer an introduction to the tool being developed and report on progress. With the tool intended for researchers exploring web archives, we also seek to engage the community and gather feedback and ideas for features that would make a web archive visualization tool most effective, as a basis for future development.

Making Web Collections for Research Sustainable & Reusable: Possibilities and Challenges Experienced

Eld Zierau, Per Møldrup-Dalum

Royal Danish Library

This presentation concerns the lessons learned from the work of ensuring the persistency of web corpora (web collections) for later reuse and result verification.

The experiences were gained while finalizing the project “Probing a Nation's Web Domain” (presented at IIPC 2016), which looked at changes in the Danish web archive (Netarkivet) over time.

Persistency of the web collections is essential for future use of the project results. The only way to reconstruct or create new statistical results on the basis of the corpora would be to run the extraction tool developed in the project. However, the architecture around Netarkivet may change (requiring changes to the extraction tool) and data in Netarkivet may be enriched. In both cases, there will be uncertainty about whether a new extract will be the same as the original one, even within a short, three-year horizon.

Persistency of the web corpora was achieved by following the recommendations of the RESAW-2017 paper “Data Management of Web Archive Research Data”. This includes the use of Persistent Web Identifiers (PWIDs) for each corpus element (for Netarkivet, of the form pwid:netarkivet.dk:<UTC archiving time>:part:<archived URL>). In general, Netarkivet recommends the use of PWIDs for references to its materials. Previously, the recommendation was to refer to file/offset, but this broke when Netarkivet data was migrated to compressed files. Furthermore, we foresee that traditional archive URLs will break soon, since changes are planned for the access tools.
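
As a sketch of how such an identifier might be assembled (the exact timestamp serialization here is an assumption; the PWID specification defines the authoritative form):

```python
from datetime import datetime, timezone

def make_pwid(archive_id, archived_url, archive_time):
    """Build a PWID of the form pwid:<archive>:<UTC archiving time>:part:<archived URL>."""
    utc_time = archive_time.astimezone(timezone.utc)
    return "pwid:{}:{}:part:{}".format(
        archive_id, utc_time.strftime("%Y-%m-%dT%H:%M:%SZ"), archived_url
    )

pwid = make_pwid(
    "netarkivet.dk",
    "http://example.dk/page.html",
    datetime(2016, 1, 3, 12, 10, 22, tzinfo=timezone.utc),
)
# → "pwid:netarkivet.dk:2016-01-03T12:10:22Z:part:http://example.dk/page.html"
```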

At first, we thought that generation of the web corpora specifications would be easy, since the extraction tool had created metadata for each web element in the corpora, including archived URL and time. However, a number of challenges arose when we went into the details:

- The timestamp in the final metadata for analysis was the crawl time and not the WARC time (archive time), which is the one used in PWIDs and in the Solr index for the access tools. A temporary result with the WARC time was recovered, although the time was stored using the CET time zone rather than UTC. Transformation between time zones is, however, trivial.

- Another issue with the recovered data was that the Solr index does not contain sufficient information about de-duplication to make PWIDs for de-duplicated elements. Complicated merges and joins had been performed on the Solr data to solve the problem, but only for pre-2011 data. So, even though this was only a matter of programming and compute time, more than the available time was needed; this will be completed in 2020. This is also another example of why Netarkivet would benefit from shifting to the use of revisit records.

Based on the persistent web corpora, the exact same metadata for the corpora can be reproduced for as long as Netarkivet exists.
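
The time-zone transformation mentioned above is indeed trivial; a minimal Python sketch with an illustrative timestamp:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def cet_to_utc(naive_local):
    """Attach the Danish time zone (CET/CEST) to a naive timestamp and convert to UTC."""
    return naive_local.replace(tzinfo=ZoneInfo("Europe/Copenhagen")).astimezone(timezone.utc)

# 3 January falls in winter time, so Denmark is at UTC+1.
warc_time = cet_to_utc(datetime(2016, 1, 3, 13, 10, 22))
# → 2016-01-03 12:10:22+00:00
```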

Netarkivet will also work on extensions to the SolrWayback tool to better support web collections defined by PWIDs, by offering search and rendering of web parts limited to a specified web collection.

"Please, Come to our <online> Office Hours!": Lessons Learned from Web Archiving Training at the Library of Congress

Abbie Grotke, Brenda Ford, Lauren Baker

Library of Congress

The Library of Congress Web Archiving Team (WAT) staff serve as program managers and technical team, as well as the primary caretakers of the Library’s custom-built, in-house curatorial tool, the Digiboard. Selection and curation of the Library’s web archive collections is carried out by subject experts throughout the Library, as part of other “Recommending Officer” duties related to selecting content for the Library’s collections.

The Web Archiving Team is responsible for training and supporting staff using the Digiboard tool, and for web archiving in general. Through 2018, this had been accomplished primarily by one or two of the WAT members sharing online training documents, meeting or corresponding with individual staff who needed hands-on help, or giving briefings to small groups of staff as requested.

Since 2018, due to increased awareness of web archiving and in part to mandates outlined in the Library’s Digital Collecting Plan (https://www.loc.gov/acq/devpol/CollectingDigitalContent.pdf), the Web Archiving Team has faced a significant increase in staff around the Library interested in and participating in web archiving projects. Frequent requests for Digiboard training for new staff, plus an uptick in requests for hands-on refreshers, helped kick off a number of exploratory efforts to better support staff with limited team resources. And in 2020, as staff shifted to entirely remote work, we continued to see a dramatic increase in activity by staff looking for tasks suitable for telework. For context, in 2020 the WAT helped manage over 75 active event and thematic collections. Digiboard has over 230 active user accounts (within three distinct roles: Nominator, Reviewer, and Administrator). Over 100 of those active user accounts have active records (i.e., content that we are actively crawling).

Due to the increased demand for training, in late 2018 and early 2019 the WAT began developing more formal training courses and exploring options for ongoing support for those building collections. A training survey (with over 70 responses) was launched to inform training development and a report shared with key stakeholders. As a result, the following new training initiatives were implemented and tested:

  • Quarterly “Getting Started with Web Archiving” classroom training to introduce the basic concepts of web archiving and a basic introduction to using the Digiboard tool;
  • Project kick-off meetings as new collections were started, to provide more in-depth training;
  • Monthly Office Hours to help staff with existing collections and web archive records who need refresher training;
  • Additional briefings and overview sessions on specific topics (such as searching and using web archives, for reference staff);
  • Web Archiving Interest Group meetings open to any staff working with or interested in web archives.

This presentation will highlight the approaches we took, outline in more detail the audiences targeted and the topics our training focuses on, and describe how we improved our internal training guides in 2020, evolved as we moved to full-time telework and shifted training online, and added a mandatory annual review training for collection leaders. Lessons learned will be shared, and we’ll outline work we’re doing in 2021 to rethink our getting-started materials to better serve staff in need of the basics. Preliminary plans for sharing and integrating IIPC training materials into the Library’s training program will also be discussed, as will tips for those charged with training in other organizations.

Teaching Web Archiving to New and Established Information Professionals

Samantha Abrams

Ivy Plus Libraries Confederation

What does it look like to teach web archiving to current — and the next generation of — information professionals? A practitioner at heart, Samantha Abrams began teaching web archiving through the University of Wisconsin-Madison’s iSchool in 2018, creating two new courses from scratch: one for Master’s students and one for those already working in the field. Delivered online, her courses covered everything from web archiving terminology, to how to use Archive-It and Webrecorder, to how to write collection policies. But when you’re given only five weeks to teach a still-emerging subject to those with no prior experience of it, how do you shape a comprehensive course, and, ultimately, what do you leave out? In this twenty-minute presentation, Abrams will discuss everything from her conception of the course, to the creation of her syllabi, to what it’s like to teach web archiving at a distance.

Thus far, Abrams has focused her teaching on creating a foundation. In both courses, her students select a topic of their choosing — institutional or thematic in nature — and use it to build their own collection. In five (or six, depending on the course) weeks, students write their own collection policies, run their own crawls in Archive-It and Webrecorder, create and refine metadata, and ultimately answer questions related to their larger role in web archiving. After completing the course, where do they see themselves fitting into the larger world? Can they visualize — and advocate for — a web archiving program at their institution, or would they feel comfortable doing so in the future? In Montreal, Abrams will share what she’s learned from the structure of her courses, and the feedback she’s received from her students. She’ll also discuss what she envisions for new courses, and what’s worked — and what hasn’t — in her current courses. She’ll also share parts of her syllabi, and discuss what she thinks is the ultimate goal of teaching web archiving to new practitioners: enlarging the field, and providing more individuals with the ability to advocate for, and carry out, web archiving at their institutions. How can teaching new practitioners — especially students, still in library school — widen and enrich a growing field?

A transnational and cross-lingual crawl of the European Parliamentary Elections 2019

Ivo Branco, Ricardo Basílio, Daniel Gomes


The European Parliamentary Elections are an event of international relevance. The strategy adopted to preserve the World Wide Web has been to delegate to national institutions the responsibility of selecting and preserving information relevant to their host countries. However, the preservation of web pages that document transnational events is not officially assigned. Arquivo.pt, the Portuguese web archive, is a research infrastructure that preserves historical web content. Arquivo.pt permanently selects, archives and provides public access to its web collections. This communication presents an experiment conducted by the Arquivo.pt team aimed at preserving web content that documents the European Parliamentary Elections of 2019 by applying a combination of human and automatic selection processes.

In summary, we started by identifying relevant terms in Portuguese about the 2019 European Parliamentary Elections, automatically translated them into the 24 official languages of the European Union, reviewed the translations in collaboration with the Publications Office of the European Union, and then automatically queried a web search engine to obtain a total of 12,147 URLs to seed the crawls. The automation of the selection process enabled expanding the coverage of information about the event to multiple countries and languages without significantly increasing the amount of resources required. In parallel, we launched a collaborative list to gather contributions of relevant seeds from the international community. This collaborative initiative was disseminated through Portuguese and international channels, such as Arquivo.pt social media and IIPC mailing lists. We received 608 contributions from 16 countries. Slovakia and Portugal were the countries that suggested the highest number of seeds (114).

We iteratively ran 6 crawls using different configurations and crawling software (Heritrix 3, Brozzler and Browsertrix) to maximize the quality of the collected content. One crawl was executed before the elections and 5 afterwards. These crawls were performed between May and July of 2019 and resulted in the collection of 99 million URLs (4.8 TB). This web data was aggregated into one special collection and will become searchable and accessible through Arquivo.pt in July 2020. Note that this collection will also be available for automatic processing through the Arquivo.pt API. Collaborations with researchers interested in studying this web collection are welcome.

The web is a rich and enormous source of varied information for research. Selecting samples of relevant web data to be studied is therefore mandatory. The presented selection methodology combines human expertise with automation to maximize coverage of cross-lingual events while requiring a very limited amount of resources. Thus, we believe that it can easily be applied to select and generate highly relevant samples of web data about any kind of transnational event.
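
The selection methodology can be condensed into a short sketch; the `translate` and `search` functions below are stand-ins for the automatic translation service and the web search engine actually used:

```python
def build_seed_list(terms, languages, translate, search):
    """Translate each term into each target language, query a search engine,
    and return a de-duplicated, sorted list of seed URLs for crawling."""
    seeds = set()
    for lang in languages:
        for term in terms:
            seeds.update(search(translate(term, lang)))
    return sorted(seeds)

# Stub translator and search engine, for illustration only.
translate = lambda term, lang: f"{term}@{lang}"
search = lambda query: [f"https://example.{query.split('@')[1]}/news"]

seeds = build_seed_list(["eleições europeias 2019"], ["de", "fr", "de"], translate, search)
# → ["https://example.de/news", "https://example.fr/news"]
```

Collecting results into a set de-duplicates seeds that several queries return, which matters when translated terms across languages surface the same international news sites.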

Web archiving in a multilingual environment: an EU experience

Silvia Sevilla

Publications Office of the EU

The European Union (EU) is a political and economic union of 28 Member States, with 24 official languages. As a democratic international organisation, one of the EU's founding principles is multilingualism. This means, among other things, that the EU aims to communicate with its citizens in their own languages. Therefore, websites of the EU institutions are published in up to 24 synoptic language versions. Readers can easily switch between these languages without losing the context.

On behalf of the EU institutions, the Publications Office of the European Union (OP) creates an EU web archive. It aims to archive web content concerning the EU project to preserve it for the long term and to keep it accessible for the public. The archive covers the various websites of the EU institutions (European Commission, European Parliament, EU agencies, …). The majority of these are hosted on europa.eu, the domain that spans the entire institutional framework of EU powers and regulatory bodies.

It is the goal of the EU web archive to maintain the linguistic richness of the sites. The selected sites are harvested on a regular basis, to their full depth (which sometimes goes far beyond the “top 10 blue links”), with ad hoc crawls where necessary. The quality of the first crawling results is checked by OP’s web archiving team, in close cooperation with the website owners. Where needed and feasible, errors are patched. Once the quality is deemed sufficient, the archived sites are made available online to anyone, without access restrictions.

The aim is to maintain the full multilingual experience for anyone who consults the archive. The websites are archived in all language versions. This brings a number of challenges, both on the harvesting and on the access side. In this context, it should be highlighted that the EU websites are not mere translations, but parallel, consultable linguistic versions. Archiving a piece of web content in different languages poses few problems as such. Our aim, though, is to archive the sites in such a way that a user of the EU web archive can still seamlessly switch between languages.

The presentation will outline some of these challenges and describe strategies developed to respond to them. Besides the topic of multilingualism, it will also touch upon themes such as selection and curation, crawling and quality control strategies, metadata, description and access, and long-term preservation strategies.

After the presentation, there will be time for questions and discussion with the public. Other participants will hopefully learn from our experiences and be able to enrich the discussion by sharing their knowledge on the topic.

Contemporary Art Knowledge: a community-based approach to Web archiving

Hélène Brousseau

Artexte Information Centre

Artexte, a library specializing in contemporary Canadian art, has built a unique collection of over 30,000 art publications, including some 16,000 exhibition catalogues and more than 2,000 artists’ books and ‘zines, acquired mainly through donations from individuals or arts organizations originating from museums, galleries and artist-run centres across Canada and beyond.

Artexte archives and makes available documents from independent arts organizations that are not available in either public or academic libraries.

As publishing practices in the contemporary arts sector evolve, so have our methods of collecting them. This paper explores how Artexte developed a community-based approach to web archiving, in order to capture new forms of web publications that no longer have a printed counterpart. We illustrate how an approach rooted in education using Webrecorder not only empowers artists and artistic organizations to self-archive their web artworks, web publications and websites, but also allows our institution to fulfill its mandate to collect documents reflecting current artistic production for future access and use. Since 2018, Artexte’s librarians have trained over 100 artists and cultural workers who had no previous knowledge of web archiving.

With limited funding given to artist-run centres for printed catalogues of their exhibitions and activities, digital publications, often set up as stand-alone websites, have become common. Examples include https://www.avatarquebec.org/40000ans/en/, https://performance.gruntarchives.org/ and https://resilienceproject.ca/en/. However, the lack of funds and human resources necessary to maintain these websites results in partial or complete loss of accessibility within an average of five to seven years.

This paper will explore the complexity of the types of web documents being created by artists and arts organizations: documents including a wide variety of complex audiovisual components and interactive elements that are, more often than not, inadequately captured by automated web archiving strategies. We explain how the manual use of Webrecorder is essential to maintaining the integrity and authenticity of the web archive. Furthermore, since the process of manual web archiving is time-intensive, we will explore Artexte’s strategy of knowledge sharing through in-person and web-based digital literacy classes to teach the use of Webrecorder.

Finally, in the context of a small library institution dedicated to fostering critical approaches to creativity, exhibition, research and interpretation in the visual arts, this presentation explores the importance of adapting our own collection practices to include web archives as a new type of document available to researchers.

Building Community through Archives Unleashed Datathons: Lessons Learned

Ian Milligan1, Samantha Fritz1, Nick Ruest2, Jimmy Lin1

1University of Waterloo, 2York University

Since March 2016, the Archives Unleashed team has hosted a series of datathons to support skills training, facilitate scholarly access, and build community around web archives. These datathon events have brought together over a hundred participants from over 50 unique institutions, who, over the course of two to three days, formed interdisciplinary teams and were given access to data and computing infrastructure to develop a project around a web archive collection. Participants have included those who create web archives, those who create tools and platforms, and those who use them for research.

Our team offers the web archiving community a reflection on the lessons and insights from running these events, as considerations for groups developing and hosting similar programming.

At the close of our final datathon in April 2020, which was transitioned to an online event due to the COVID-19 pandemic, our team investigated the impacts that Archives Unleashed datathon events have had on community building and engagement within the web archiving field.

To do so, we conducted interviews with datathon participants to discuss how the datathons have impacted their professional practices and the broader web archiving community. Drawing from and adapting two leading community engagement models, we introduce an approach for building community and engaging users in an open-source digital humanities project. In presenting our model, we illustrate the activities undertaken by our project and their related impact on the field; the model can be broadly applied to other digital humanities projects seeking to engage their communities.

The Whole Earth Web Archive

Jefferson Bailey

Internet Archive

The Whole Earth Web Archive (WEWA) is a proof-of-concept to explore ways to improve access to the archived websites of underrepresented nations around the world. Starting with a sample set of 50 small nations and extracting their archived web content from the Internet Archive’s (IA) total web archive, the project team built special search and access features on top of this subcollection and created a dedicated discovery portal for searching and browsing at https://webservices.archive.org/wewa. Archived materials from the web play an increasingly necessary role in representation, evidence, historical documentation, and accountability. However, the web’s scale is vast, its content changes and disappears quickly, and preserving the web requires significant infrastructure and expertise to collect content and make it permanently accessible. Thus, the community of institutions preserving the web remains overwhelmingly dominated by well-resourced institutions from Europe and North America. The WEWA project explores how to provide enhanced access to archived material otherwise hard to find and browse in the massive 25+ petabytes of IA’s web archive, and aims to provoke a broader reflection on the lack of national diversity among institutions collecting the web, in order to spur collective action towards better inclusion of all nations and peoples in the overall global web archive. This talk will discuss the project’s conceptual and technical work, including prior projects and technical work to provide improved access to specific, valuable subsets of the overall global web archive behind the Wayback Machine. This includes an overview of internally developed content extraction and search tools as well as a discussion of further work focusing on improving IA’s harvesting of the national webs of these and other underrepresented countries.
Lastly, the talk will outline ideas for advancing collaborations with libraries and heritage organizations within underrepresented countries, and via international organizations, to contribute technical capacity and open infrastructure to local experts in countries without formal web archiving programs who can identify websites of value that document the lives and activities of their citizens.

Side fûn: mapping the Frisian web domain in the Netherlands

Kees Teszelszky

Koninklijke Bibliotheek - National Library of the Netherlands

How to conduct a domain crawl without a legal deposit? Koninklijke Bibliotheek – National Library of the Netherlands has been mapping the Dutch national web for a year by focusing on a domain within it: the Frisian web. In 2014 the Frisian top-level domain .frl was introduced in the Netherlands. Fryslân (Friesland) is a region of the Netherlands with its own minority language and culture. Around 354,000 people have Frisian as their native language; it is the second official language of the Netherlands. 15,000 URLs have been registered with this domain extension. Many of these websites have Frisian-language content on them.

I will present the preliminary results and the research potential of mapping, harvesting and creating a web data set out of the Frisian web domain starting with the .frl TLD. KB-NL has collected born digital material from the web since 2007 through web archiving. It makes a selection of websites with cultural and academic content from the Dutch national web. Most of the sites were harvested because of their value as cultural heritage of the Netherlands, including the Frisian heritage and culture. KB-NL works together with Tresoar, the repository of the history of Fryslân, and the Fryske Akademy (Frisian Academy of Science) to select and harvest born digital content of the Frisian web for its web archive.

I will describe the methods and experience with mapping, selecting and harvesting websites of the Frisian web domain. I will also discuss the characteristics of web materials and archived web materials and will explain the use of these various materials (harvested websites, link clouds, context information) for future research. A harvest of the Frisian web domain will provide future researchers with a unique born-digital data set of a minority language which can be combined with other similar data sets of the Frisian language.

Brazilian Elections Web Archive

Moisés Rockembach

Federal University of Rio Grande do Sul / Research Group of Web Archiving and Digital Preservation

The web is the main communication and information environment for election campaigns today, worldwide. Even so, we do not preserve the information that is produced in this digital medium. According to Ntoulas, Cho and Olston (2004), in the web environment approximately 80% of hyperlinks disappear or change within a year. Our research object, web campaigns, is an important tool for communicating with voters. This strategy involves a tension between decentralization and centralization of information: campaigns want to inform as many potential voters as possible, yet they want to control disassociated information about the candidate, as mentioned by Foot and Schneider (2006). The research aimed to address the life cycle model of web archiving (Bragg, Hanna, 2013) and the systematic approach of web archiving (Khan, Rahman, 2019), as well as its technological aspects and information curation. With a theoretical and applied approach, we identified examples and international case studies covering various contexts, especially those concerning the web archiving of national elections. To this end, we preserved the 2018 Brazilian presidential campaigns, from the official websites and campaign websites of the 13 presidential candidates. Methodologically, the research used mixed methods (Creswell, Clark, 2017) and was developed in four phases: literature review, infrastructure configuration and software testing, data harvesting, and quantitative and qualitative analysis. The infrastructure configuration was based on the study, testing, choice and use of a digital platform (Heritrix) to collect, store and make available archived content, based on open-source software, identifying best practices and available technologies in the international community.
In previous research (Rockembach, 2018), no Brazilian web archiving initiative was found in the literature review, nor specialized websites in the country, with only some Brazilian content archived in an unsystematic way. For example, some preserved websites, such as those from the 2010 Brazilian elections (Library of Congress, 2010), may serve as a relevant source of information on past elections. In addition, the triangulation of quantitative and qualitative data provided a better understanding of the web information phenomenon as digital memory. We also compared our web archive with the live websites one year after publication and found that 85% of the websites did not remain the same, having either modified content or no online page access.

We conclude that the research results have the potential to be used in Brazilian web archiving projects and as a research source for the 2018 elections. Finally, the web archives of this research will be made available soon and integrated into another project, also produced by our research group, that resulted in a web archive of videos published in the presidential campaigns.

Readying Web Archives to Consume and Leverage Web Bundles

Sawood Alam1, Michele Weigle2, Michael Nelson2, Martin Klein3, Herbert Van De Sompel4

1Internet Archive, 2Old Dominion University, 3Los Alamos National Laboratory, 4Data Archiving Networked Services

Web Bundles, also known as Web Packaging, is an emerging Web standard to enable offline sharing and access to websites or a set of webpages along with their page requisites, in the form of a single bundle file. When a bundle file is loaded from the local file system or a remote server, the browser treats individual bundled resources as if they were downloaded from their respective origins (e.g., nytimes.com) and not the bundle distributor (e.g., google.com). A Web Bundle is a set of HTTP Exchanges (i.e., Request and Response HTTP Messages) encoded in the Concise Binary Object Representation (CBOR) format along with some metadata, manifest, and an entrypoint URI. These HTTP exchanges can optionally be signed by their respective origins, in which case signatures will be deemed valid for a short period of time (seven days as per the current specification). Signed HTTP Exchanges give user agents confidence that the resources indeed came from the said origin and were not altered, irrespective of from where and how the bundle was transported.

Realizing the potential of Web Packaging for Web Archiving (e.g., nytimes.com pages served from archive.org and not google.com), we presented a position paper at the IETF ESCAPE Workshop in July 2019. We described various challenges Web archives face in coherently and completely archiving and replaying Web pages and how Web packaging could help address some of those issues. We also proposed some changes in the current specifications to better accommodate Web archiving use cases. With the public announcement of experimental support of Web Bundles in Google Chrome 80, it is likely that the technology will be adopted by the Web community at a significant scale. It is important for the Web archiving community to prepare for dealing with Web Bundles as a new type of web resource as well as for leveraging the technology to solve some existing archival issues. We want to spur conversation around the following topics:

* Enabling discovery, content negotiation, and ingestion of Web Bundles for archival purposes, as an approach preferred over collecting atomic resources, because bundles support more coherent and complete crawling without the need for error-prone link extraction from JavaScript files or slow page execution in headless browsers
* Decomposing bundled HTTP Exchanges for efficient storage and deduplication in WARC files, IPFS, or other storage systems
* Indexing both bundled and atomic ingestions with precomputed hierarchical dependency of page requisites for a deterministic and coherent replay of composite mementos
* Dynamically generating bundles for composite mementos at replay with the help of the resource dependency graph from the index and decomposed exchanges from the storage for clients that prefer Web Bundles
* Utilizing Portals instead of iframes for seamless transition to bundled composite memento replay from the search and banner interface and back
* Exploring possibilities of establishing technical means to enable fixity verification and non-repudiation for a web archiving time scale by utilizing Signed HTTP Exchanges, Timestamping Services, Certificate Transparency Logs, Cross-archive Manifest Exchange, and other resources

Analyzing WARC on Serverless Computing

Yinlin Chen

Virginia Tech

Since 2015, Virginia Tech Libraries has used Archive-It to preserve Virginia Tech's official web presence plus selected project websites created and administered by Virginia Tech faculty, students, and staff. The crawl archive contains over a million web pages, or 8 TiB of uncompressed content, in the WARC (Web ARChive) file format. To better understand the crawled content, as well as open research datasets such as Common Crawl, we set up an on-premise Hadoop cluster with software installed to host, process, analyze, and visualize datasets.

Our usage of this cluster is ad hoc: it is typically used when a newly crawled dataset is ready or an open research dataset becomes available for download. Maintaining such a cluster is a heavy burden, because it requires a dedicated system administrator to keep it up 24/7 with all patches, security updates, and software versions up to date. Moreover, hardware wears out and needs to be replaced. Each of these actions costs money and, above all, cumbersome manual labor. Although cloud computing is a promising place to host such a cluster, the server maintenance workload is not necessarily reduced and could even increase under the same traditional workflow. Running a set of always-on cloud instances for ad hoc jobs can be costly and wasteful, especially when the cluster is idle and processing no requests. Ideally, we want a platform that is always available to accept requests but only consumes resources while processing tasks, scales itself as needed, and cleans up resources automatically when tasks complete. Such a platform can achieve near 100% resource utilization, which led us to a serverless computing approach. To achieve this goal, we designed a resilient, scalable, and cost-effective serverless platform using AWS cloud-native services (AWS Lambda, Batch, ECS, etc.). This serverless WARC processing platform costs nothing while idle and scales itself as needed when processing large amounts of data. As a proof of concept to test scalability, cost, and performance, we used selected Common Crawl data stored in AWS S3, extracted the content, created derivatives, and measured the time spent and the cost. Our analysis shows that the platform processed GBs of uncompressed data in just minutes. Furthermore, we now have total control over precisely how resources are used and can optimize them further.
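As a stdlib-only sketch of the kind of per-object task such a Lambda function might run, the following counts WARC records by type from a (possibly gzipped) WARC payload. The handler name and event shape are hypothetical; production code would use a library such as warcio and stream the object from S3 rather than receiving it inline:

```python
import gzip
import io

def count_warc_records(raw: bytes) -> dict:
    """Count WARC records by WARC-Type from raw WARC bytes.

    A simplification: this scans all lines for WARC-Type headers, so a
    WARC whose payloads themselves contain such lines would be miscounted.
    """
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    counts = {}
    for line in io.BytesIO(raw):
        if line.lower().startswith(b"warc-type:"):
            rtype = line.split(b":", 1)[1].strip().decode()
            counts[rtype] = counts.get(rtype, 0) + 1
    return counts

def handler(event, context=None):
    """Hypothetical Lambda entry point: the event carries the WARC bytes."""
    return count_warc_records(event["body"])
```

Because each invocation handles one object and exits, cost accrues only while records are actually being processed, which is the property that motivated the serverless design.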

This submission aims to present our serverless architecture design and implementation, elaborate on the technical solution for integrating multiple AWS services with other techniques, and describe our streamlined and scalable approach to analyzing large WARC datasets. Our platform eliminates the need to manage underlying servers and delegates all the heavy lifting to AWS, letting us focus on implementing our business logic as microservices. We want to share our experience and humbly hope this work can open a new direction for institutions, libraries, and scholars analyzing web archives.

The Black Hole of Quality Control: Toward a Framework for Managing QC Effort to Ensure Value

Patricia Klambauer, Tom J. Smyth

Library and Archives Canada / Bibliothèque et Archives Canada

As web archiving practitioners know, quality control of harvested websites can be the bane of our existence! Most can identify with having spent entire days on quality control (QC) for a single high-value (or not) web resource, justifying it for the greater good of preservation and future access. When in the throes of QC, it is hard for practitioners to know when to let go and say “good enough”. Too often the QC process is characterised by an absence of standardisation, undefined outcomes and uncontrolled staff costs, to the detriment of web archive collections and program sustainability.

This presentation will take a new approach at QC for web archives by asking:

• How do we decide to limit or expand the amount of QC we undertake, where warranted?
• How do we determine which QC actions should or should not be performed based on a resource’s heritage value?
• How do we actively manage QC effort and direct it to the most important resources within the context of a curated collection?
• How can this thinking help organizations that may not have many staff resources for QC?

This presentation will convey the lessons learned from the construction of a quality assurance and control framework at Library and Archives Canada. Key tools and techniques will be conveyed and discussed, including: initial scoping of web collections to define QC levels by topic, relative value, and risk; approaching QC as a “scrum” project; graphing QC technical complexity to understand effort vs value with impact on resource allotment; incorporation and progression from ISO/TR 14873:2013 (Statistics and quality issues for web archiving) with impact on our stats and reporting; and creation of web archival finding aids to describe the end product that allows managing researcher expectations on QC.

It is hoped this discussion will help other web archiving practitioners avoid the black hole of QC, and get “control” of Quality Control.

Detecting quality problems in archived websites using image similarity

Brenda Reyes Ayala, James Sun, Jennifer McDevitt, Xiaohui Liu

University of Alberta

A high-quality archived website should be an accurate representation of the original website in content, form, and appearance: it looks and behaves exactly like the original. Unfortunately, it is common to see archived websites with no images or media, or with broken links. In order to detect these quality problems, web archivists must engage in an onerous process of quality assurance (QA), where they manually inspect hundreds or even thousands of archived websites. When web archiving is done by national libraries seeking to capture and preserve their national domain, quality problems grow to such a scale that human intervention is no longer enough to detect and fix them. There is a clear need to ease the workload of web archivists by detecting quality problems in an automated or semi-automated way.

One of the primary ways web archivists have of judging the quality of an archived website is by comparing its appearance to that of the original website. Our work examines how the visual quality of an archived website can be measured using popular image similarity measures. We are interested in answering the following research question: How effective are different similarity measures at measuring the visual correspondence between an archived website and its live counterpart?

We chose three different web archives in order to apply the similarity metrics, two from the University of Alberta and one from the British Library’s UK Web Archives: The "Idle No More" collection, the Western Canadian Arts collection, and the UK Web Archives Open Access (OA) collection. “Idle No More” is a topical web archive created by the University of Alberta to preserve websites related to “Idle No More”, a Canadian political movement encompassing environmental concerns and the rights of indigenous communities. The Western Canadian Arts collection preserves the born digital resources created by filmmakers in Western Canada. The British Library’s OA web archive is a more general collection encompassing UK websites that can be made available online according to British legal deposit laws.

In order to measure the visual correspondence of an archived website to its live counterpart, we created a set of tools called “wa screenshot compare”. Written in Python, these tools take a seedlist as input and generate screenshots of the live websites and their archived counterparts using Pyppeteer and a headless instance of the Chrome browser. Three image similarity metrics are then deployed to calculate the differences between the screenshots: the Structural Similarity Index (SSIM), the Mean Squared Error (MSE), and vector distance. Furthermore, we recruited human collaborators to manually judge the quality of archived websites and compared their judgements to the results produced by our tools. We found that all three similarity measures, but especially SSIM, were able to distinguish between archived websites of high quality (those that are almost exactly like their live counterparts), archived websites of medium quality (those that retain most of the intellectual content of the original but are missing images or stylesheets), and archived websites of low quality (those that are missing both content and styling elements).
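The simplest of the three metrics, MSE, can be sketched in pure Python as below, together with a hypothetical three-way quality labelling. The thresholds are invented for illustration only; in practice they would have to be calibrated against human judgements, and a library such as scikit-image would be used for SSIM:

```python
def mse(img_a, img_b):
    """Mean Squared Error between two equally sized grayscale screenshots,
    given as nested lists of pixel intensities (0-255). 0 means identical."""
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have the same dimensions")
    total, n = 0, 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            n += 1
    return total / n

def quality_label(score, hi=100, lo=2000):
    """Map an MSE score to a quality band (thresholds are illustrative)."""
    if score <= hi:
        return "high"
    return "medium" if score <= lo else "low"
```

An archived page whose screenshot matches the live page pixel for pixel scores an MSE of 0 ("high"), while a capture missing large visual regions yields a large error ("low").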

Improving the quality of web harvests using Web Curator Tool

Jeffrey Van Der Hoeven1, Ben O'Brien2, Hanna Koppelaar1, Trienka Rohrbach1, Andrea Goethals2, Steve Knight2

1National Library of the Netherlands, 2National Library of New Zealand

As online presence is indispensable in our volatile world, web archives are an invaluable source of factual information about what was online at a certain moment. Because online content is short-lived, it is important to achieve the highest capture quality when a site is crawled, or in the days thereafter, before the content has changed or disappeared. Work on the Web Curator Tool (WCT) is therefore currently focused on the area of quality management.

The WCT is an open source workflow management tool for selecting and crawling websites, performing QA, and preparing websites for ingest. Through close collaboration between the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KBNL), the WCT has already undergone several important uplifts in the past two years. Version 2 added support for Heritrix 3 and improved the project documentation, adding new tutorials and installation and administration guides. Version 3 addressed a large volume of technical debt, in which the underpinning frameworks were upgraded, creating a stable foundation to take the development of WCT forward. With that work done, the project team began to focus on functional enhancements desired not only by the KBNL and NLNZ, but also by the wider IIPC community.

Through tutorials and workshops at recent IIPC conferences, the project team demonstrated the new versions of WCT and asked for feedback on the features other organisations would like to see incorporated into the WCT. The Hungarian National Library in particular contributed many enhancement ideas, several related to improving the WCT’s quality control features. These dovetailed nicely with the WCT QA improvements already planned for the next release.

Increasing and improving the quality review capability within Version 4 of the WCT will be achieved through enhancements in three core areas (the second two were requested by the Hungarian National Library):

1. Crawl patching using Webrecorder
Integrate Webrecorder into the WCT QA workflow by making use of its new “patch” capability. This will add the ability to repair missing content in addition to the existing WCT import and prune functionality. The integration will transfer newly patched content back into the WCT, incorporating it into the original web harvest.

2. Screenshot generation.
The WCT previously contained limited functionality to capture screenshots of a web harvest. Realising the potential QA benefit, we are enhancing this tool to capture screenshots of live websites being crawled and the resulting web harvest for comparison. The integration of the screenshot software will be configurable, allowing for the use of 3rd party tools. We also intend to leverage the advancements in screenshot comparison metrics within the web archiving community.

3. Integration with Pywb viewer.
The WCT already provides a WARC viewer and OpenWayback integration to browse harvests, but recently Pywb has become the benchmark for web archive replay. This integration will provide WCT users with the best options available for web harvest replay and review.

The KBNL is currently performing a retrospective assessment of their web archives using the WCT that will identify additional QA features to include.

Memento Tracer - An Innovative Approach Towards Balancing Web Archiving at Scale and Quality

Martin Klein1, Herbert Van De Sompel2

1Los Alamos National Laboratory, 2Data Archiving Networked Services

Current web archiving approaches either excel at capturing at scale or at high quality. Despite various attempts [1], crawling frameworks that combine both criteria remain elusive. For example, the Internet Archive’s crawler is optimized for scale, enabling an archive of 784 billion web resources [2]. However, capture quality may vary and is often hindered by dynamic elements and interactive web features [3]. The Webrecorder tool [4], on the other hand, provides high-fidelity captures by recording all elements of a web resource that a user interacts with. However, since it operates at human scale, it lacks the ability to archive resources at the scale of the web.

As part of the “Scholarly Orphans” project we devised Memento Tracer [5], a novel web archiving framework that aims at striking a balance between operating at web scale and providing high-quality captures. Memento Tracer consists of three main components:

1) A web browser extension: A human web curator navigates to a web page, for example a SlideShare presentation, and activates the extension. The curator’s interactions with the web resource (e.g., advancing the slides or following links in the comments section) indicate the components of the web resource she seeks to archive. The extension creates a trace by recording all interactions in terms that uniquely identify the page’s elements being interacted with, e.g., by means of their class ID or XPath. A trace created for a web resource representative of a class of resources is sufficiently abstract that it can be applied across all resources of that class, for example to all slide decks on SlideShare.

2) A shared repository: Traces can be shared with the community via a publicly accessible repository, thereby crowdsourcing a web curator task. The shared repository allows for reuse and refinement of existing traces as well as multiple versions of traces for the same class of resources created by different users since curators may disagree on the essence of an artifact [6].

3) A headless browser capture setup: To generate web captures, the Memento Tracer framework assumes a setup consisting of a WebDriver (e.g., Selenium [7]) that allows automating actions of a headless browser (e.g., PhantomJS [8]) combined with a capturing tool (e.g., WarcProxy [9]) to write resources to WARC files. This fully automated capture setup invokes a trace to guide the capturing of a web resource and its components.
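The interplay of a recorded trace and the automated capture setup can be sketched as follows. The trace format, selectors, and driver interface are hypothetical simplifications of what Memento Tracer actually records; a real deployment would drive a WebDriver session while a capturing proxy writes WARCs:

```python
import json

# A hypothetical trace: an ordered list of interactions recorded against
# stable element selectors (XPath here), abstract enough to reuse across
# all resources of the same class (e.g., any SlideShare deck).
TRACE = json.loads("""
{
  "resource_class": "slideshare-presentation",
  "actions": [
    {"op": "click", "selector": "//button[@class='next-slide']",
     "repeat_until_absent": true},
    {"op": "click", "selector": "//a[@class='comment-link']",
     "repeat_until_absent": false}
  ]
}
""")

def replay_trace(trace, driver):
    """Apply a trace using any driver exposing find(selector) -> element
    or None, and click(element); a Selenium WebDriver would fit this shape.
    Returns the number of interactions performed."""
    performed = 0
    for action in trace["actions"]:
        while True:
            el = driver.find(action["selector"])
            if el is None:
                break
            driver.click(el)
            performed += 1
            if not action["repeat_until_absent"]:
                break
    return performed

class FakeDriver:
    """Stand-in for a headless browser session, for demonstration only."""
    def __init__(self, remaining):
        self.remaining = dict(remaining)  # selector -> clicks still possible
    def find(self, selector):
        return selector if self.remaining.get(selector, 0) > 0 else None
    def click(self, element):
        self.remaining[element] -= 1
```

Running the trace against a page with three remaining slides and one comment link would perform four interactions before the capture completes, without any human in the loop.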

With Memento Tracer a curator defines the essence of a web resource. Traces based on interactions with a resource can be shared with the community and are used to guide an automated crawling framework to generate high-quality captures. With this functionality, Memento Tracer has the potential for a true paradigm shift in web archiving. However, challenges remain, such as the standardization of the language used to express traces and addressing limitations of browser event listeners for recording traces.

[1] https://github.com/N0taN3rd/Squidwarc
[2] https://twitter.com/brewster_kahle/status/1170820482104348672
[3] https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
[4] https://webrecorder.io/
[5] http://tracer.mementoweb.org/
[6] https://doi.org/10.1109/IRI.2018.00026
[7] https://www.seleniumhq.org/
[8] http://phantomjs.org/
[9] https://github.com/internetarchive/warcprox

Not Gone in a Flash! Developing a Flash-capable remote browser emulation system

Ilya Kreymer, Humbert Hardy

1WebRecorder, 2National Film Board of Canada

In 2015, oldweb.today demonstrated the use of emulated browsers to present web archives from the early days of the web in contemporaneous environments. Preserving browsers and making them accessible remains just as important today, with many web technologies, particularly Flash, becoming obsolete. Since native browsers no longer support Flash, the only way to present Flash projects is in a remote emulated browser. Such a system launches pre-configured web browsers on demand and streams the desktop and audio from the remote browser to the user's own browser in real time. This system allows for presenting obsolete technologies, such as Flash and even Java applets, either from the live web or from web archives. We will present an updated iteration of this system, developed in collaboration between Webrecorder and the National Film Board of Canada, to preserve and present highly interactive, Flash-based productions and make them accessible using remote browsers.

Many Flash projects involve synchronized audio and video content, which ideally should be presented with low latency for accurate replay. The presentation will cover the technical challenges of archiving and providing access to Flash content, focusing on the architecture of the remote browser system. We will introduce technologies such as Docker containers, GStreamer, VNC, and WebRTC. We will cover several possible configurations for remote browser video and audio: VNC + audio over WebSocket, VNC + audio via WebRTC, and full video + audio over WebRTC, along with the tradeoffs between them. The topics will include various challenges and lessons learned, such as ensuring audio and video are in sync, and configuration options available in the system to work for a variety of users. We will also cover the integration of this system with other tools, such as Python Wayback (pywb) and Webrecorder, briefly cover how new browser configurations can be added, and demonstrate how new versions of Chrome, Firefox, and other browsers can be added.

We hope this technical presentation will be helpful to other institutions that have similar use cases and wish to deploy browser emulation to make their high-fidelity Flash-based content widely accessible into the future.

WASAPIfying Private Web Archiving Tools for Persistence and Collaboration

Mat Kelly

Drexel University

WASAPI is a framework originally funded by the US Institute of Museum and Library Services (IMLS) for transmitting WARCs using a standard API. While only a few web archiving services like Archive-It [1] and Webrecorder [2] have implemented the server-side component of the framework, little has been implemented for desktop software beyond the reference libraries [3, 4]. Because individual web archivists might use a variety of online web archiving services for preservation based on scale or crawl quality, we anticipate that WASAPI will be a step toward liberating captures from services. The integration of WASAPI into additional tools and services will also serve as an opportunity for users to redundantly and systematically back up their WARCs for local replay.

In this presentation we detail our efforts to facilitate WASAPI integration into desktop-based web archiving software. We describe our implementation in the Web Archiving Integration Layer (WAIL) [5] desktop application to allow clients to retrieve their captures from services supporting the server component of WASAPI for local replay using the bundled OpenWayback [7]. OpenWayback currently lacks built-in support for WASAPI, so for a user to import their WARCs for local replay from a service like Archive-It or Webrecorder, they would need to be programmatically savvy or familiar with command-line tools. We also detail our efforts toward further systematic distribution and collaboration around individuals' web captures by integrating WASAPI server and client components into the InterPlanetary Wayback (ipwb) [6] personal archive replay system. Doing so facilitates opt-in distribution of a user's WARC files for collaboration, propagation, and integration into a distributed set of web archive replay systems.
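The client side of such an integration might, for example, parse a WASAPI webdata response to learn which WARC files are available for download. The field names below mirror those in the WASAPI data-transfer specification (files, filename, checksums, locations); the payload itself is a made-up example and the parser is a sketch, not WAIL's actual implementation:

```python
import json

def parse_webdata(payload: str):
    """Extract filename, SHA-1 checksum, and download locations from a
    WASAPI /webdata JSON response, for subsequent download and verification."""
    doc = json.loads(payload)
    files = []
    for f in doc.get("files", []):
        files.append({
            "filename": f["filename"],
            "sha1": f.get("checksums", {}).get("sha1"),
            "locations": f.get("locations", []),
        })
    return files

# Illustrative response body, not taken from any real service.
SAMPLE = """
{
  "count": 1,
  "files": [
    {
      "filename": "EXAMPLE-20200101000000-00000.warc.gz",
      "checksums": {"sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
      "locations": ["https://example.org/webdata/EXAMPLE-20200101000000-00000.warc.gz"]
    }
  ]
}
"""
```

A desktop client would iterate over the returned entries, fetch each location, and verify the checksum before handing the WARC to a local replay system.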

With these initial integrations of the WASAPI server and client components into desktop software, we hope to encourage other desktop archiving software to integrate and implement the framework. The creation of a native, desktop-based, graphical user interface for interacting with WASAPI functionality is novel to this work: most other interfaces are either destinations on the web, which might be unsuitable for personal/private captures, or command-line interfaces, which are inaccessible to many casual personal web archivists.

[1] WASAPI, Archive-It Blog, https://archive-it.org/blog/projects/wasapi/
[2] Announcing Webrecorder API and WASAPI Support, Webrecorder Blog, October 2019, https://blog.webrecorder.io/2019/10/21/wasapi-support.html
[3] py-wasapi-client, https://github.com/unt-libraries/py-wasapi-client/
[4] wasapi-downloader, https://github.com/sul-dlss/wasapi-downloader/
[5] Web Archiving Integration Layer (WAIL), https://github.com/machawk1/wail/
[6] InterPlanetary Wayback (ipwb), https://github.com/oduwsdl/ipwb/
[7] OpenWayback, https://github.com/iipc/openwayback/

The UK Government Social Media Archive – now more comprehensive, more user friendly and searchable for the first time

Claire Newing

National Archives, UK

This presentation will demonstrate the newly relaunched UK Government Social Media Archive (UKGSMA), which was developed by The National Archives, UK and MirrorWeb. The impetus for the redevelopment project came from user experience testing feedback and from the organisation's goals with regard to archiving social media. It had three main aims:

(1) To develop a solution to capture content hosted on Flickr
(2) To re-design the UK Social Media Access pages to accommodate a significant increase in the number of social media channels being captured
(3) To provide a full text search service for archived social media content.

The project was successful and the service was re-launched in September 2019. The new channel access pages and full text search function can be accessed from The UK Government Web Archive homepage: http://nationalarchives.gov.uk/webarchive/

The UKGSMA was initially launched in 2014. At that time it provided browse access to archived content from selected UK central government Twitter feeds and YouTube channels. The Tweets and videos were captured directly from Twitter and YouTube API services and access was provided through a custom interface. New data was added to the service regularly but little else was changed.

In 2017/2018 we undertook some user research on the UK Government Web Archive service. During interviews we discovered that users were unaware of the UKGSMA and, when made aware of its existence, reported that it was not useful to them without a full text search facility.

In November 2018 Flickr announced that they would be limiting free accounts to 1000 posts only with additional posts being removed from early 2019. A check of in-scope Flickr channels revealed that several free accounts held more than 1000 posts – a significant risk. We needed to develop a way of capturing and providing access to Flickr content.

At around the same time, we responded to an organisational priority to expand the number of Twitter and YouTube channels we were capturing. A re-design of the UKGSMA access pages would be needed to display the much larger number of archived channels.

We will showcase the outcomes of the project and provide an overview of the technologies and infrastructure used.

Unlocking web and social media archives for humanities research: a critical reflection

Eveline Vlassenroot1, Friedel Geeraert2, Sally Chambers1,3, Peter Mechant1, Fien Messens2, Julie M. Birkholz 1,3

1imec-mict-Ghent University, 2KBR (Royal Library of Belgium), 3GhentCDH, Ghent University

The perceived challenges of using web and social media archives for research are well-documented in recent publications and research initiatives such as the BUDDAH project[1], RESAW[2] and most recently WARCNet[3]. One of the most important challenges web archiving institutions need to overcome is the lack of awareness of the existence of web and social media archives in the research community[4]. Given that archived web content is relatively new research material, new skills need to be acquired to work with this content which is not something every researcher is willing to do[5]. Access restrictions to web archives can also vary greatly between institutions, ranging from very restricted access to being freely accessible online to all.

In addition, different policy decisions and legislation shape web and social media archives and influence both the source material that is put at the disposal of researchers and the way in which this material is made accessible. A critical reflection is therefore necessary about how web archives can be unlocked for digital humanities research. Gaining insight into the needs and requirements of users of web archives is essential, especially for web archives currently being developed (as is the case in Belgium). However, web archiving institutions often do not have much information about the use of their respective web archives[6].

This presentation discusses the results of a survey targeted at researchers (n=145) that was done in the context of the PROMISE project (PReserving Online Multiple Information: towards a Belgian StratEgy) that aims to develop a pilot web archive for Belgium on the federal level[7]. The aim of the survey was to study what the requirements and needs of researchers are and how they want to access, use and consult web archives, if at all. The analysis is linked to three general themes: 1) What is a web archive according to researchers? 2) How are web archives used now? 3) What challenges do researchers perceive when using web archives? During this presentation we also explore the operational and information navigation skills of these respondents and how they are related with these three general themes and socio-demographic information that was collected during the survey.

Furthermore, we will draw on our experiences in the BE-Social project, which aims to develop a sustainable strategy for archiving and preserving social media in Belgium[8] and will pilot access to the social media archive for researchers, as well as on the KBR Digital Research Lab[9], which serves to facilitate text and data mining research on KBR’s diverse, multilingual digitised and born-digital collections, and DATA-KBR-BE[10], which facilitates data-level access to KBR’s digitised and born-digital collections for digital humanities research.

[1] In the BUDDAH (Big UK Domain Data for the Arts and Humanities) project, a number of bursaries were awarded to researchers for carrying out research in their subject area using the UK web archive. (BUDDAH, 2014).

[2] RESAW stands for Research Infrastructure for the Study of Archived Web Material and has been established ‘with a view to promoting the establishing of a collaborative European research infrastructure for the study of archived web material.’ (RESAW, 2012).

[3] The WARCNet (Web ARChive studies network researching web domains and events) network promotes national and transnational research to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives. (WARCNet, 2020).

[4] Winters, J. (2017). Breaking into the mainstream: demonstrating the value of internet (and web) histories. Internet Histories, 1(1-2), 173-179. https://doi.org/10.1080/24701475.2017.1305713

[5] Winters, J. (2017). Breaking into the mainstream: demonstrating the value of internet (and web) histories. Internet Histories, 1(1-2), 173-179. https://doi.org/10.1080/24701475.2017.1305713. Gebeil, S. (2016). Quand l’historien rencontre les archives du web. Revue de la BnF 53(2), 185-191.

[6] The 2016 NDSA survey for example showed that only 19% of the 104 respondents were sure that their web archive is used by active researchers while 30% answered ‘no’ and 51% ‘Don’t know’. Bailey, J., Grotke, A., McCain, E., Moffatt, C. & Taylor, N.(2017). Web archiving in the United States: a 2016 survey. Washington: National Digital Stewardship Alliance. Retrieved from https://ndsa.org/documents/WebArchivingintheUnitedStates_A2016Survey.pdf. Last accessed on 28/08/2018.

[7] The PROMISE project (2017-2019) was initiated by the Royal Library and the State Archives of Belgium. The universities of Ghent and Namur and the university college Bruxelles-Brabant are partners in the project. For more information see: https://www.kbr.be/en/promise-project.

[8] The BE-Social project (2020-2022) was initiated by KBR, the Royal Library of Belgium in collaboration with the universities of Ghent, Namur and Louvain. For more information see: https://www.kbr.be/en/projects/besocial/

[9] The KBR Digital Research Lab is a unique long-term cooperation with the Ghent Centre for Digital Humanities, Ghent University. For more information see: https://www.kbr.be/en/projects/digital-research-lab/

[10] The DATA-KBR-BE project (2020-2022) is led by KBR, the Royal Library of Belgium in collaboration with the universities of Ghent and Antwerp. For more information see: https://www.kbr.be/en/projects/data-kbr-be/

The Australian Web Archive – the legal, technical, and ethical challenges

Libby Cass

National Library of Australia

In March 2019, the National Library of Australia launched the Australian Web Archive, or the AWA. The AWA is a massive, freely accessible collection of content that provides a historical record of the development of world wide web content in Australia over more than two decades. It makes a huge body of previously inaccessible content readily and freely available through Trove, and closes the 'digital gap' that exists in the documentary record of Australia through the nineties and the early 21st century. Many national libraries around the world have internet archives, but Australia's is one of the first fully searchable platforms.

The AWA houses roughly 600 TB of data and includes billions of Australian .au domain web pages. The AWA integrates access to PANDORA archived websites, the Library’s long-running curated and collaborative web archiving program, and the bulk-harvested content of the Australian Government Web Archive (AGWA). The AWA also includes over 20 years of annual whole-of-domain harvests that have captured what was important at a point in time. The AWA exposes the content of the whole-of-domain archive, as well as providing full-text searching of the combined archive for the first time.

This work firmly anchors the National Library and its partners to the critical role of collecting, preserving, and making accessible documentary resources relating to Australia and the Australian people in our digital age. Users can perform full-text searches within the archive, and the collection is accessible to users outside the Library’s physical buildings.

While there are enormous benefits to this unparalleled public access to nationally significant web collection material, the Library recognised that the ability to conduct unmediated searches of a large body of web archive content raised a number of risks. This presentation will explore the legal issues related to collecting and accessing content and the ethical concerns and technical challenges that the Library grappled with in order to deliver the AWA.

The Library understood that some members of the public might consider certain material to be controversial, distasteful or morally objectionable. Nevertheless, to preserve the integrity and value of the collection as a historical record and to achieve the AWA's objectives, it was not the Library's intent to 'censor' the web archive.
There was no automated technology solution available that could identify and remove website content of a particular nature with 100 per cent accuracy without also removing content that should remain in the collection. The Library created a new approach to these issues, which included a new takedown workflow for objectionable content, a new relevance ranking for safe searches, and mitigation of the technical risks relating to the availability of the service.

The AWA was the most technologically challenging search project undertaken by the Library. The Library took advantage of technological developments to design and deliver a user-friendly digital architecture and operational web archive platform through which the public could search, navigate, and make use of the collection.

Making the most of Legal Deposit in the UK Web Archive

Jason Webber

British Library

This year, the UK Web Archive will have been archiving websites for 15 years. In 2005 this was on a strictly selective basis and only with the express permission of the website owner. The benefit of seeking permission was that all of the websites collected this way could be presented publicly through www.webarchive.org.uk.

In 2013 the Non-Print Legal Deposit (NPLD) Act came into force allowing UK legal deposit libraries to collect, without permission, any digitally published UK material including websites. This meant that millions, rather than thousands of websites could be collected each year and, for the first time, a true representative sample of the entire UK web.

The downside of collecting such a vast and rich source of data about the UK is that access is only allowed via UK Legal Deposit reading rooms. Not only are users restricted to attending one of seven locations in the UK and Ireland, they must also register for a pass and then use only our terminals under very strict terms of use.

Whilst many potential users of the legal deposit libraries' physical collections completely understand the need to come to a reading room to see them, many cannot understand why they can't view copies of historic websites from the comfort of their home or office. The result is that even when someone has heard of the UK Web Archive, few will actually come into a reading room to view it. This is the access paradox: the 'open' collection, though tiny, gets considerably more use than the vast 'reading room only' collection. In addition to the limitations on researchers doing qualitative research, text and data mining are also restricted. How can we, therefore, make the most of the archive within these limitations now and argue for fewer restrictions in the future?

Over the last few years the UKWA has taken a number of steps to remove at least a few of the barriers to using the web archive collection. Most visibly, a new user interface allows users to search the archive from anywhere and offers essentially the same experience whether a user is in a reading room or not. There is also an ongoing programme to engage with existing users of the reading rooms, as they have already crossed the significant barrier of acquiring a reader's pass.

On the quantitative side, we have engaged in projects with the Alan Turing Institute, whose researchers have written code to examine large amounts of web data for changes in word meaning over time. Andy Jackson, UKWA technical lead, has been able to take this code and apply it to the legal deposit content.

The ideal position, of course, would be to offer open access to material collected under legal deposit. How might this be achieved? The NPLD Act directed that a review of the regulations be conducted five years after implementation, which is an opportunity to assess what has been achieved and what might be possible.

To this end, UK legal deposit liaison staff have spent the last several years delicately negotiating and advocating for conditional open access to NPLD material, on the condition that any website owner could request that their site be made 'reading room only'. The review is currently in a period of consultation. Should open access be granted, it would revolutionise access to and use of the UKWA. Even if it is not, there is hope that smaller elements of the regulations can be clarified to enable easier and better use of this resource.

Platform and app histories: Assessing source availability in web archives and app repositories

Anne Helmond1, Fernando van der Vlist2

1University of Amsterdam, 2Utrecht University

In this presentation, we discuss the research opportunities for historical studies of apps and platforms by focusing on their distinctive characteristics and material traces. We demonstrate the value and explore the utility and breadth of web archives and software repositories for building corpora of archived platform and app sources. Platforms and apps notoriously resist archiving due to their ephemerality and continuous updates. As a result of rapid release cycles that enable developers to develop and deploy their code very quickly, large web platforms such as Facebook and YouTube change continuously, overwriting their material presence with each new deployment. Similarly, the pace of mobile app development and deployment is only growing, with each new software update overwriting the previous version. As a consequence, their histories are being overwritten with each update, rather than written and preserved. In this presentation, we consider how one might write the histories of these new digital objects, despite such challenges.

When thinking of how platforms and apps are archived today, we contend that we need to consider their specific materiality. With the term materiality, we refer to the material form of those digital objects themselves as well as the material circumstances of those objects that leave material traces behind, including developer resources and reference documentation, business tools and product pages, and help and support pages. We understand these contextual materials as important primary sources through which digital objects such as platforms and apps write their own histories with web archives and software repositories.

We present a method to assess the availability of these archived web materials for social media platforms and apps across the leading web archives and app repositories. Additionally, we conduct a comparative source set availability analysis to establish how, and how well, various source sets are represented across web archives. Our preliminary results indicate that despite the challenges of social media and app archiving, many material traces of platforms and apps are in fact well preserved. The method is not just useful for building corpora of historical platform or app sources but also potentially valuable for determining significant omissions in web archives and for guiding future archiving practices. We showcase how researchers can use web archives and repositories to reconstruct platform and app histories, and narrate the drama of changes, updates, and versions.

End to end integration of NFB's web interactive works archive with WebRecorder

Jimmy Fournier

National Film Board of Canada

The digitization plan put in place by the NFB in 2008 had 3 objectives:
1. The preservation of NFB works in the digital world
2. The restoration of NFB works in the digital world
3. The accessibility of NFB works on various digital platforms

It is with these objectives that the NFB has put in place processes and technologies to digitize its 14,000 works, the majority of which come from analog film sources.

These processes also supported the NFB's digital shift for new productions created entirely from digital sources. The file formats chosen for archiving are open, not proprietary.

Since 2009, the NFB has also produced interactive works that are broadcast exclusively on the web. The preservation and long-term accessibility of these works pose a major challenge. The NFB collaborated on the development of an archiving solution for interactive web content based on the ISO-standardized WARC format.
This presentation will demonstrate the concepts and strategies put in place by the NFB's research and development team.

The presentation will cover four topics: creating the archive, preserving it, valorizing it, and making it accessible.

Website Defacements: Finding Hacktivism in Web Archives

Michael Kurzmeier

National University of Ireland, Maynooth

This paper will provide insight into the archiving and utilization of defaced websites as ephemeral, non-traditional web resources. Web defacements as a form of hacktivism are rarely archived and thus mostly lost to systematic study. When they do find their way into web archives, it is often as a by-product of a larger web archiving effort rather than the result of a targeted one. Aside from large collections such as the Internet Archive, which might pick up a few hacked pages during a crawl, there also exists a small scene of community-maintained cybercrime archives that preserve hacked websites, some of which were defaced in a hacktivist context.

By examining sample cases of cybercrime archives, the paper will show the ephemerality of their content. As defaced websites are usually quickly restored, they are especially vulnerable to deletion and thus depend on specialized web archiving services to be preserved. Mostly, this archival work has been done by community-maintained cybercrime archives, whose collections of hacked websites contain a number of sites defaced by hacktivists. Those archives exist in a mostly suspended state, with most no longer accepting new submissions (Samuel, 2014), and a number of them have disappeared from the Web altogether. These archives represent ephemeral resources that can give insight into the digital underground of the past.

Defacing websites is a way of breaking into mainstream discourse, of making yourself heard even by audiences who might not want to listen. It combines a critique of technology with practices of détournement and adbusting aimed at altering the material substrate of the media (Lievrouw, 2011; Jordan, 2015). Discussions about the materiality of memory are dominated by discussions of capacity. As Ori Schwarz (2014, 3) puts it: “We used to, as a rule, forget. Now [with electronic media] we have the power of recall and retrieval at a scale that will decisively change how our society remembers.” Understanding the materiality of mediated memory as a struggle for capacity means understanding it as a struggle for the means of production of the respective medium. Capacity can take the form of air time, of a struggle for printing presses, or of forms like pirate radio stations or murals. This approach also extends to the control and subversion of the circulation of memory objects through media. Capacity means nothing less than the ability to buy or rent what constitutes a medium, such as paper, wires, frequencies, senders and receivers. Subversion can take the role of accessing this carefully arranged system without paying at the door.

Hacking and altering webpages – defacements – for the expression of a political message is one form of hacktivism. My paper will present a methodological framework for academic engagement with those archival traces.

Digital activism, and more specifically hacktivism, is at the core of my doctoral research, which entails the study of defaced websites as a primary source and offers an overview of the current state of archived content, lost content, and academic engagement with it.


Jordan, Tim. 2015. Information Politics: Liberation and Exploitation in the Digital Society. Digital Barricades : Interventions in Digital Culture and Politics. London: Pluto Press.

Lievrouw, Leah A. 2011. Alternative and Activist New Media. Digital Media and Society Series. Cambridge, UK ; Malden, MA: Polity.

Samuel, Alexandra. 2014. ‘Hacktivism and the Future of Political Participation’. Cambridge, Massachusetts: Harvard University. http://www.alexandrasamuel.com/dissertation/pdfs/Samuel-Hacktivism-entire.pdf.

Schwarz, Ori. 2014. ‘The Past next Door: Neighbourly Relations with Digital Memory-Artefacts’. Memory Studies 7 (1): 7–21. https://doi.org/10.1177/1750698013490591.

Summarize Your Archival Holdings With MementoMap

Sawood Alam1, Michele Weigle2, Michael Nelson2, Daniel Gomes3

1Internet Archive, 2Old Dominion University, 3Arquivo.pt

In 2015 we worked on an IIPC-funded pilot project called “Web Archive Profiling via Sampling”. The goal of the project was to create a high-level summary of the holdings of web archives. Among many use cases, the primary focus of this work was to enable efficient Memento aggregation. A rudimentary Memento aggregator might broadcast every lookup request it receives to every web archive it knows, causing unnecessary and wasteful traffic to smaller archives and slow response times. By knowing the holdings of each archive, an aggregator can make informed routing decisions and only poll archives likely to return positive responses. Profiling archives based on what they hold contrasts with profiling based on responses to requests, as in the LANL Time Travel URI routing classifier; the former reflects an archive's accession policies, the latter its usage patterns. Our initial exploration was based on the CDX datasets from Archive-It, the UK Web Archive, and the Stanford Web Archive Portal. We submitted the final project report and findings in 2016. However, we continued to work on the idea of archive profiling beyond this initial exploration, making it more practical, flexible, robust, and scalable.

With the lessons learned from our initial exploration, we made some major improvements to the specification of archive profiles, which we now call MementoMaps. Instead of using various fixed profiling policies (e.g., H3P1: “uk,co,bbc,)/news” or DDom: “uk,co,bbc,)/”), we now support wildcard-based URI keys (e.g., “uk,co,bbc,)/news/*”). This allows the seamless merger of an arbitrary number of independently generated profiles for an archive, or their splitting into smaller pieces for easy maintenance, which was not possible under the earlier rigid profiling policies. We have also added support for blacklist entries, which specify URI keys that an archive does not hold, such as not sending quora.com requests to the Internet Archive.
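As a sketch of the idea, wildcard URI keys of this kind can be matched against SURT-style lookup keys with simple prefix logic. The Python below is illustrative only, not the MementoMap reference implementation; the key format follows the examples above, and the canonicalization shown (e.g., dropping "www.") is a simplifying assumption.

```python
from urllib.parse import urlparse

def surt_key(url: str) -> str:
    """Convert a URL into a simplified SURT-style lookup key
    (e.g. https://www.bbc.co.uk/news -> uk,co,bbc,)/news)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if host.startswith("www."):          # simplifying assumption
        host = host[4:]
    host = ",".join(reversed(host.split(".")))
    path = parts.path.rstrip("/")
    return f"{host},){path or '/'}"

def matches(uri_key: str, lookup_key: str) -> bool:
    """True if a (possibly wildcard) URI key covers the lookup key."""
    if uri_key.endswith("*"):
        return lookup_key.startswith(uri_key[:-1])
    return lookup_key == uri_key

# An aggregator would only route a request to this archive if some
# URI key in its MementoMap covers the lookup key.
profile = ["uk,co,bbc,)/news/*"]   # what the archive claims to hold
key = surt_key("https://www.bbc.co.uk/news/world-12345")
print(any(matches(k, key) for k in profile))   # prefix match succeeds
```

Because wildcard keys are plain prefixes, independently generated profiles can be concatenated and still queried with the same matching logic, which is what makes merging and splitting profiles straightforward.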

Recently, we processed the entire CDX dataset of Arquivo.pt, generated MementoMaps at various levels of detail, and evaluated them against the access log of MemGator (a low-traffic Memento aggregator service running at Old Dominion University) spanning more than three years. We found that over 94% of MemGator traffic to upstream archives was wasteful. A MementoMap file containing less than 1.5% as many URI keys as a comprehensive list of all the unique original URIs in the archive can eliminate more than 60% of the wasteful traffic without compromising recall (i.e., zero false negatives). Among other factors, we also analyzed the overlap between what web archives hold and what people look for in those archives.

We have released our implementation (https://github.com/oduwsdl/MementoMap) for generating and maintaining MementoMaps which can be integrated with existing archival replay systems and CDX servers. We think the MementoMap framework and associated Unified Key Value Store (UKVS) file format are ready for adoption by Web archives.

MementoEmbed and Raintale for Web Archive Storytelling

Shawn Jones1, Martin Klein1, Michael Nelson2, Michele Weigle2

1Los Alamos National Laboratory, 2Old Dominion University

Web archive collections may consist of thousands of archived web pages, or Mementos. For traditional library collections, archivists can select a representative sample from the collection to share with visitors, but how should they display this sample to drive visitors to their collection or archived pages? Search engines and social media platforms often represent web pages as cards consisting of text snippets, titles, and images. Web storytelling is a popular method for grouping these cards in order to summarize a topic, as demonstrated by tools such as Storify and Wakelet. However, Storify shut down in 2018 and Wakelet has no public API. We evaluated fifty alternative tools, such as Facebook, Pinboard, Instagram, Sutori, and Paper.li, and found that they are not reliable for producing cards for Mementos. Existing services that generate cards are not archive-aware and will, for example, list “archive.org” as the source of the content rather than the domain of the archived page itself.

Thus we developed MementoEmbed, an archive-aware service that can generate different types of visualizations for a given Memento. MementoEmbed can generate cards that appropriately attribute content to its original resource, including both the original domain and associated favicon, as well as providing a striking image, a text snippet, and title. MementoEmbed can also produce browser thumbnails of varying sizes and viewports, or, if a user desires, MementoEmbed can produce an animated GIF of the top-ranked images extracted from the Memento. For machine applications, MementoEmbed has an extensive API providing information about an individual Memento. Via this API, a client can obtain information about the Memento, such as its seed metadata, its original resource URL, and its web archive collection name. The API also provides output of the conducted content analysis such as paragraph ranking, sentence ranking, and computed ranked values for embedded images.
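A client of such an API might simply construct request URLs for the desired visualization and fetch them. The sketch below is a hedged illustration: the endpoint paths and default port are assumptions modeled on MementoEmbed's documented service pattern and should be verified against a running instance.

```python
def memento_embed_url(service: str, urim: str,
                      base: str = "http://localhost:5550") -> str:
    """Build a request URL for a MementoEmbed-style service, given an
    archived page's URI-M. Paths below are assumptions, not a spec."""
    endpoints = {
        "contentdata": "/services/memento/contentdata/",  # title, snippet, image data
        "socialcard":  "/services/product/socialcard/",   # embeddable card
        "thumbnail":   "/services/product/thumbnail/",    # browser thumbnail
    }
    return base + endpoints[service] + urim

urim = "https://web.archive.org/web/20200101000000/https://example.com/"
print(memento_embed_url("socialcard", urim))
```

A script could then GET each URL and embed the returned card HTML or image wherever the collection is showcased.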

Raintale is a companion tool that accepts a list of Memento URLs and generates full stories based on the information provided by the MementoEmbed API. Natively, Raintale contains presets, allowing archivists to export the stories via formats such as HTML, MediaWiki, Jekyll, or Markdown. Alternatively, Raintale can export stories to social media services such as Twitter. In addition, archivists can exploit Raintale's template engine and develop their own templates, allowing archivists to customize the look and feel of their stories by incorporating and manipulating the data provided by MementoEmbed.

We envision Raintale and MementoEmbed to be critical components for summarizing collections of archived web pages in interfaces familiar to general users. We developed MementoEmbed so that its API is easily usable by scripts and other automated tools. Raintale's nature lends itself to easy incorporation into existing automated archiving workflows. In this presentation we will outline the motivation for these tools, describe their functionality, and briefly demonstrate their use.

Interactive Collage of Websites: A deep dive into the Web Archive Switzerland

Maya Bangerter, Kai Jauslin, Barbara Signori

Swiss National Library

The archiving and access system for the Swiss National Library’s digital collections was migrated to a new infrastructure in 2018-2019. This gave the National Library the opportunity to fundamentally redesign access to its web archive. The current application was realized with the assistance of two external partners and went live in June 2019.

The access system e-Helvetica Access (https://www.e-helvetica.nb.admin.ch) now offers a unified front end for integrated search across all of the library's digital collections. Both bibliographic metadata and full text can be searched, for material in the web archive as well as all other publications. The index for the web archives was created with warc-indexer and enriched with bibliographic metadata. The new interface offers both pywb and OpenWayback for consulting the web snapshots. As a special feature, screenshots of the start pages of archived websites are generated and displayed in the hit list alongside hits in context.

In order to generate the screenshots, each archived website is opened in pywb via a headless browser and a picture of the start page is created. This is then scaled to the required image sizes and delivered and cached via an IIIF server. With this technique, the National Library continues an image-generation approach to web pages that it has been using for some time for quality control in harvesting. Technically, the integration of all necessary components is achieved through a fully container-based service infrastructure. On start-up, approximately 40,000 images were generated on our infrastructure in 36 hours; images are created on demand for newly archived snapshots. Currently, the screenshots are subject to the same access restrictions as the archived websites themselves. With the revised Swiss Data Protection Act, however, it will be possible to show the pictures without restrictions.
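The delivery step can be illustrated with the standard IIIF Image API URL pattern. The base URL and identifier scheme below are hypothetical; only the {region}/{size}/{rotation}/{quality}.{format} structure comes from the IIIF Image API specification.

```python
from urllib.parse import quote

def iiif_url(identifier: str, width: int,
             base: str = "https://iiif.example.org/image") -> str:
    """Request the full screenshot scaled to a given width using the
    IIIF pattern {region}/{size}/{rotation}/{quality}.{format}."""
    # IIIF identifiers containing slashes must be percent-encoded
    return f"{base}/{quote(identifier, safe='')}/full/{width},/0/default.jpg"

# e.g. a hit-list thumbnail and a larger collage tile of one screenshot
for w in (200, 640):
    print(iiif_url("webarchive/20190601/example.ch", w))
```

Since every derivative is just a different `size` parameter on the same identifier, the IIIF server can scale and cache each variant independently of the screenshot pipeline.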

With automatically generated images, Web Archive Switzerland can present itself in a visually appealing way to a wider audience, and a range of further applications becomes possible. On the occasion of Museum Night 2020, the National Library will provide a clickable and zoomable collage of the screenshots on two large touchpads for all visitors. Screenshots could also be provided via IIIF manifests for research purposes. Auto-generated images likewise enable meaningful visualizations of video, Flash, and other formats.

Ethical Approaches to Researching Youth Cultures in Historical Web Archives

Katie Mackinnon

University of Toronto

Over the past 25 years the web has become an “unprecedentedly rich primary source…it is where we socialise, learn, campaign and shop. All human life, as it were, is vigorously there” (Winters, 2017). Web archives pose new challenges for historians who must learn how to “navigate this sea of digital material” (Milligan, 2012) - and as an increasingly important resource for writing social, cultural, political, economic, and legal histories, historians must also consider how materials are created by young people. Contemporary studies on digital media have paid particular attention to youth participation, cultures and communities (Turkle, 1995; Kearney, 2006; Scheidt, 2006; Ito et al., 2010; boyd, 2014; Vickery, 2017; Watkins et al., 2018), which often depends on the preservation of web content. The early web communities of GeoCities that are available on the Internet Archive, like neighbourhoods WestHollywood (LGBTQ+) or EnchantedForest (Youth), are a unique and incredibly fruitful resource for studying youth participation in the early web (Milligan, 2017) in a way that gives youth voices autonomy and agency.

New challenges emerge when applying computational methodologies and tools to youth cultures in historical web archives at scale, not only due to the vulnerability of the research subjects but also due to the tools used to extract their data. The EU's “Right to be Forgotten” (2014) and the GDPR (2018) call into question the regularity with which young people become “data subjects” through their proximity to social networking sites, whether through family, friends or themselves. Young people's data is subject to commodification, surveillance, and archiving without consent, and researchers engaging with archived web material have a responsibility to develop better practices of care. This paper surveys ethical approaches to internet and web archive research (AoIR IRE 3.0, 2019; Lomborg, 2018; Schäfer & Van Es, 2017; Whiteman, 2012; Weltevrede, 2016), identifies gaps in the study of historical web youth cultures, and suggests next steps. It specifically considers the challenges of researching and writing about young people who divulged personal details about their lives without the possibility of informed consent, and it further develops frameworks for ethically researching young people's archived web that account for the sensitive nature of web materials (Adair, 2018; Eichhorn, 2019) and the ways in which computational methods and big data research often fail to anonymize data (Brügger & Milligan, 2018). Web history research puts living human subjects at the forefront of historical research, something historians are not particularly well versed in but need to establish practices for.

Works Cited:
Adair, Cassius. (2019). “Delete Yr Account: Speculations on Trans Digital Lives and the Anti-Archival.” Digital Research Ethics Collaboratory. http://www.drecollab.org/

Brügger, Niels and Ian Milligan. (2018). The SAGE Handbook of Web History. London: Sage.

Bruckman, Amy, Kurt Luther, and Casey Fiesler. 2015. “When Should We Use Real Names in Published Accounts of Internet Research?,” in Eszter Hargittai and Christian Sandvig (eds) Digital research confidential: the secrets of studying behavior online. Cambridge, Mass: MIT Press.

DiMaggio, P., E. Hargittai, C. Celeste and S. Shafer. (2004). “Digital inequality: From unequal access to differentiated use.” In Social Inequality, ed. K. Neckerman. Russel Sage Foundation.

Eichhorn, Kate. 2019. The end of forgetting: growing up with social media. Cambridge, Mass: Harvard University Press.

franzke, a.s., Bechmann, A., Zimmer, M. & Ess, C.M. (2019). Internet Research: Ethical Guidelines 3.0, Association of Internet Researchers, www.aoir.org/ethics.

Ito et al. (2010). Hanging Out, Messing Around, and Geeking Out: Kids Living and Learning with New Media. MIT Press.

Jenkins, H., M. Ito, and d. boyd. (2016). Participatory Culture in a Networked Era: A Conversation on Youth, Learning, Commerce, and Politics. Polity.

Kearney, M. C. (2006). Girls Make Media. Routledge.

Kearney, M. C. (2007). “Productive spaces girls’ bedrooms as sites of cultural production spaces.” Journal of Children and Media, 1, 126-141.
Lincoln, S. (2013). “I’ve Stamped My Personality All Over It”: The Meaning of Objects in Teenage Bedroom Space.” Space and Culture, 17(3), 266–279.

Lomborg, Stine. (2018). “Ethical Considerations for Web Archives and Web History Research,” in SAGE Handbook of Web History, eds. Niels Brügger and Ian Milligan.

Milligan, Ian. (2017). “‘Pages by Kids, For Kids’: Unlocking Childhood and Youth History through Web Archived Big Data,” in The Web as History, eds. Niels Brügger and Ralph Schroeder, UCL Press.

Schäfer, Mirko Tobias, and Karin Van Es. (2017). The datafied society: studying culture through data. Amsterdam University Press.

Scheidt, L. A. (2006.) “Adolescent diary weblogs and the unseen audience.” In Digital Generations: Children, Young People, and New Media, ed. D. Buckingham and R. Willet. Erlbaum.
Skelton T. and Valentine G. (1998). Cool Places: Geographies of Youth Cultures. Routledge.

Turkle, Sherry. (1995). Life on the Screen: Identity in the Age of the Internet, Simon and Schuster.

van Dijck, José, Thomas Poell, and Martijn de Waal. (2018). The Platform Society; Public Values in a Connective World. New York: Oxford University Press.

Vickery, J. R. (2017). Worried about the wrong things: Youth, risk, and opportunity in the digital world. Cambridge, MA: MIT Press.

Watkins, S. C. et. al. (2018). The Digital Edge: How Black and Latino Youth Navigate Digital Inequality. NYU Press.

Weltevrede, Esther. (2016). Repurposing Digital Methods: The Research Affordances of Platforms and Engines. PhD dissertation, University of Amsterdam.

Whiteman, Natasha. (2012). “Ethical Stances in (Internet) Research.” In Undoing Ethics, by Natasha Whiteman, 1–23. Boston, MA: Springer US, 2012.

Winters, Jane. (2017) “Breaking in to the mainstream: demonstrating the value of internet (and web) histories,” Internet Histories, 1:1-2, 173-179, DOI: 10.1080/24701475.2017.1305713


Accessible Web Archives: Rethinking and Designing Usable Infrastructure for Sustainable Research Platforms

Samantha Fritz

University of Waterloo

We are a society that consumes and curates information at a pace unimaginable prior to the development of the World Wide Web in 1991. For almost three decades, we’ve witnessed an exponential growth in digital content as well as our reliance on it, which in turn has drastically changed the way we preserve, disseminate and study cultural information.

Since 1996, memory institutions have increasingly assumed roles aimed at safeguarding born-digital cultural material. Web archiving efforts and the collections they yield are essential to scholarly inquiry into topics from the mid-1990s to the present. Yet despite the tremendous opportunity these collections offer, access remains a significant barrier to their use, primarily due to their scale.

The primary goal of the Archives Unleashed Project is to make historical internet content accessible to scholars and others interested in researching the recent past. This has been achieved through the creation of several tools that assist librarians, archivists, researchers, and scholars to discover, explore, and analyze web archives. Accessibility is a main pillar of the project which bridges the gap between researchers, social scientists, librarians, and others with web archive collections.

Using the Archives Unleashed Project as a case study, themes of access and usability, within the context of open-source research projects, will be explored; specifically how the project team has made a conscious effort to incorporate the spirit of accessibility in all aspects of the project. Areas of discussion will include: access and usability in systems development, front-end design, learning resources, and sustainability planning.

This presentation will illustrate ways in which interdisciplinary teams can thoughtfully integrate concepts of access and usability throughout project development cycles, and how those concepts provide a base for designing usable infrastructure in a web archiving ecosystem.

More so than ever before, accessibility is at the forefront of projects, systems, and processes, and for good reason. The mentality of “if you build it, they will come” has been replaced with a more important question: “if you design it, will it be usable?” Value is derived from usability, but usability only exists if a resource is accessible.

Drawing attention with special collections

Ben Els

National Library of Luxembourg

Most national web archives face the same paradox: they preserve information that is publicly available on the Internet, yet cannot offer access to the archived material online. Raising awareness about the benefits and necessity of Internet preservation becomes a challenging task, since we have to find ways to promote the web archive without showing its actual content.

Our platform webarchive.lu offers general information about the concept of web archiving and our activities at the National Library of Luxembourg, as well as the possibility to submit websites for future crawls and contribute suggestions to special collections. The latter category represents the core piece of our website, where we present the results of different event crawls and display our ongoing thematic collections. Alongside information about the context of each collection (the objectives and coverage achieved by the crawls), users can download the seed list and learn more about our approach to each subject, new experiences and difficulties that we have encountered.

We would like to highlight our election crawls from 2017 and 2018, focused on harvesting national media coverage. By tagging individual news articles in different categories, we are able to sort thousands of pieces by publisher, political party, publication date, towns and candidates. This analysis, based on the Kibana tool, is interactive and will be available online. It allows for countless variations of queries and different angles of looking at the online media coverage of our last elections.

In this paper we illustrate our efforts to raise awareness about our web archiving initiative, by spotlighting the special collections on webarchive.lu, with an easily accessible exploration of the captured information and a descriptive presentation about topics with universal appeal.

Policies and processes for ingesting WebRecorder WARCS into the UK Web Archive

Nicola Bingham

British Library

The UK Web Archive has recently attempted to archive complex web content such as video, audio and other types of multimedia, using Webrecorder. This presentation will discuss the policies and procedures for ingest and access that have been developed in order for the British Library to accept WARCs crawled outside the normal workflow into the UK Web Archive collection.

The British Library’s ACT (Annotation Curation Tool) is a system which interfaces with the Internet Archive’s Heritrix crawl engine to provide large-scale captures of the UK web. It was designed to cope with archiving at scale following implementation of the UK Non-print Legal Deposit Regulations in 2013. While ACT copes well with archiving the web at scale, it does not provide the high-fidelity capture necessary to archive more complex websites.

To this end, the UK Web Archive team has been experimenting with Webrecorder in its different iterations: the Webrecorder Desktop app (https://github.com/webrecorder/webrecorder-desktop), the integrated hosting service Conifer from Rhizome (https://conifer.rhizome.org/, formerly https://webrecorder.io/), and the ArchiveWeb.page Chrome browser extension (https://chrome.google.com/webstore/detail/webrecorder-archivewebpag/fpeoodllldobpkbkabpblcfaogecpndd).

The presentation will look at two archiving case studies at the UK Web Archive. Firstly, a donation of WARCs created by a third party outside of the Library's workflow and donated for preservation through a voluntary agreement. Secondly, archiving the manonabeach® oral history project featuring over 1,300 filmed answers to the question “What does the beach mean to you...?”

While the archiving capabilities of Webrecorder are well documented, the processes and policies for ingesting externally crawled WARCs into a national collection had to be addressed: when we started the project, we did not have a standard workflow for ingesting this content into the collection. Of particular interest is how we can use external tools such as Webrecorder in a way which is compliant with the Non-print Legal Deposit Regulations, which are quite prescriptive about the web harvesting technologies used.

The presentation will discuss:

- Two case studies in which Webrecorder was used to acquire web content in a particular project context.
- The workflow for storing and indexing the WARCs for exposure in the Archive.
- Policies, challenges and limitations in being compliant with UK Non-print Legal Deposit Regulations.
- Creating and exposing metadata associated with the WARCs.
- Selection policies, resource capabilities and training.

Perpetual Access to Open Scholarship through Web Archiving

Jefferson Bailey

Internet Archive

Since 2018, the Internet Archive has pursued a large-scale project to build as complete a collection as possible of scholarly outputs published on the web, as well as improve the discoverability and accessibility of scholarly works archived as part of these and other global web harvests. This project involves a number of areas of work:

-- targeted archiving of known OA publications (especially at-risk “long tail” publications)
-- extraction and augmentation of metadata and full text, automated systems for continual identification and harvesting of web-published OA materials
-- integration and preservation of related identifier, registry, and aggregation services and datastores
-- partnerships with affiliated initiatives and joint service developments
-- creation of new tools and machine learning approaches for identifying scholarly work in pre-existing global scale web collections.

The project also identifies and archives associated research outputs such as blogs, datasets, code repos, and other secondary research objects from the web. The current beta API and public interface, codenamed fatcat, can be found at https://fatcat.wiki/. This talk will discuss the project’s current status and upcoming work focusing on content acquisition, indexing, discoverability, the role of machine learning, related code and software releases, service provisioning, and the project’s collaborations with libraries, publishers, and other partners. The project is also working with a number of national libraries to test the project’s machine-learning-driven tools for identifying scholarly outputs in national ccTLD domain crawls. Conceptually, the project demonstrates that the scalability and technologies of “archiving the web” can facilitate automated ingest, enrichment, and dissemination strategies for a variety of web-published primary and secondary scholarly record types that have traditionally been collected via more custom and manual workflows. The project’s strategic goal is to leverage the scale, automation, and open infrastructure of web archiving approaches to ensure perpetual discoverability and access to archived scholarship.


Supporting Research Use of Web Archives: A ‘Labs’ Approach

Julie M. Birkholz1,2, Marie Carlin3, Sally Chambers1,2, Katrine Hoffman Gasser4, Olga Holownia5, Andrew Jackson6, Anders Klindt Myrvoll4, Tim Sherratt7 & Vladimir Tybin3

1KBR (Royal Library of Belgium), 2GhentCDH, Ghent University, 3BnF, 4Royal Danish Library, 5IIPC, 6British Library, 7GlamWorkbench 

The use of the archived web as an object of research remains at the fringes of (digital) humanities research (Winters, 2017). While a number of surveys and studies have identified common challenges and researchers’ requirements (e.g. Vlassenroot et al., 2019; Costea, 2018; Riley & Crookston, 2015; Stirling, Chevallier, & Illien, 2012), the conclusion that “there is still a gap between the potential community of researchers who have good reason to engage with creating, using, analysing and sharing web archives, and the actual (generally still small) community of researchers currently doing so” (Dougherty et al., 2010) largely holds true. Furthermore, as a result of legal restrictions, many web archives still remain solely accessible through dedicated computers inside (national) libraries. Additionally, archived web resources are large and complex datasets that require a relatively advanced level of digital literacy, not always at the fingertips of all humanities researchers.

In the introduction to our panel, we will consider whether the concept of ‘library labs’, as pioneered by organisations such as the British Library and exemplified through the international Galleries, Libraries, Archives and Museums (GLAM) Labs Network (Chambers et al., 2019), could be a) ideal incubators for increasing access to archived web resources, such as within national library buildings themselves, b) an environment for the inclusion of web archives as one of the many available resources alongside e.g. digitised newspapers, etc., and c) a sandbox for testing and creating tools and applications to explore and analyse both digitised and born-digital content. Our introduction will also include an overview of the current GLAM Labs landscape, focusing on how different institutions have experimented with offering datasets from their web archives as part of labs or research services.

Finally, the panel will present four case studies which will be used as inspiration for a discussion with the audience as to how ‘Labs’ initiatives can stimulate the uptake and use of web archives for research. Our case studies and the discussion will focus on four main topics: 1) key services the labs offer to researchers, 2) the type of web archive content currently available (within and outside of the lab) and how it can be accessed, 3) future plans for offering web archive datasets and tools, and 4) researcher engagement, e.g. through projects, PhD/MA placements, resident researchers and artists, etc.

Andy Jackson, British Library

Established in 2013 as one of the earliest examples of a Library Lab in Europe, British Library Labs (BL Labs) promotes, inspires and supports the use of the BL’s digital collections and data in innovative ways, through competitions, events and collaborative projects. The BL has been running a successful Labs service for many years now, but largely focussed on non-web material, and with limited computational resources. Last year, the UK Web Archive started a deeper collaboration, where Labs users and library staff can share our computational platforms and data analysis tools. This is helping to further establish this class of service as a part of the wider library, and provides an opportunity to widen the audience of the web archives too.

Anders Klindt Myrvoll (Netarkivet - the Danish web archive), Katrine Hoffman Gasser (KB Tech Lab), Royal Danish Library

Since 2005, the Royal Danish Library has collected material from the Danish part of the Internet and preserved it in our web archive. In 2011 URL-search was added and in 2015 full text search for researchers became a reality.

In 2021 SolrWayback, a powerful open-source discovery and playback platform for exploring web archives that started life as a lab-like experiment during an innovation week, was put into production for our web archive. Two KB Tech Lab projects on link graphs and N-gram visualization, along with work done on geographical search, have found their way into SolrWayback; they will be of great value to our users, but also raise legal concerns.

Researchers can use SolrWayback for feasibility studies, combined with talks with us, to specify the data they would like to extract from the web archive. We have delivered data dumps since 2018, and it is a great way to improve research based on web archive data.

KB Tech Lab seeks innovative ways to combine the library’s digital cultural heritage collections and research, with the latest methods within machine learning, data visualization and beyond. This includes different applications made by the Royal Danish Library to visualize, engage or display the different available materials, to inspire and deepen the knowledge of what collections we have, and expand their use.

The Royal Danish Library has three physical labs at the Campus at the University of Copenhagen and two are planned at the Aarhus University Campus. Both KB Tech Lab and the web archive are collaborating with these labs to raise awareness of the web archive and possibilities for users.

Julie M. Birkholz and Sally Chambers, KBR, Royal Library of Belgium and Ghent Centre for Digital Humanities, Belgium

The KBR Digital Research Lab is a unique long-term cooperation between KBR, Royal Library of Belgium and the Ghent Centre for Digital Humanities, Ghent University to facilitate text and data mining research on KBR’s diverse, multilingual digitised and born-digital collections. This includes supporting digital access to textual sources and stimulating the (re)use and study of the sources, data and metadata of these collections. Provision of data-level access to born-digital, as well as digitised collections, is facilitated through the DATA-KBR-BE project, inspired by the Collections as Data movement. For born-digital material, KBR will build on the work undertaken in the PROMISE project on piloting access to the Belgian web archive for scientific research, as well as the BE-Social project, which aims to develop a sustainable strategy for archiving and preserving social media in Belgium. Methods for providing ongoing research access to the Belgian web and social media archives will be explored.

Marie Carlin and Vladimir Tybin, Bibliothèque nationale de France (BnF), France

The BnF DataLab, opening in autumn 2021, is the Lab of the National Library of France (BnF). It aims to give access to digital corpora of digitized and born-digital collections and derivative data sets by hosting and supporting researchers and research teams from many academic fields and various levels of technical expertise. Through the use of the BnF’s digital collections and data sets, events, training sessions and collaborative projects involving BnF’s staff and researchers, the BnF DataLab will become both a physical location and a technical infrastructure to improve collection analysis, explore use cases, and deepen the knowledge of collections.

The upcoming opening of the BnF DataLab provides an opportunity to consolidate, enhance, and promote various data tools and services aimed at helping researchers explore and analyse the BnF web archive collections. Those services, such as providing metadata or derived data in various formats and offering data mining or reporting tools, have been developed in recent years on an experimental basis to meet the needs of former and current research projects working on the BnF web archive collections.

Introduction: Olga Holownia, IIPC


Chambers, S., Mahey, M., Gasser, K., Dobreva-McPherson, M., Kokegei, K., Potter, A., Ferriter, M. and Osman, R. (2019). Growing an international Cultural Heritage Labs community. Retrieved from http://doi.org/10.5281/zenodo.3271382

Costea, M.-D. (2018). Report on the Scholarly Use of Web Archives. Retrieved from http://netlab.dk/wp-content/uploads/2018/02/Costea_Report_on_the_Scholarly_Use_of_Web_Archives.pdf

Dougherty, M., Meyer, E. T., McCarthy Madsen, C., van den Heuvel, C., Thomas, A., & Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art. Retrieved from https://ssrn.com/abstract=1714997

Riley, H., & Crookston, M. (2015). Awareness and Use of the New Zealand Web Archive: A Survey of New Zealand Academics. Retrieved from https://natlib.govt.nz/files/webarchive/nzwebarchive-awarenessanduse.pdf

Stirling, P., Chevallier, P., & Illien, G. (2012). Web Archives for Researchers: Representations, Expectations and Potential Uses. D-Lib Magazine, 18(3/4). doi:10.1045/march2012-stirling

Vlassenroot, E., Chambers, S., Di Pretoro, E., Geeraert, F., Haesendonck, G., Michel, A., & Mechant, P. (2019). Web archives as a data resource for digital scholars. International Journal of Digital Humanities, 1(1), 85-111. doi:10.1007/s42803-019-00007-7

Winters, J. (2017). Coda: Web archives for humanities research: some reflections. In N. Brügger & R. Schroeder (Eds.), The Web as History: Using Web Archives to Understand the Past and Present (pp. 238-248). UCL Press: London.


Run your own full stack SolrWayback

Thomas Egense, Toke Eskildsen

The Royal Danish Library

Monday, 7 June, 16:00-17:30 CEST


This workshop will

1) Explain the ecosystem for SolrWayback 4 (https://github.com/netarchivesuite/solrwayback)

2) Perform a walkthrough of installing and running the SolrWayback bundle. Participants are invited to mirror the process on their own computers, and there will be time for solving installation problems

3) Leave participants with a fully working stack for indexing, discovery and playback of WARC files

4) End with an open discussion of SolrWayback configuration and features.


Target audience:

Researchers with medium knowledge of web archiving and tools for exploring web archives. Basic technical knowledge of starting a program from the command line is required; the SolrWayback bundle is designed for easy deployment.


SolrWayback 4 (https://github.com/netarchivesuite/solrwayback) is a major rewrite with a strong focus on improving usability. It provides real-time full-text search, discovery, statistics extraction & visualisation, data export and playback of web archive material. SolrWayback uses Solr (https://solr.apache.org/) as the underlying search engine. The index is populated using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery). The full stack is open source and freely available. A live demo is available at http://webadmin.oszk.hu/solrwayback/.
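As a minimal illustration of the stack described above, a client can reach the underlying Solr index over plain HTTP using Solr's standard /select API; the collection name ("netarchive") and the query below are hypothetical, not SolrWayback's actual schema or endpoints.

```python
from urllib.parse import urlencode

def solr_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr /select query URL (standard Solr HTTP API).

    The collection name and parameters are illustrative only.
    """
    params = urlencode({
        "q": query,    # full-text query over the indexed archive
        "rows": rows,  # number of results to return
        "wt": "json",  # ask Solr for a JSON response
    })
    return f"{base_url}/select?{params}"

# Hypothetical local Solr collection holding the web archive index:
url = solr_select_url("http://localhost:8983/solr/netarchive", "climate")
```

A tool like SolrWayback layers discovery, statistics and playback on top of exactly this kind of query interface.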

During the conference there will be focused support for SolrWayback in a dedicated Slack channel by Thomas Egense and Toke Eskildsen.

Web archive discovery systems and scaling

Thomas Egense, Toke Eskildsen

The Royal Danish Library

Tuesday, 8 June, 16:00-17:30 CEST


This workshop will

1) Present challenges for building and maintaining a web archive scale discovery system

2) Explain concrete strategies for running Solr at different scales (the same strategies should work for Elasticsearch)

3) Provide a forum for sharing experiences and problems with the scale of web archives. Bring your own challenges and we will solve them together!


Target audience:

  • An interest in the scaling of web archive discovery systems


The Royal Danish Library has been providing full text search and discovery for the Danish Netarchive for several years, lately using SolrWayback. The archive contains 33 billion records, which are all indexed and available online. Solr is used as the underlying search engine, and scaling has been both a design criterion and an ongoing challenge.

Indexing (using Web Archive Discovery (https://github.com/ukwa/webarchive-discovery)) and searching (using Solr (https://solr.apache.org/)) each have their own issues, which can easily compound into larger problems: a setup that works well at a certain size is no guarantee of a working system at 10× that size.
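The point about compounding scale can be made concrete with back-of-envelope arithmetic; the per-shard capacity used below is a hypothetical planning number for illustration, not the Netarchive's actual configuration.

```python
def shards_needed(total_docs: int, docs_per_shard: int) -> int:
    """Ceiling division: how many index shards are needed
    to hold total_docs at a given per-shard capacity."""
    return -(-total_docs // docs_per_shard)

# 33 billion records (the figure from the abstract) with a
# hypothetical capacity of 300 million documents per Solr shard:
print(shards_needed(33_000_000_000, 300_000_000))  # → 110
```

At that scale, even small per-shard inefficiencies in indexing or query fan-out are multiplied a hundredfold, which is why a setup validated at one size cannot simply be assumed to work at ten times that size.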


Ten years of archiving the Olympic movement

Helena Byrne

British Library

It has been ten years since the International Internet Preservation Consortium (IIPC) first started to develop collaborative collections. In summer 2020 the IIPC Content Development Group (CDG) will be collecting nominations for the Tokyo 2020 Summer Olympic and Paralympic Games collection. The success of any collection relies on the IIPC members from around the world contributing content. Since Rio 2016 the CDG has also started to take in nominations from non-members through an online web form. Details on how to get involved will be coming out soon.

The first collaborative collection was on the Winter Olympic Games held in Vancouver, Canada in February 2010. This first collection is very small when you compare it to recent collections: a total of 308 URLs were nominated during this collection period. In contrast, there were over 1,000 URLs nominated for the Winter Olympic Games held in PyeongChang, South Korea. The structure of collaborative collections within the IIPC has changed over time. The early Olympic/Paralympic collections from 2010 and 2012 were overseen by the IIPC Access Working Group. It wasn’t until late 2014 that the IIPC Content Development Group (CDG) was formed. Since 2015 the CDG has subscribed to Archive-It to do its crawling and make all the collections publicly available.

This poster will reflect on what has been previously collected in the IIPC Olympic/Paralympic collections and possible research uses for these collections, as well as outlining what researchers can do with these collections using the Archives Unleashed Cloud.

Starting Small but Dreaming Big - A beginner's journey to capture a local event through web archiving activities

Yoo Young Lee, Marina Bokovay

University of Ottawa Library

Nobody can deny the importance of web archiving to preserve history, yet few academic libraries or small institutions have initiated web archiving projects. Due to the intensity and volume of web data, national libraries or large and well-funded institutions have typically been tasked with capturing web content on a broad scale. What happens, however, with content focused on hyper-local events which could easily be disregarded by an automatic and systematic harvesting approach? This presentation will share strategies to select, appraise, curate, and preserve social media (e.g., Twitter and Reddit), online news articles, and websites around local live events without a dedicated team for web archiving projects.

This presentation will describe a small-scale web archiving initiative in collaboration between one librarian and one archivist to preserve a local incident that happened at the University of Ottawa in the summer of 2019. In June, a student was arrested on campus based on what was found to be racial profiling. The event started a broader discussion about how the University addresses racism and what it means to be “Black on Campus.” The event itself and the follow-up discussion were “broadcast” over Twitter, which proved to be an important communication channel for the community. In thinking about preserving the conversation and the resulting actions, it was critical to record and document the tweets before they disappeared. In this presentation, we will discuss our intention to preserve such an incident, various open-source tools applied to archive different digital media (Twitter, web pages, etc.), and methods for selection and appraisal processes. In addition, we will talk about ethical concerns and how to expand a collection’s scope to handle an unfolding live incident. We will also discuss the challenges posed by having multilingual content (English and French) and multimedia (video) and how we overcame them. Without dedicated staff and resources, it is challenging to carry out web archiving even at a small scale.

We will provide the perspective of beginners who have just started web archiving initiatives and are building a web archive collection at their institution. This includes our lessons learned, methods of selecting tools based on content types, success and failures.

Using Browser Developer Tools in QA: A Novice’s Toolkit

Rebekah Glendinning

University of Toronto

Quality assurance continues to be essential to successful web archiving, yet it is often left out of conversations about innovation. Although progress has been made in automating select aspects of the QA process, manual QA is still commonly used to assess the appearance and navigability of a capture.

As a new practitioner with limited experience in the background mechanics of web pages, I find that QA work can be daunting when dealing with errors beyond missing URLs and scoping adjustments. After prompting by my supervisor and reading a blog post by Jillian Lohndorf in the Archive-It community space, I began to investigate and use browser developer tools in my QA work regularly. Browser developer tools offer an accessible entry point for non-developers to locate and understand the errors they are encountering in their captures and are under-utilized by practitioners.

My poster presentation will map common rendering and navigation errors that can occur in captures, and how to easily find evidence of them by using the developer tools feature built into most browsers. The poster will direct practitioners where to look within the developer tools console for each error, and where to go from there. Overall, the poster will offer a toolkit that practitioners can utilize in their QA work going forward.

Off-Topic Memento Toolkit to Identify Topical Outliers in Web Archive Collections

Shawn Jones1, Martin Klein1, Michael Nelson2, Michele Weigle2

1 Los Alamos National Laboratory, 2 Old Dominion University

Topical web archive collections often contain multiple copies of individual web pages created by web crawlers at different times. These archived versions of web pages (Mementos) allow researchers to study the evolving nature of news events or changes in an organization's web presence. Unfortunately, web pages go off-topic for a variety of reasons: technical problems, hackers defacing pages, new ownership, or new content replacing the content that was initially on-topic. Conventional web crawlers commonly do not have the ability to automatically detect such pages and therefore the off-topic content is crawled and ingested into the collection.

Since researchers and archivists are most often interested in the on-topic content of these collections, identifying the off-topic Mementos is a crucial first step before further analysis. For that reason, we created the Off-Topic Memento Toolkit (OTMT), which identifies (but does not delete) potentially off-topic Mementos. We assume the first Memento to be on-topic, thus the OTMT compares the first Memento of a URL (earliest version crawled) against each subsequent Memento of the same URL. This comparison is carried out using similarity measures selected by the OTMT user. We provide many different measures so that users can customize the results to fit the needs of their analysis and the traits of their collection. The similarity measures OTMT currently supports are: byte count, word count, Jaccard distance, Sørensen-Dice distance, Simhash of document content, Simhash of term frequencies, cosine similarity informed by TF-IDF, and cosine similarity informed by LSI topic modeling. We assessed the OTMT against a gold standard data set and found that word count works best to identify off-topic Mementos. Our poster will highlight the motivation for the toolkit and summarize the results of our analysis of the similarity measures.
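Several of the supported measures are simple set or vector comparisons. As an illustrative sketch (not OTMT's actual implementation), Jaccard distance over the word sets of two mementos can be computed like this:

```python
def jaccard_distance(text_a: str, text_b: str) -> float:
    """Jaccard distance between the word sets of two documents:
    1 - |A ∩ B| / |A ∪ B|. A higher value means less overlap,
    suggesting a later memento may have drifted off-topic."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0  # two empty documents are trivially identical
    return 1.0 - len(a & b) / len(a | b)

# Compare the first (assumed on-topic) memento against a later capture:
d = jaccard_distance("hurricane relief donation updates", "this domain is for sale")
```

A memento whose distance from the first capture exceeds a user-chosen threshold would then be flagged (not deleted) for review, mirroring the toolkit's identify-only design.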

Collaborate to Capture: New Approaches to Social Media Archiving at Library and Archives Canada

Angela Beking, Russell White

Library and Archives Canada / Bibliothèque et Archives Canada

This poster will delve into the three methods by which Library and Archives Canada (LAC) acquires social media data. Since 2015, LAC has been collecting Twitter hashtag data via the Twitter API on subjects of national importance, such as federal elections and Canada’s efforts at the Olympics. LAC has also begun web crawling social media accounts for federal political parties and selected Members of Parliament. In addition to these methods, LAC has developed a transfer methodology to diversify its social media collections through direct collaboration with producers. Through proactive engagement and targeted advice, LAC has begun to acquire social media data exported from individual accounts in addition to its traditional acquisition approaches. Each of the three approaches offers benefits and drawbacks, which will be explored through an analysis of technical considerations and resource cost (both infrastructure and human) of each approach. It will be suggested that by using all three methodologies simultaneously, LAC has been able to enhance its archive of social media through a diversity of content that would be impossible to achieve through traditional methods alone. New content types, such as Facebook Insights data, can only be acquired through collaboration with producers, and such content has tremendous research potential. This poster will be of interest to anyone looking to enhance their social media archives through diverse methods of acquisition.

The WebMedia browsing solution

Haykel Boukadida, Jerome Thievre and Thomas Drugeon

Institut National de l'Audiovisuel (INA)

As part of its legal deposit mission, the National Audiovisual Institute (INA) has crawled, preserved and made available to researchers almost 100 billion web resources from the French audiovisual web, representing 8 PB of data over the last 10 years. These data comprise websites and social network publications, as well as videos from online platforms. The guarantee of authenticity and integrity required by legal deposit has important implications for the requirements of the access and browsing solution.

There are two known technical ways of browsing an archived web collection: a server approach relying on link rewriting, and a proxy approach where links are left untouched and the network serves as a redirection layer. Like many other web archives, we decided on the proxy approach to optimize browsing quality and avoid online leaks. In this context, we have long relied on a generic Firefox browser customized with a proprietary add-on ensuring date selection and session identification. The past years have seen a rising prevalence of HTTPS content, while web browsers have in the meantime increased security constraints and deprecated legacy features and compatibility.
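The contrast between the two approaches can be sketched in a few lines of illustrative code; the archive URL scheme and function names here are hypothetical, not INA's actual service.

```python
from urllib.parse import quote

def rewrite_link(archive_base: str, timestamp: str, original_url: str) -> str:
    """Server approach: embed the original URL inside an archive URL,
    which means every link inside archived pages must be rewritten."""
    return f"{archive_base}/{timestamp}/{quote(original_url, safe='')}"

def proxy_lookup(original_url: str, timestamp: str) -> tuple:
    """Proxy approach: page links are left untouched; the proxy
    intercepts each request and pairs it with the session's date
    before resolving it against the archive."""
    return (original_url, timestamp)

rewritten = rewrite_link("https://archive.example", "20210607", "http://ina.fr/")
```

The proxy variant keeps archived pages byte-identical, which matters for the authenticity guarantees of legal deposit, but it requires intercepting HTTPS traffic, hence the MitM proxy described below.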

This situation made maintaining such a generic browser-based archive solution more and more difficult, ultimately cornering us into developing WebMedia, our own archive browsing solution.

Our WebMedia solution sits in front of our archive server and consists of a MitM (man-in-the-middle) proxy for HTTPS handling and a completely custom Electron-based desktop web browser.

Electron is a framework, developed by GitHub, for creating desktop graphical user interfaces. It is compatible with Mac, Windows and Linux. Its internal architecture is based on Node.js (a JavaScript runtime, generally used to develop server-side applications) and Chromium (the open-source browser engine developed by Google). Electron is used in the development of several prevalent applications, such as GitHub Desktop, Slack, Visual Studio Code, WordPress, WhatsApp, and the Twitch desktop app.

Developing our own web browser gives us total control over the features and options we want to offer, such as plugins and security configuration (Flash, HTTPS certificates, legacy HTML practices, etc.), user authentication and identification, contextual information (embedded video and tweet detection, date and version handling, etc.), and links to our archive applications.

We feel that WebMedia, albeit admittedly requiring a significant investment in development and maintenance, is a satisfactory solution giving us control over the evolution of browsing conventions in the future.

Website Archiving at the Legislative Assembly of Ontario: overview, challenges, partnerships

Sandra Craig

Legislative Assembly of Ontario

Building Web Archives

This poster session / lightning talk will provide an overview of website archiving activity at the Legislative Assembly of Ontario and demonstrate how developing partnerships for building web archives contributes to successful outcomes especially for large projects such as the Ontario provincial election in 2018.

• Election campaign material is an important and unique part of our collection, and we have print material dating back to the late 1800s
• We routinely collect campaign literature during elections, usually acquired through donations from staff, which has resulted in an incomplete collection and only partial coverage of electoral districts
• Began archiving political party websites and candidates’ websites for general elections and by-elections in 2007 to provide more comprehensive coverage of elections and better access to them

Software used
• Primarily used Adobe Acrobat to capture the sites, but we also used HTTrack, MetaProducts Offline Explorer Pro and WebCan, which was developed by Library and Archives Canada but is no longer maintained

• Access to the captured websites is available from our online catalogue
• MARC records created for each website with a link to the captured content

• With more parties and candidates having a web presence it became too difficult for our small team to continue to capture all the content effectively
• Partnered with University of Toronto Libraries (UTL) who were also capturing this content
• UTL has the technical expertise in web archiving and uses Archive-It software
• The Legislative Assembly provided assistance with seed list development, metadata and support for quality assurance
• Result is a comprehensive collection of campaign websites for the Ontario Provincial Election 2018
• https://archive-it.org/collections/10004

Capturing social movements: Web archiving needs of activist collections in Yorkshire

Bethany Aylward

University of Sheffield

Protest culture in the twenty-first century has embraced the use of digital media to build movements, disseminate ideas, and coordinate action. Although digital activism has by no means replaced traditional modes of protest, or their documentation, there are fears that without sustainable approaches to web archiving, activist-archives will lose valuable pieces of their movements’ narratives.

My research draws on essays by Fair, Ziegler, Moran and Teetaert collected in Melissa Morrone’s ‘Informed Agitation: Library and Information Skills in Social Movements and Beyond’ (2014) on the radical archive and library phenomenon, which – in the anglophone world – is clustered in the US. This collection sheds light on the ways that activist groups are harnessing the power of the archive for social change; from reclaiming histories, to strengthening solidarities, to informing future activism. Their experiences inspired me to investigate the activist-archival landscape in Yorkshire and work with local groups to elevate and safeguard their narratives in our increasingly digital society.

I am working with three activist-archives: a feminist archive celebrating the lives of local women; and two anarchist libraries that are microcosms of the future they are fighting for. Together we are exploring the web archiving potential of their archives, as well as the barriers they face as people-powered organisations operating on a shoe-string budget in the midst of a global pandemic.