Sylvain Bélanger, Nick Ruest, Ian Milligan & Anna Perricci (via video conference): Digital preservation strategies, the Archives Unleashed Cloud Project and Webrecorder

Sylvain Bélanger, Library and Archives Canada
Nick Ruest, York University
Ian Milligan, University of Waterloo
Anna Perricci, Rhizome

Sustainability Panel: Preservation of Digital Collections, Webrecorder and Archives Unleashed Cloud Project

One of the major issues facing the web archiving community is that while systems exist to acquire, analyse and preserve web archive content, they require a considerable level of resource to deploy, use and maintain. This panel will discuss the problems of long-term sustainability in the web archiving ecosystem, focussing on issues such as capacity for sustainable digital preservation, technical infrastructure development, tools development and project resilience. The panel will consider that reaching sustainability requires a combination of organisational, financial and technical effort, and will share examples of how this has been achieved within the panellists' own organisations and projects.

Sylvain Bélanger: Preservation of Digital Collections, from obsolescence to sustainability

This presentation will delve into the issues Library and Archives Canada (LAC) faces, as a national library and archives, in tackling obsolete formats and in applying digital preservation principles while living within our means. It focuses on what LAC has been doing to address the petabytes of digital collections it ingests annually, including dozens of terabytes of web content, through the lenses of developing a sustainable digital preservation program and advancing its technical infrastructure.

Ever wonder what happens to digital collections once Library and Archives Canada (LAC) receives them from publishers, universities, archival donors and government institutions? Physical collections are stored in a vault, in a storage container, in specialized housing or simply on a shelf. With digital collections, it is not that straightforward, and in years past, it was tortuous.

Traditionally, over many hours of manual interaction, IT specialists in the Digital Preservation team, along with library and archival staff, would extract data bit by bit from carriers. Then they would face the daunting task of migrating data from archaic formats to modern, readable and accessible ones for client access and long-term preservation.

LAC developed what we called a Trusted Digital Repository in the late 2000s, which involved continued manual interaction with our collections but little in the way of automation or simplification.

In the early 2010s, the Digital Preservation unit was a fledgling team, barely visible and even less resourced. There were multiple internal and external pressures on LAC to increase its digital preservation capacity. In particular, an accelerating volume of digital materials needed to be preserved for the long term. The Auditor General of Canada issued a report in 2014 raising questions about the readiness of LAC to handle digital records as the format of choice by 2017. It stated that LAC “must articulate these plans in its vision, mission, and objectives. It must put in place strategies, policies, and procedures that will allow the transfer and preservation of digital information so that it is accessible to current and future generations.” The audit report noted: “An electronic archival system, such as a trusted digital repository, could help [LAC] acquire, preserve, and facilitate access to its digital collection.”

Although the overarching institutional goal for a trusted digital repository stayed constant throughout this decade, changing institutional priorities and the focus on technology and short-term projects stimulated a re-examination of what was needed to install digital preservation as a core and enduring business component.

The audit report was a call to action in dealing with our digital content, and it pushed LAC to attempt, for the umpteenth time, to tackle the problem head on. A team of stakeholders provided input and feedback into what would become a call-out to industry for a digital asset management solution that could support LAC’s requirements. Industry and partner consultations were held over many months and helped shape LAC’s request for proposals that finally went out in late summer 2017.

In summer 2018, LAC acquired digital asset management technology, along with associated technologies, to allow us to implement a solution (for pre-ingest, ingest and preservation processes) for collections coming to LAC in digital format. This means not only no longer receiving hard drives and other technology carriers, but also a wholesale modernization of our digital work.

We have finally reached the starting point!

What this really means is that we are still in the early stages of implementing a viable solution. Teams from the Digital Operations and Preservation, Published Heritage, and Chief Information Officer branches have been working on the first series of collections to process from clients, through to preservation and future access. From using specialized managed file-transfer software for pre-ingesting the metadata and assets, to testing the preservation capabilities of Preservica, everything is being reviewed with the aim of transforming how we manage our digital operations. Because we are testing published workflows first, staff within Published Heritage who are dedicated to this work full-time are working hand in hand with preservation and IT specialists to ensure an effective testing approach and seamless processes.

For LAC, the implementation of a digital asset management system means being at the forefront of digital acquisition and preservation. Many partners, both nationally and internationally, are keen to understand the approach we have taken over the past four years, and how we are integrating various technologies to implement our long-term digital vision for both published and archival collections.

Even more important is what a digital asset management system may provide to Canadians in the long term: digital collections that are preserved and accessible to them when and where they want them.

This is but one step in LAC’s digital transformation.

Nick Ruest & Ian Milligan: Project Sustainability and Research Platforms: The Archives Unleashed Cloud Project

The Archives Unleashed Project, founded in 2017 with funding from the Andrew W. Mellon Foundation, aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. We respond to one of the major issues facing web archiving research: that while tools exist to work with WARC files and to enable computational analysis, they require a considerable level of technical knowledge to deploy, use, and maintain.

Our project uses the Archives Unleashed Toolkit, an open-source platform for analyzing web archives (https://github.com/archivesunleashed/aut). Due to space constraints we do not discuss the Toolkit at length in this abstract. While the Toolkit can analyze ARC and WARC files at scale, it requires knowledge of the command line and a developer environment. We recognize that this level of technical expertise is beyond the level of the average humanities or social sciences researcher, and our approaches discussed in this paper concern themselves with making these underlying technical infrastructures accessible.

This presentation expands upon the Archives Unleashed Cloud, building upon presentations of earlier work at the IIPC meeting in Wellington. The aim is partly to introduce the Cloud to researchers, but the main focus is on stimulating a conversation around where the work of the researcher begins and the work of the research platform ends. The presentation also discusses the problem of long-term project sustainability. Researchers want services such as the Cloud, but how do we provide this service to them in a cost-effective manner? This targeted discussion will speak not only to our project, but also to broader issues across the web archiving ecosystem.

As we develop the working version of the Archives Unleashed Cloud, one of the main concerns of the project team is the future of the Cloud after Mellon funding ends in 2020. While we are currently exploring whether the Cloud makes sense as a stand-alone non-profit corporation, we are still unsure about the future direction. How do services like this, which meet demonstrated needs, survive in the long run? Our presentation discusses our current strategies but hopes to engage the audience around the state of the field and how best to reach web archiving practitioners.

Projects and services like Webrecorder.io and Archive-It have made amazing strides in the world of web archive crawling and capture. The Archives Unleashed Cloud seeks to make web archive analysis similarly easy and straightforward. Yet the scale of web archival data makes this less straightforward.

Anna Perricci: No one said this would be easy: sustaining Webrecorder as a robust web archiving tool set for all.

Sustaining projects both organizationally and financially is hard, especially in complex, fast-moving areas like web archiving. This presentation will give an overview of the steps the Webrecorder team has taken to achieve sustainability on both fronts.

Webrecorder is a project of Rhizome, which is an affiliate of the New Museum in New York City. Rhizome champions born-digital art and culture through commissions, exhibitions, digital preservation, and software development. Webrecorder (webrecorder.io) is a free, easy-to-use, browser-based web archiving tool set for building, maintaining and giving access to web archives. The development of Webrecorder has been generously supported by the Andrew W. Mellon Foundation since 2016 and by the Knight Foundation (2016-2018). In addition to offering a free hosted web archiving platform, Rhizome creates customizations of our Python Wayback (pywb) tool set for other web archives. Pywb is in use in some major web archiving programs, including the UK Web Archive (British Library), the Portuguese Web Archive (Arquivo.pt) and Perma.cc. The Webrecorder team also maintains other open source software projects, such as Webrecorder Player (https://github.com/webrecorder/webrecorder-player), and command line utilities such as warcit (https://github.com/webrecorder/warcit).

Strategic planning for Webrecorder began in 2017, and further steps to build a business plan grew from that point. This presentation will give an overview of the issues explored and the conclusions reached so far. These points will illuminate why Webrecorder has made certain choices and where we anticipate Webrecorder will go next.

It would be an honor to share the work we have done so far at the IIPC WAC 2019. Sharing our findings to date and explaining the decisions they helped us make might also be useful to others who need to figure out how to break down big problems into more manageable units. No one said reaching sustainability would be easy, and it has not been, but the Webrecorder team has made substantial progress, so we would like to share what has been learned with all conference attendees.

Panel: Planning for sustainability