The conference programme comprises over 55 papers, keynotes, plenary talks, panels, posters with accompanying lightning talks, workshops and tutorials. Sessions include: new crawling approaches, institutional program histories, archiving communities and dissent, national collaboration, thematic collecting initiatives, operational workflows, text mining and legal considerations. Please see the full schedule for details.
PAPERS
- SUSANNA JOE: From e-publications librarian to web archivist: a librarian’s perspective on 10 years of web archiving at the National Library of New Zealand [SLIDES]
- ANDREW JACKSON: Continuous, incremental, scalable, higher-quality web crawls with Heritrix [SLIDES]
- MAXINE FISHER: Web archiving Australia’s Sunshine State: from vision to reality [SLIDES]
- HOWARD BESSER: Archiving websites containing streaming media: the Music Composer Project [SLIDES]
- TIIU DANIEL: Becoming a web archivist: my 10-year journey in the National Library of Estonia [SLIDES]
- MARTIN KLEIN, LYUDMILA BALAKIREVA & HERBERT VAN DE SOMPEL: Building event collections from crawling web archives [SLIDES]
- ED SUMMERS & BERGIS JULES: Social media and archival praxis [SLIDES]
- JEFFERSON BAILEY: Nation wide webs [SLIDES]
- MARIA PRAETZELLIS & MAKIBA J. FOSTER: Community Webs: empowering public libraries to create community history web archives [SLIDES – MF] & [SLIDES – JB]
- FERNANDO MELO & DANIEL GOMES: Cultivating open-access through innovative services [SLIDES]
- PETER JETNIKOFF: Curating dissent at the State Library of Victoria [SLIDES]
- JASON WEBBER: Creating a new user interface for the UK Web Archive [SLIDES]
- IAN MILLIGAN: Opening up WARCs: the Archives Unleashed Cloud and Toolkit projects [SLIDES]
- ABBIE GROTKE & MARK PHILLIPS: The End of Term Archive: collaboratively preserving the United States government web [SLIDES]
- CATHERINE NICOLE COLEMAN: A digital preservation paradigm shift for academic publishers and libraries [SLIDES]
- COREY DAVIS, NICH WORBY & JEREMY HEIL: Addressing our many solitudes: building a web archives community of practice in Canada [SLIDES]
- MICHAEL PARRY, MAX SULLIVAN & STUART YEATES: Utilising the Internet Archive while retiring legacy websites and establishing a digital preservation system
- ARNOUD GOOS: A National mini-IIPC: setting up collaboration in web archiving in The Netherlands [SLIDES]
- ANDREJ BIZÍK, PETER HAUSLEITNER & JANA MATÚŠKOVÁ: Digital resources – the national project of webharvesting and webarchiving in Slovakia [SLIDES]
- ILYA KREYMER: Pywb 2.0: Technical overview and Q&A, or everything you wanted to know about high-fidelity web archiving but were afraid to ask [SLIDES]
- NICOLA BINGHAM: Preserving the public record vs the ‘right to be forgotten’: policies for dealing with notice & takedown requests [SLIDES]
- KATHRYN STINE, STEPHEN ABRAMS & PETER BROADWELL: Cobweb: collaborative collection development for web archives [SLIDES]
- SAMANTHA ABRAMS: One year down: taking the Ivy Plus Libraries web resources collection program from pilot to permanent [SLIDES]
- SUZI SZABO: Web archiving guide for governmental agencies: how to ensure sustainable accessibility of Dutch governmental websites [SLIDES]
- CHINDARAT BERPAN, WACHIRAPORN KLUNGTHANABOON & SITTISAK RUNGCHAROENSUKSRI: Archiving web content in anthropology: lessons learned for a step in the right direction [SLIDES]
- MARTIN KLEIN, LYUDMILA BALAKIREVA, HARIHAR SHANKAR, JAMES POWELL & HERBERT VAN DE SOMPEL: Smart routing of Memento requests [SLIDES]
- KIT CONDILL: Peeling back the onion (domes) in the North Caucasus: multi-layered obstacles to effective web-based research on marginalized ethnic groups [SLIDES]
- SARA ELSHOBAKY & YOUSSEF ELDAKAR: A workflow for indexing and searching large-scale web archive content using limited resources [SLIDES]
- JOÃO GOMES: Arquivo.pt: taking a web archive to the next level [SLIDES]
- THIB GUICHERD-CALLIN: Sifting needles out of (well-formed) haystacks: using LOCKSS plugins for web archive metadata extraction [SLIDES]
- ABBIE GROTKE & GRACE THOMAS: Expansion and exploration in 2018: processing the Library of Congress Web Archive [SLIDES]
- JEFFERSON BAILEY: Your web archives are your everything archives [SLIDES]
- MARIA RYAN: Working on a Dream: The National Library of Ireland’s Web Archive [SLIDES]
- KEES TESZELSZKY: How to harvest born digital conspiracy theories: webarchiving Dutch digital culture in the Post-truth era [SLIDES]
- GRACE THOMAS & TREVOR OWENS: What can tiny, transparent GIFs from the 1990s teach us about the future of access and use of web archives? [SLIDES]
- EMMANUEL CARTIER, PETER STIRLING & SARA AUBRY: Néonaute: mining web archives for linguistic analysis [SLIDES]
- RUSSELL LATHAM: “Who by fire”: lifespans of websites from a web archive perspective [SLIDES]
- THOMAS EGENSE & ANDERS KLINDT MYRVOLL: Demo of the SolrWayback search interface, tools and playback engine for WARCs [SLIDES]
PANELS
- YAN LONG, REGAN MURPHY KAO, NICHOLAS TAYLOR & ZHAOHUI XUE: Collaborative, selective, contemporary: lessons and outcomes from new web archiving forays focused on China and Japan [SLIDES – YL & ZX, SLIDES – NT; SLIDES – YL&RMK]
- JÉRÔME THIÈVRE, GÉRALDINE CAMILLE & GILLIAN LEE: Bottling the firehose: preserving Twitter content [SLIDES – GC/PS, SLIDES – JT & SLIDES – GL]
- JASMINE MULLIKEN, ANNA PERRICCI, SUMITRA DUNCAN & NICOLE COLEMAN: Capturing complex websites and publications with Webrecorder [SLIDES]
- AMY JOSEPH, NICOLA BINGHAM, PETER STIRLING, KRISTINN SIGURÐSSON & MARIA RYAN: Legal deposit in an era of transnational content and global tech titans
POSTERS WITH LIGHTNING TALKS
- ALEXIS ANTRACOLI & SUMITRA DUNCAN: It’s there, but can you find it? Usability testing the Archive-It public interface [SLIDES & POSTER]
- MARK BODDINGTON: Legal deposit legislation for online publications: framing the issues [SLIDES]
- KATHRYN STINE, KRIS KASIANOVITZ, JULIE LEFEVREPOST & LUCIA ORLANDO: Crowdsourcing descriptive metadata for web archives: the CA.gov Archive [SLIDES]
- GRACE THOMAS, MARIA PRAETZELLIS, EDWARD MCCAIN & MATTHEW FARRELL: Tracking the evolution of web archiving activity in the United States [POSTER]
- ZHENXIN WU & XIE JING: iPRES 2020 Introduction and Cooperation & Chinese National Digital Preservation Programme for Scientific Literature [POSTER 1 & POSTER 2]
WORKSHOPS & TUTORIALS
- ANDREW N. JACKSON, IAN MILLIGAN & OLGA HOLOWNIA: What can you do with WARCs? [IM – SLIDES]
- ANNA PERRICCI: Human scale web collecting for individuals and institutions (Webrecorder tutorial) [SLIDES]
- BEN O’BRIEN & HANNA KOPPELAAR: The Web Curator Tool relaunch [SLIDES]
- SARA AUBRY: The WARC file format: preparing next steps [SLIDES]
- KATHRYN STINE, STEPHEN ABRAMS & PETER BROADWELL: Using Cobweb to manage collaborative or complementary web archive collecting projects [SLIDES]
- JESSICA MORAN, MATARIKI WILLIAMS, BERGIS JULES, EDWARD SUMMERS, ALEXANDRA DOLAN-MESCAL & FRANCIS KAYIWA: Ethical social media archiving through community collaboration [SLIDES]
PAPERS
Susanna Joe
National Library of New Zealand
From e-publications librarian to web archivist: a librarian’s perspective on 10 years of web archiving at the National Library of New Zealand
The National Library of New Zealand has been building thematic and event-based collections of New Zealand websites in an active selective web archiving programme since 2007. This presentation will give an overview of how the Library's efforts to collect and preserve the nation's digital documentary memory have grown and developed, from the point of view of a librarian taken on in 2005 in one of the Library's inaugural 'E-Publications Librarian' roles, which focused on selecting websites and quality reviewing web harvests. It was a time when legislative changes combined with internal technological milestones in web harvesting and digital preservation propelled the advancement of the Library's web harvesting activities and the growth of the Library's Web Archive collection. Today the Web Archive is a highly curated collection, and web archiving staff still both select and manually quality review all harvests to ensure good quality, comprehensive captures of selected sites.
In recent years the job title has changed to 'Web Archivist', but how have we adapted to tackle the changes and complexities of collecting born-digital content in the modern age? What has – or has not – changed in the Library's curatorial approach, technical developments, and staffing to reflect new capabilities, and how successful have we been? There have been many technical, legal, access and financial challenges and barriers along the way, and we are still grappling with many of these issues, so we will look at what we have learnt and where we are now.
Andrew Jackson
The British Library
Continuous, incremental, scalable, higher-quality web crawls with Heritrix
Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition, as our Heritrix3 setup was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on and harder still to augment the process with automated quality checks. To attempt to address this, we have combined three main tactics: we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix3 itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.
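One way to picture the kind of external automation such a setup relies on, though not necessarily the approach taken at the British Library, is to drive a crawl job's lifecycle through Heritrix 3's REST engine API rather than its web UI. In the sketch below, the host, credentials, job name and certificate handling are placeholders, and the available actions should be checked against the Heritrix 3 REST API documentation for your version.

```python
# Hedged sketch: scripting a Heritrix 3 job lifecycle via the engine's REST API.
# Host, credentials, job name and TLS handling are placeholders (assumptions).
import requests
from requests.auth import HTTPDigestAuth

ENGINE = "https://localhost:8443/engine"   # default Heritrix 3 engine endpoint
AUTH = HTTPDigestAuth("admin", "admin")    # placeholder operator credentials

def job_action(job_name, action):
    """POST a lifecycle action (build, launch, unpause, checkpoint, ...) to a job."""
    return requests.post(
        f"{ENGINE}/job/{job_name}",
        data={"action": action},
        auth=AUTH,
        headers={"Accept": "application/xml"},
        verify=False,  # Heritrix ships with a self-signed certificate by default
    )

# Bring a long-running crawl job up, then checkpoint it so that crawl state
# is captured outside the running process.
for action in ("build", "launch", "unpause"):
    job_action("continuous-crawl", action)
job_action("continuous-crawl", "checkpoint")
```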
Maxine Fisher
State Library of Queensland
Web archiving Australia’s Sunshine State: from vision to reality
Australia’s ‘Sunshine State’ Queensland is known for its warm climate, golden beaches and World Heritage natural assets such as the Great Barrier Reef. However, it is also a state vulnerable to natural disasters, facing economic and infrastructure pressures, and a growing list of environmental concerns. The State Library of Queensland has long been committed to collecting and providing access to resources documenting Queensland’s history, society and culture, and reflecting the Queensland experience. Web archiving has become an increasingly vital aspect of our contemporary collecting. Through participation in Australia’s PANDORA archive, State Library has captured a unique array of Queensland stories and events, as played out on the web, and collaborates with other contributing agencies to capture important Australian web content for current and future generations. This presentation describes how web archiving began at the State Library of Queensland in 2002, and key factors that have enabled growth of our web collecting, with a focus on how State Library of Queensland has collaborated with other PANDORA contributors.
PANDAS – the PANDORA Digital Archiving System – is the vehicle that enabled a national approach to web archiving by facilitating distributed responsibility and collaborative collection building. The experiences, efficiencies and challenges of PANDAS as the day-to-day tool supporting individual user activities will be reflected upon from the viewpoint of a PANDORA curator at the State Library of Queensland. This includes the management of cases where collecting intentions between institutions have overlapped, and how collection building through combined efforts has increased as the archive has matured. In closing, the future of web archiving at State Library is considered in the light of new opportunities and challenges.
Howard Besser
New York University
Archiving websites containing streaming media: the Music Composer Project
Web crawling software is notoriously deficient at capturing streaming media. For the past two years New York University Libraries has been working with the Internet Archive to replace the ubiquitous Heritrix web crawler with one that can better capture streaming audio and video. With funding from the Andrew W. Mellon Foundation they have created a new crawler (Brozzler) and tested it within the context of archiving the websites of contemporary young composers (showing how early-career composers represent themselves with a web presence).
This presentation will examine the deficiencies in current web crawlers for handling streaming media and presenting it in context, and explain how Brozzler addresses those deficiencies by extending existing web archiving tools and services to not only collect audio and video streams, but also to present the results in proper context. It will also explain the project to archive composer websites, touching on everything from contractual arrangements and working relationships with the composers, to tying together various NYU Library tools with the Internet Archive’s Archive-It, to assessing researcher satisfaction with the result. It will also cover the combinations of automated and manual methods for archiving composer websites.
Tiiu Daniel
National Library of Estonia
Becoming a web archivist: my 10-year journey in the National Library of Estonia
This presentation starts with my personal story of becoming a web archivist and working in this area for the last ten years at the National Library of Estonia, reflecting on how the work has changed over time in aspects including curating, harvesting, preserving, describing and giving access to the web content in our institution. I will also mark out major institutional, juridical and other impacts that have influenced my work.
I'll look back on a decade-long journey of ups and downs for the team of web archivists. We are lucky to have curatorial and technical specialists working side by side in one little department, and our roles are often mixed (web curators do harvesting and configure Heritrix when needed). Personally, it has been rather difficult to handle the technical duties as I come from a library background. But just recently, as the person with the longest web archiving experience in my team, I realized that I finally have the big picture and am able to have a say in almost every aspect of our work. It took me nearly ten years to get there!
In the second part of the talk I'll present my personal observations about how the international web archiving field has developed over the years I have been involved in it. Looking back, awareness of the importance of preserving cultural heritage on the web has risen remarkably. But there are still countries that don't have any national web preservation programs, or have only just started. So the longer-established players in the field, including Estonia, have had the opportunity to help them out by sharing their experiences. And even longer-established players have helped us to become better at our work – we owe a lot to the web archiving community, and to the IIPC in particular.
Martin Klein, Lyudmila Balakireva & Herbert Van De Sompel
Los Alamos National Laboratory, Research Library
Building event collections from crawling web archives
Event-centric web collections are frequently built by crawling the live web with services such as Archive-It. This process most commonly starts with human experts such as librarians, archivists, and volunteers nominating seed URIs. The main drawback of this approach is that seed URIs are often collected manually and the notion of their relevance is solely based on human assessment.
Focused web crawling, guided by a set of reference documents that are exemplary of the web resources that should be collected, is an approach that is commonly used to build special-purpose collections. It entails an algorithmic assessment of the relevance of the content of a crawled resource rather than a manual selection of URIs to crawl. For both web crawling and focused web crawling, the time between the occurrence of the event and the start of the crawling process is a concern since stories disappear, links rot, and content drifts.
Web archives around the world routinely collect snapshots of web pages (Mementos) and hence potentially are repositories from which event-specific collections could be gathered some time after the event.
In this presentation, we discuss our framework to build event-specific collections by focused crawling web archives. We build event collections by utilizing the Memento protocol and the associated cross-web-archive infrastructure to crawl Mementos in 22 web archives. We will present our evaluation of the content-wise and temporal relevance of crawled resources and compare the resulting collections with collections created on the basis of live web crawls as well as a manually curated Archive-It crawl. As such, we provide novel contributions showing that:
- Event collections can indeed be created by focused crawling web archives.
- Collections built from the archived web can score better than those based on live web crawls and manually curated collections in terms of relevance of the crawled resources.
- The amount of time passed since the event affects the size of the collection as well as the relevance of collected resources for both the live web and the web archive crawls.
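For readers less familiar with the protocol machinery this approach builds on, the sketch below shows a basic Memento (RFC 7089) TimeGate lookup: the client asks for the capture of a URI closest to a preferred datetime. The aggregator TimeGate URL is an assumption; the TimeGate of any Memento-compliant archive can be substituted.

```python
# Minimal sketch of a Memento (RFC 7089) TimeGate lookup.
# The TimeGate URL pattern below is an assumption; substitute any compliant archive.
import requests

TIMEGATE = "http://timetravel.mementoweb.org/timegate/"  # assumed aggregator TimeGate

def closest_memento(uri, accept_datetime):
    """Return Location and Link headers for the Memento of `uri` closest to `accept_datetime`."""
    resp = requests.get(
        TIMEGATE + uri,
        headers={"Accept-Datetime": accept_datetime},  # e.g. "Thu, 01 Jan 2015 00:00:00 GMT"
        allow_redirects=False,
    )
    # A compliant TimeGate points to the selected Memento in Location and lists
    # related resources (original, timemap, first/last mementos) in Link.
    return resp.headers.get("Location"), resp.headers.get("Link")

location, link = closest_memento("http://www.example.com/", "Thu, 01 Jan 2015 00:00:00 GMT")
print("Closest Memento:", location)
```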
Ed Summers & Bergis Jules
Ed Summers, University of Maryland
Bergis Jules, University of California, Riverside
Social media and archival praxis
Social Media presents both challenges and opportunities for archivists of the web. On the one hand, the sheer volume and velocity of content poses insurmountable challenges to even the world’s largest cultural heritage institutions. On the other, social media content exists along a public-private continuum where the rights of creators and their consent to representation in an archive are difficult, if not impossible, to ascertain.
Both of these problems are fundamentally related to issues of scale, which assume a big data orientation to social media archiving. However, archives have been grappling with the management of records in an age of abundance long before the creation of the web. What can archival practices of appraisal and small data approaches teach us about how we look at the problem of social media archiving, and how we design tools to help us in the work?
The Documenting the Now project has been working for the past two years to build a community of practice and a set of tools to help archivists work with social media content, primarily Twitter. During that time we have come to see social media archiving practices as turning on a question of relationships between archivists and the communities that they are documenting. Indeed, social media present tremendous opportunities for archives to tell the stories of underrepresented communities, instead of the usual narratives of power. In this presentation we will share some of our key findings from this work and how they can inform the design of tools and practices for social media archives going forward.
Jefferson Bailey
Internet Archive
Nation wide webs
This talk will outline efforts to build new nation-specific web archive access portals with enhanced aggregation, discovery, and capture methods. Many national libraries have been conducting web harvests of their ccTLD for years. These collections are often composed solely of materials collected from internally-managed crawling activities and have access endpoints that are highly restricted to reading-room-only viewing. These local-access portals largely adhere to the "known-URL" lookup and replay paradigm of traditional web archive access tools.
Working with partners, and as part of advancing R&D on improving access to web collections, the Internet Archive has been developing new portals to national web domains in concert with the work of national libraries with the mandate to archive their websphere. These collections are "sourced" from a variety of past and scheduled crawling activities — historical collections, specific domain harvests, relevant content from global crawling, in-scope donated and contributed web data, curatorial web collecting, user-submitted URL contributions, and other acquisition methods. In addition, these portals leverage new search tools including full-text search, non-text item (image, audio, etc.) search, linkback from embedded resources, relevant content identified by geoIP matching or PageRank-style scoring, and categorization such as "highly visited" or "no longer on the live web." While giving new life to the discovery and use of ccTLD-specific web access portals, the project is also exploring how new features, functionality, profiling, and enhanced discovery and reporting methods can advance how we think of access to web archives.
Maria Praetzellis & Makiba J. Foster
Maria Praetzellis, Internet Archive
Makiba J. Foster, Schomburg Center for Research in Black Culture, New York Public Library
Community Webs: empowering public libraries to create community history web archives
Many public libraries have active local history collections and have traditionally collected print materials that document their communities. Due to the technical challenges of archiving the web, lack of training and educational opportunities, and lack of an active community of public library-based practitioners, very few public libraries are building web archives. This presentation will review the grant-funded Community Webs program, which is working with 27 public library partners to provide education, training, professional networking, and technical services to enable public libraries to fulfil this vital role.
The Schomburg Center for Research in Black Culture (a Community Webs cohort member) will provide an example of a Community Webs project in action and discuss their innovative project to archive social media hashtagged syllabi related to race and social justice. With continued national dialogue around race and gender, The Schomburg Center is collecting and preserving web-based syllabi focused on race and social justice issues. The recent phenomenon of the syllabi movement, hashtagged on social media with crowdsourced Google Docs and blogged syllabi (e.g. #CharlestonSyllabus, #FergusonSyllabus, #KaepernickSyllabus, #TrumpSyllabus), represents an innovative way to create a more learned society regarding race and social justice. Web-based publishing of syllabi extends the traditional classroom and enables participation for those excluded from formal learning opportunities. The Internet Archive will talk about the development, group activities, and outcomes of the full Community Webs program.
There is great potential to apply the Community Webs educational and network model to other professional groups such as museums, historical societies or other community based groups in order to diversify institutions involved in collecting web content. There is an opportunity for IIPC members and their local constituencies to implement similar programs or play a network or leadership role in expanding the universe of web collecting. We will close the session with a call for partnerships to help bring this model to other IIPC member organizations and continue to grow the field of web archiving.
Fernando Melo & Daniel Gomes
Arquivo.pt at FCCN-FCT
Cultivating open-access through innovative services
Arquivo.pt preserves more than 4 billion files in several languages collected from the web since 1996 and provides a public search service that enables open access to this information. The service provides user interfaces for textual, URL and advanced search. It also provides Application Programming Interfaces (APIs) to enable fast development of value-added applications over the preserved information by third parties.
However, the constant evolution of the web and of society demands constant development of a web archive to follow its pace of evolution and maintain the accessibility of the preserved content. Thus, creating a new mobile version and improving our APIs were mandatory steps to support broad open access to our collections.
In November 2017, Arquivo.pt launched a new mobile version. The main novelty was the adaptation of user interfaces to mobile devices and preservation of the mobile web. In order to achieve responsive and mobile-friendly user interfaces for Arquivo.pt we had to address questions such as:
- Who wants to view archived web pages on such a small device?
- Should we replay archived web pages on mobile inside an iframe or full screen?
- What additional services/functionalities can we add to a mobile version?
- Should we privilege the replay of older archived web pages, or newer responsive ones?
- How can we show an extensive list of archived versions of a given URL on such a small device?
In order to facilitate automatic access to our full-text search capabilities, we decided to release a new text search API in JSON. One can perform automatic queries, from a simple word search to more complex ones such as finding all archived web pages from a given site that contain a text expression in a certain time range.
We would like to present this new API, and show how anyone can easily integrate their work with our preserved information.
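As a hedged illustration of the kind of integration this API is meant to enable, the sketch below issues one such query: all archived pages from a given site containing a text expression within a time range. The endpoint, parameter and response field names are assumptions drawn from Arquivo.pt's public documentation and should be checked against the current API reference.

```python
# Hedged sketch of a query against Arquivo.pt's JSON text-search API.
# Endpoint, parameter and response field names are assumptions.
import requests

API = "https://arquivo.pt/textsearch"  # assumed endpoint

params = {
    "q": "eleições",             # text expression to search for
    "siteSearch": "publico.pt",  # restrict results to one site
    "from": "19960101000000",    # start of time range (YYYYMMDDHHMMSS)
    "to": "20171231235959",      # end of time range
    "maxItems": 50,
}

results = requests.get(API, params=params).json()
for item in results.get("response_items", []):  # assumed response field
    print(item.get("tstamp"), item.get("title"), item.get("linkToArchive"))
```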
Despite the challenges, it is very gratifying to develop an open-access web archive, which can reach a large audience of users and researchers. Arquivo.pt reached 100 000 users for the first time in 2017, and there was a significant increase of research projects using Arquivo.pt as a source of information.
Peter Jetnikoff
State Library of Victoria
Curating dissent at the State Library of Victoria
The State Library of Victoria has been collecting web publications through PANDORA for twenty years. There are numerous themes discernible in this collection that express a timeline of web usage, design and behaviour. This paper will address one in particular: protest.
Material of political dissent and social action in the state of Victoria has been collected by the Library from the nineteenth century onwards. The online material is an extension of that in terms of technology and also a continuance of tradition. But perhaps the most important aspect of this material is that, representing as completely as it can one side of the dispute, it emerges as primary historical source material, aligning it with manuscripts, small press and pamphlets in the greater collection. More than a simple record of the times, the look and feel of the technology, this material is witness to the timeline of dissent as a series of modes along with shifts in content.
This paper will discuss some of the more significant items collected by the Library over the past two decades, such as Residents Against McDonalds, Occupy Melbourne and other protest publishing, as well as dissenting material that appears at election time (with particular attention given to the 1999 Victorian state poll). Collecting this material presents its own peculiar issues and challenges, sometimes involving the Library itself being perceived as partisan, coupled with the ongoing need to convince online publishers that they are, in fact, publishing. The need to secure publisher permission will continue to be an issue, but recent developments within the PANDORA partnership have provided new options. The intersection of political activity and the increasing utility of emerging technology has seen a steady shift from websites to social media, which, in turn, offers new challenges in collecting moments of dissent for permanent curation.
Jason Webber
The British Library
Creating a new user interface for the UK Web Archive
The UK Web Archive (UKWA) started collecting selected websites (with owners’ permission) in 2005. In the subsequent 13 years this ‘Open UK Web Archive’ has grown to approximately 15,000 websites, all of which are available publicly through www.webarchive.org.uk.
In 2013 UK law changed to allow the collection of all websites that can be identified as owned or produced in the UK. Since then the ‘Legal Deposit Web Archive’, through an annual domain crawl, has added millions of websites (and billions of individual items). This collection, however, can only be viewed in the reading rooms of UK Legal Deposit Libraries (seven locations in the UK and the Republic of Ireland).
It is a key aim of the UKWA to be practically useful to researchers and to give the best access possible given the legal restrictions. Up to now this has been a considerable challenge, and in order to attempt an answer to this challenge UKWA have worked for two years on a new user interface.
This talk aims to highlight the challenges of using a large national collection for research and how UKWA have resolved or mitigated these difficulties, including:
The UKWA service has multiple collections (‘Open’ and ‘Legal Deposit’) that offer different content and have different access conditions. How best to communicate these differences to researchers?
The ‘Open’ and ‘Legal Deposit’ collections are viewed through two different interfaces. Can or should there be a single interface for both collections?
The UKWA service has fully indexed both the 'Open' and 'Legal Deposit' collections, which gives enormous potential for researchers to search by keyword or phrase. Any search, however, results in thousands or even millions of returns. Without Google-style relevance ranking, how do researchers find meaningful results?
UKWA has over 100 curated collections on a wide scope of subject areas. How should these collections be highlighted and presented to researchers?
Ian Milligan
University of Waterloo
Opening up WARCs: the Archives Unleashed Cloud and Toolkit projects
Since 2013, our research team has been exploring web archives analytics through the Warcbase project, an open-source platform that we have developed in conjunction with students, librarians, and contributors. Through in-person presentations, workshops, and GitHub issues and tickets, we identified several barriers to scholarly engagement with web archives: the complexity of tools themselves and the complexity of deployment.
Our Archives Unleashed Project, funded by the Andrew W. Mellon Foundation, aims to tackle tool complexity and deployment through two main components, the Archives Unleashed Toolkit and the Archives Unleashed Cloud. This presentation introduces these two projects both through a conceptual introduction, as well as a running in-depth live demo of what the Toolkit and Cloud can do. Our approach presents one model of how institutions can facilitate the scholarly use of ARC and WARC files.
The Archives Unleashed Toolkit is the new, cleaner, and more coherent version of Warcbase. Starting with a clean slate in our redesign, we are adopting Python as the primary analytics language. This offers advantages in that it can reach out to digital humanists and social scientists, and also allow us to tap into a broad ecosystem of Python tools for linguistic analysis, machine learning, visualization, etc. It supports a combination of content-based analysis (i.e. selecting pages with certain keywords or sentiment) and metadata-based analysis (particular date ranges or hyperlinking behaviour).
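As a rough illustration of those two styles of analysis, and not the Toolkit's own API, the sketch below uses the warcio library to filter the records of a WARC file by a date range (metadata-based) and a keyword (content-based); the file name and filter values are invented.

```python
# Illustrative stand-in (using warcio, not the Archives Unleashed Toolkit):
# combine a metadata-based filter (WARC-Date range) with a content-based
# filter (keyword match) over the response records of a WARC file.
from warcio.archiveiterator import ArchiveIterator

KEYWORD = b"election"
START_YEAR, END_YEAR = "2016", "2018"  # crude year-prefix range on WARC-Date

with open("example.warc.gz", "rb") as stream:   # hypothetical input file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date") or ""
        if not (START_YEAR <= date[:4] <= END_YEAR):   # metadata-based filter
            continue
        body = record.content_stream().read()
        if KEYWORD in body.lower():                    # content-based filter
            print(date, uri)
```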
Yet we realized that the command-line based Archives Unleashed Toolkit presents difficulties for many users in that it requires technical knowledge, developer overhead, and a knowledge of how to run and deploy a system.
The Archives Unleashed Cloud thus bridges the gap between easy-to-use curatorial tools and developer-focused analytics platforms like our Toolkit. Archivists can collect their data using GUI interfaces like that of Archive-It. Our vision is that the Archives Unleashed Cloud brings that ease to analytics – taking over where existing collection and curatorial dashboards end.
While the Cloud is an open-source project – anybody can clone, build, and run it on their own laptop, desktop, server, or cluster – we are also developing a canonical version that anybody can use. We will note our sustainability discussions around how we can keep this service viable beyond the life of the project.
Abbie Grotke & Mark Phillips
Abbie Grotke, Library of Congress
Mark Phillips, University of North Texas Libraries
The End of Term Archive: collaboratively preserving the United States government web
In the fall of 2016 a group of IIPC members in the United States organized to preserve a snapshot of the United States federal government web (.gov). This is the third time the End of Term (EOT) project members have come together with the goals of identifying, harvesting, preserving and providing access to a snapshot of the federal government web presence. The project is a way of documenting the changes caused by the transition of elected officials in the executive branch of the government, and provides a broad snapshot of the federal domain once every four years that is ultimately replicated among a number of organizations for long-term preservation.
Presenters from lead institutions on the project will discuss its methods for identifying and selecting in-scope content (including using registries, indices, and crowdsourcing URL nominations through a web application called the URL Nomination Tool), new strategies for capturing web content (including crawling, browser rendering, and social media tools), and preservation data replication between partners using new export APIs and experimental tools developed as part of the IMLS-funded WASAPI project.
The breadth and size of the End of Term Web Archive has informed new models for data-driven access and analysis by researchers. Access models that have included an online portal, research datasets for use in computational analysis, and integration with library discovery layers will be discussed.
Presenters will speak to how the project illuminates the challenges and opportunities of large-scale, distributed, multi-institutional, born-digital collecting and preservation efforts. A core component has also been how the project activities align with participant institutions' collection mandates, as well as with other similar efforts in 2016-2017, such as the Data Refuge project, to preserve government web content and datasets. The EOT, along with related projects, has raised awareness of the importance of archiving historically valuable but highly ephemeral web content without a clear steward, resulting in a dramatic increase in awareness of the importance of web archiving during times of government transition.
Catherine Nicole Coleman
Stanford University
A digital preservation paradigm shift for academic publishers and libraries
The Stanford University Press and Stanford University Libraries are engaged in a grant-funded partnership to pave the way for publishing book-length peer reviewed online academic projects that we are calling interactive scholarly works. University presses and libraries have well established protocols and processes for print publication, many of which are rooted in our assumptions about the durability and longevity of the printed word. With the advent of electronic books, we have had to find ways to preserve not bound paper, but the bits. Now interactive scholarly works present an entirely new set of challenges for preservation because the scholarship is embedded in the digital form. It is not possible to have a print version of the original to fall back to since the online interactive presentation of the work—its unique format—is an essential part of the argument.
University presses and university libraries are close collaborators in scholarly publishing. Libraries acquire the books, then provide access, discovery, preservation and conservation. New processes and workflows, such as web archiving, are required if we are to provide those same library services for these new interactive scholarly works. Since Stanford Libraries has an existing web archiving service, we had hoped that our solution would start with a web-archived version of each project as the published output. But what about the many projects that resist web archiving? A productive tension arose between nudging authors to produce works that fit our current preservation strategies and giving authors the freedom to produce innovative works that require new preservation strategies.
We do not yet know how researchers will want to explore interactive scholarly works five, ten, fifty, or a hundred years from now. If we follow the example of print, we can assume that some will be interested in the original format while others will be interested in the underlying code; some will be interested in the author's argument and intent, but see the vehicle of expression as outdated and irrelevant. In anticipation of this uncertain future, we are exploring an approach to publication that anticipates multi-modal access and preservation strategies, with attention to the perceptual and conceptual aspects of the work as well as the constituent content elements. This paper will present perspectives from authors, the publisher, and the library (including the web archiving team, digital forensics, and operations) that have driven our design of a preservation strategy for these innovative works. The paper will also address the conflicting assumptions about what we are preserving and why.
Corey Davis, Nich Worby & Jeremy Heil
Corey Davis, Council of Prairie and Pacific University Libraries (COPPUL)
Nich Worby, University of Toronto
Jeremy Heil, Queen’s University
Addressing our many solitudes: building a web archives community of practice in Canada
Canada is a large country with many stakeholders involved in web archiving, from city archives to national libraries. Until 2017, most of these efforts took place in relative isolation, which resulted in needless duplication of efforts and significant collection gaps. This session will provide an overview of the establishment of the Canadian Web Archiving Coalition (CWAC), a national effort to formalize collaboration and coordination for web archiving across the country.
Under the auspices of the Canadian Association of Research Libraries Digital Preservation Working Group and Advancing Research Committee (CARL DPWG and CARL ARC), the Canadian Web Archiving Coalition (CWAC) was established in 2017 to develop an inclusive community of practice within Canadian libraries, archives, and other memory institutions engaged or otherwise interested in web archiving, in an effort to identify gaps and opportunities best addressed by nationally coordinated strategies, actions, and services, including collaborative collection development, training, infrastructure development, and support for practitioners.
This session will provide an overview of the CWAC in an effort to help our international colleagues understand and connect with web archiving efforts in Canada, but also to serve as an example for other jurisdictions attempting to develop an effective national community of practice and coordinating mechanism where before there was only haphazard and informal collaboration and coordination.
Michael Parry, Max Sullivan & Stuart Yeates
Victoria University of Wellington Library
Utilising the Internet Archive while retiring legacy websites and establishing a digital preservation system
In 2017 Victoria University of Wellington Library implemented Wairētō, an installation of the Rosetta Digital Preservation System from Ex Libris. Two of the core collections to be migrated into the new system are The New Zealand Electronic Text Collection and the ResearchArchive. Both of these collections have legacy websites that need to be decommissioned.
In this presentation we will discuss using the Internet Archive and Wairētō to archive these websites ensuring ongoing access and long term preservation. The process has four stages, with the first completed and the second and third to be complete before the end of 2018. The fourth is planned for 2019.
First stage: subscribe to the Internet Archive and ensure the websites are archived by creating collections and crawling each site.
Second stage: download the (W)ARC file for each site from the Internet Archive into Wairētō.
Third stage: ensure that links to the original sites redirect either to the new equivalent within Wairētō or to the Internet Archive (a sketch of this step follows the list below).
Fourth stage: integrate the open source Wayback Machine from the Internet Archive into the Wairētō infrastructure.
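As a hedged illustration of the third stage, the sketch below resolves a legacy URL either to a hypothetical new location in Wairētō or, failing that, to the closest Internet Archive capture via the Wayback Machine availability API; the mapping table and URL patterns are placeholders, not the Library's actual configuration.

```python
# Hedged sketch: redirect a retired URL to Wairētō if a mapping exists,
# otherwise to the closest Internet Archive capture. Mapping and URLs are
# hypothetical placeholders.
import requests

WAIRETO_MAP = {
    "http://nzetc.victoria.ac.nz/tm/scholarly/tei-EXAMPLE.html":
        "https://waireto.victoria.ac.nz/IE0000001",
}

def resolve_legacy_url(old_url):
    """Return the redirect target for a link to a decommissioned site."""
    if old_url in WAIRETO_MAP:
        return WAIRETO_MAP[old_url]
    # Fall back to the closest Internet Archive capture, if one exists.
    resp = requests.get(
        "https://archive.org/wayback/available", params={"url": old_url}
    ).json()
    closest = resp.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url")  # None if the page was never captured

print(resolve_legacy_url("http://nzetc.victoria.ac.nz/tm/scholarly/tei-EXAMPLE.html"))
```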
This four-stage process will also act as a pilot for the Library to potentially establish a web archiving service for the wider University.
The presenters will share how this process has been implemented, issues and solutions raised, and where to next. We will be discussing the use of third party tools for web archiving and how to link them into internal tools and workflows.
Arnoud Goos
Netherlands Institute for Sound and Vision
A national mini-IIPC: setting up collaboration in web archiving in The Netherlands
Of all Dutch websites, only one percent is actually archived by one of the national web archives. A lot is lost, or already gone. In The Netherlands there are many organisations that have relatively small web archives, and collaboration between them is important. Each of these national or regional archives has its own reasons for archiving websites and its own collection scope and selection criteria. The Royal Library is by far the largest web archive in The Netherlands, but besides it there are the Netherlands Institute for Sound and Vision, which collects media-related websites, the University of Groningen, which collects political websites, and quite a few regional archives that collect websites from local schools, sport clubs, local festivals or other websites on life in the city or region. Until recently these web archives were acting on their own, sometimes even without knowing of each other's existence.
This has changed over the past two years. The Digital Heritage Network and the Netherlands Institute for Sound and Vision have worked on setting up collaboration between these different initiatives. By starting a web archiving expert group, a sort of national mini-IIPC has been created. Besides organizing conferences, the Digital Heritage Network has produced videos on the importance of web archiving and developed the National Register for Archived Websites. This register is an overview of all the websites that have been archived in the Netherlands. It shows the archived URL, the period in which it was crawled, the tools that were used, how accessible the archived website is and the reason for archiving the website. This overview is at first accessible to web archiving professionals (to see what colleagues are and are not archiving), but when all the data is up to date, it will be made accessible to the public. Because of copyright issues, the register for now only contains metadata and links to the web archives' live websites instead of the archived pages. Hopefully, this can change in the near future.
Andrej Bizík, Peter Hausleitner & Jana Matúšková
University Library in Bratislava
Digital resources – the national project of webharvesting and webarchiving in Slovakia
In April 2015 the University Library in Bratislava (ULB) was charged with the national project 'Digital Resources – Web Harvesting and E-Born Content Archiving.' The goals of the project were the acquisition, processing, trusted storage and use of original Slovak digital resources. Its ambition was to establish a comprehensive information system for harvesting, identification, management and long-term preservation of web resources and e-Born documents (a platform for controlled web harvesting and e-Born archiving). The Digital Resources Information System consists of specialised, mainly open source software modules in a modular system with a high level of resource virtualization. Its basis is a server cluster, consisting of dedicated public and internal portal servers and a farm of 'work' servers for running the system processes. The system management is optimized for parallel web harvesting, which enables the system to carry out a full domain harvest with the required politeness in an acceptable time.
At present, the ULB web archiving system has 800 TB of storage at its disposal. The application is supported by a powerful hardware infrastructure: a farm of 21 blade servers providing a virtual environment for multiple harvesting processes, and 3 standalone database servers. The hardware components are interconnected via high-speed channels. The system also includes support modules for communication, monitoring, backup and reporting. A very useful feature is a functionally identical parallel testing environment, which enables preventive harvests and problem analysis without interfering with production processes.
A substantial part of the system is the catalogue of websites, which is regularly updated during the automated survey of the national .sk domain. Domains outside .sk that match our policy criteria (e.g. .org, .net, .com, .eu) are added to the catalogue manually.
The operation, management and development of the Digital Resources Information System is performed by the ULB's Deposit of Digital Resources department, with one head, three specialised digital curators and one part-time person for born-digital titles.
The project finished in the fall of 2015; at present, routine practice continues. Since 2015 ULB has performed three full-domain harvests of the national .sk domain, as well as multiple selective and thematic crawls.
Ilya Kreymer
Rhizome
Pywb 2.0: technical overview and Q&A, or everything you wanted to know about high-fidelity web archiving but were afraid to ask
Webrecorder pywb (python wayback) is a fully open-source Python package designed to provide state-of-the-art, high-fidelity web archive replay. The pywb 2.0 version was released at the beginning of 2018 with an extensive list of new features.
Originally developed as a replacement for the classic Wayback Machine, the latest release includes several new features going beyond that original scope, including a built-in capture mode for on-the-fly WARC capture and patching, full HTTP/S proxy mode, a Memento aggregation and fallback framework, an access control system, and a customizable rewriting system.
The presentation will briefly discuss the new features in pywb and how they can help institutions provide high fidelity web archive replay and capture. However, the purpose of the talk is not to be a tutorial on how to use pywb, but rather to share the knowledge of the many difficult problems facing web archive capture and replay for an ever-evolving web and to present the possible solutions that have worked in pywb to solve them. Topics covered will include the mechanics of the fuzzy matching in the rewriting system present in pywb, client-side rewriting, video stream rewriting, and domain-specific rules. Ongoing work and remaining unsolved technical challenges facing web archives in the future will be discussed as well.
The talk will end with a Q&A portion that will help inform future pywb development and help the project become more useful to the IIPC community.
Nicola Bingham
The British Library
Preserving the public record vs the ‘right to be forgotten’: policies for dealing with notice & takedown requests
The mission of the UK Web Archive is to build web collections that are as comprehensive and as widely accessible as possible. However we must achieve this responsibly, lawfully and ethically. Increasingly, the public are concerned about their data privacy and the risk of exposure of sensitive personal data online.
The EU General Data Protection Regulation, and the new UK-only Data Protection Act which will align GDPR with UK law, have implications for web archiving. Most significantly, "a right [for the data subject] in certain circumstances to have inaccurate personal data rectified, blocked, erased or destroyed".
Public bodies will likely have derogation under "performance of a task carried out in the public interest". But the data subject has pre-eminence, and can request that information is removed if they claim "significant harm or distress".
In light of this new legislation, we have been looking at tensions around the archival principles of preserving the public record vs the individual’s expectation of the right to be forgotten, i.e. withdrawing their content from the archive on request. Under what circumstances should we honour such requests?
The presentation will explore how we minimise the risk of crawling and exposing personal data in the first place and how we deal with requests for take down of material. What policies and procedures are in place? What criteria do we use to evaluate individual cases? Are we transparent and consistent in our take down policies?
Kathryn Stine, Stephen Abrams & Peter Broadwell
California Digital Library
Cobweb: collaborative collection development for web archives
The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the capacity of any single institution. As such, collaborative approaches are necessary for the future of curating web archives, and their success relies on curators understanding what has already been, or is intended to be, archived, by whom, when, how often, and how. Collaborative web archiving projects such as the US End-of-Presidential-Term, IIPC CDG Olympics, and CA.gov (California state government) collecting endeavors demonstrate how curators working across multiple organizations in either coordinated efforts or direct partnerships, and with ad hoc collaboration methods, can accomplish much more than they might alone. With funding from the US Institute of Museum and Library Services, Cobweb (www.cdlib.org/services/cobweb/), a joint project of the California Digital Library, UCLA, and Harvard University, is a platform for supporting distributed web archive collecting projects, with an emphasis on complementary, coordinated, and collaborative collecting activities. Cobweb supports three key functions of collaborative collection development: suggesting nominations, asserting claims, and reporting holdings. This holdings information also supports a fourth Cobweb function, collection-level thematic search.
Curators establish thematic collecting projects in Cobweb and encourage nominators to suggest relevant seed websites as candidates for archiving. For any given collecting project, archival programs can claim their intention to capture a subset of nominated seeds. Once they have successfully captured seeds included in a given collecting project, descriptions of these holdings will become part of the Cobweb holdings registry. Cobweb interacts with external data sources to populate this registry, aggregating metadata about existing collections and crawled sites to support curators in planning future collecting activity and researchers in exploring descriptions of archived web resources useful to their research. Note that Cobweb is a metadata registry, rather than a repository or a collecting system; it aggregates and provides transparency regarding the independent web archiving activities of diverse and distributed archival programs and systems.
This presentation will include a walkthrough of the Cobweb platform, which is scheduled for production launch just prior to the IIPC General Assembly and Web Archiving Conference.
Samantha Abrams
Ivy Plus Libraries
One year down: taking the Ivy Plus Libraries web resources collection program from pilot to permanent
First established as a Mellon-funded project in 2013, the Web Resources Collection Program within Ivy Plus Libraries now finds itself at the end of its inaugural year as a permanent program. Founded and funded to explore the bounds of collaboration and web archiving, the Ivy Plus Libraries Web Collection Program bills itself as a collaborative collection development effort established to build curated, thematic collections of freely available, but at-risk, web content in order to support research at participating Libraries and beyond. Part of a partnership that stretches across the United States, Ivy Plus Libraries includes: Brown University, the University of Chicago, Columbia University, Cornell University, Dartmouth College, Duke University, Harvard University, Johns Hopkins University, the Massachusetts Institute of Technology, the University of Pennsylvania, Princeton University, Stanford University, and Yale University.
In order to successfully transition from its uncertain and grant-funded state, Ivy Plus Libraries focused on growth: in its first year, the Program hired both full-time and part-time staff members dedicated to web archiving, documented and created collection policies and communication strategies, and both expanded two pilot collections and built brand new, selector-curated collections. This presentation, delivered by the Program’s web archivist, focuses on these efforts and discusses the painstaking process of creating both a workable and centralized collaborative web archiving program, sharing the Program’s successes and areas in which it seeks to improve, touching on topics including: outreach and securing program buy-in, working with and educating subject specialists in order to create new and evolving collections, and shaping the Program’s reach and objectives, in addition to what the day-to-day work of web archiving — crawling, quality assurance, and metadata creation, to name a few — looks like when carried out on behalf of stakeholders at thirteen prestigious institutions.
No longer a pilot program, where has Ivy Plus Libraries succeeded and where can it continue to improve? What does web archiving look like in this collaborative state, and where might it take the partnership — and similar collaborative projects around the globe — as the Program embarks upon its second year?
Suzi Szabo
Nationaal Archief
Web archiving guide for governmental agencies: how to ensure sustainable accessibility of Dutch governmental websites
At the Dutch National Archives we are aware of the risks of losing web information due to a lack of proper guidelines and best practices. In light of this we have recently published a Web Archiving Guide, providing Dutch governmental organizations with a better understanding of the records management requirements for archiving websites, and the tools and advice to do so in practice. This helps to ensure the sustainable accessibility of web information.
Because of the way responsibilities are shared between Dutch governmental organizations and public archives under the Public Records Act, governmental agencies – and not archival institutions – are responsible for the archiving of the records they produce, which includes websites. To ensure that the information on Dutch governmental websites is and stays accessible for the purposes of government accountability, business management, law & evidence, and future research, the National Archives has developed a guide with requirements for website archiving. The guide provides a set of requirements focusing on the responsibilities, process and result of website archiving for governmental agencies. It is based on the premise that harvesting is outsourced to a commercial party and that governmental websites are retained permanently. It also provides a roadmap describing all the steps in the process, from preparation for harvesting to the actual harvesting by a third party and eventually the transfer of the archived website to a public archive. Emphasis is put on the relevant stakeholders and their roles and responsibilities during the entire process. We will expand the scope of the guide in the future to eventually cover interactive websites and even social media.
In order to ensure usability and adoption, the guide was created in co-operation with the intended users, who took part in a large-scale public review of the draft version. Simultaneously, a nation-wide implementation project has been started for Dutch central governmental agencies.
Chindarat Berpan, Wachiraporn Klungthanaboon & Sittisak Rungcharoensuksri
Chindarat Berpan, Chulalongkorn University
Wachiraporn Klungthanaboon, Chulalongkorn University
Sittisak Rungcharoensuksri, The Princess Maha Chakri Sirindhorn Anthropology Centre
Archiving web content in anthropology: lessons learned for a step in the right direction
The Princess Maha Chakri Sirindhorn Anthropology Centre (SAC), a leading research centre in Thailand in the disciplines of anthropology, history, archaeology and the arts, perceives the significance of information in hand for further research and development. Recognizing the vast amount of web content provided by key organizations at the local and global levels, dispersed online anthropological information, and the disadvantages of unavailable and inaccessible information, the SAC is attempting to archive online anthropological information in order to ensure long-term access. To assess the possibility of a web archiving initiative, the SAC started with a four-stage preliminary study:
1) explore the websites of selected organizations in the discipline of anthropology in Thailand and Southeast Asia;
2) analyze web content on the selected websites in terms of genres, file formats, and subjects;
3) compare and contrast web archiving tools and metadata schemas by investigating key web archiving initiatives in the Asia-Pacific region;
4) identify considerations and challenges of web archiving in general.
The results of this preliminary study will provide useful information for the SAC to proceed to the next stage: policymaking and seeking collaboration. This presentation will review the project and deliver some lessons learned from the preliminary study.
Martin Klein, Lyudmila Balakireva, Harihar Shankar, James Powell & Herbert Van De Sompel
Los Alamos National Laboratory, Research Library
Smart routing of Memento requests
The Memento protocol provides a uniform approach to querying individual web archives. We have introduced the Memento Aggregator infrastructure to support distributed search and discovery of archived web resources (Mementos) across multiple web archives simultaneously. Given a request with an original URI and a preferred datetime, the Aggregator issues one request to each of the currently 22 Memento-compliant archives and determines the best result from the individual responses. As the number of web archives grows, this distributed search approach is increasingly challenged. Varying network speeds and computational resources on the archives’ side make delivering such aggregate results with consistently low response times more and more difficult.
In order to optimize query routing and thereby lower the burden on archives that are unlikely to hold a suitable Memento, we previously conducted research, in part supported by the IIPC, to profile web archives and their holdings. This work was based on the premise that if we knew which archive holds which URIs, we could make the Aggregator smarter. However, the sheer scale of archive holdings makes profiling a very time- and resource-intensive endeavor. In addition, the constantly changing index of web archives requires frequent re-profiling to reflect the current holdings of archives. These insights led to the conclusion that our profiling approaches are impractical and unsuited for deployment.
In this presentation we report on a lightweight, scalable, and efficient alternative to achieve smart request routing. This method is based on binary, archive-specific classifiers generated from log files of our Aggregator. We therefore get our “profiling” information from real usage data and apply the classifiers to determine whether or not to query an archive for a given URI. Our approach has been in production at the Memento TimeTravel service and related APIs for an extended period of time, which enables us to report on long-term performance evaluations. In addition, we report on further explorations using neural networks for smart request routing.
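As a rough illustration of the classifier-based routing idea described above (a sketch only, not the production implementation), the following fragment trains one binary classifier per archive from hypothetical aggregator log data and then forwards a lookup only to archives whose classifier predicts a likely hit; the use of scikit-learn, character n-gram features of the URI, and the 0.5 probability threshold are all assumptions made for the example.

```python
# Hedged sketch: per-archive binary request-routing classifiers trained on
# aggregator log data. Feature choice (character n-grams of the URI) and the
# scikit-learn models are illustrative assumptions, not the authors' exact setup.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5), n_features=2**18)

def train_archive_classifier(log_entries):
    """log_entries: iterable of (uri, archive_had_memento: bool) pairs from aggregator logs."""
    uris, labels = zip(*log_entries)
    X = vectorizer.transform(uris)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf

def archives_to_query(uri, classifiers, threshold=0.5):
    """Return only the archives whose classifier predicts a likely hit for this URI."""
    X = vectorizer.transform([uri])
    return [archive for archive, clf in classifiers.items()
            if clf.predict_proba(X)[0, 1] >= threshold]
```

Raising or lowering the threshold trades recall against the number of archives contacted per request.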
Our approach works to the benefit of both sides: the provider of the Aggregator infrastructure benefits as unnecessary requests are held to a minimum and responses can be provided more rapidly, and web archives benefit as they are not burdened with requests for Mementos they likely do not have. Given these advantages, we consider our method an essential contribution to the web archiving community.
Kit Condill
University of Illinois at Urbana-Champaign
Peeling back the onion (domes) in the North Caucasus: multi-layered obstacles to effective web-based research on marginalized ethnic groups
The North Caucasus is a remarkably multilingual and multi-ethnic region located on the southern borders of the Russian Federation. The predominantly-Muslim peoples of the region (which, for the purposes of this paper, includes Chechnya, the breakaway Georgian republics of South Ossetia and Abkhazia, and many other neighboring territories) have a fascinating and diverse online presence, generated by communities both within Russia and in emigration. Building on previous research into the vernacular-language online media environment of this strategic and conflict-prone region, my paper will argue for the importance of preserving online content in languages such as Chechen, Ossetian, Abkhaz, Avar, Kumyk, and Karachai-Balkar. The many-layered difficulties associated with establishing a web archive for the North Caucasus — and with making its contents easily and permanently accessible for scholars — will be outlined and discussed, along with possible solutions to these problems.
It will also be demonstrated that, despite the North Caucasus’ inherent intellectual appeal as an ancient melting pot of cultures, languages and religions (and its current significance in, among other arenas, the global battle against Islamic extremism), the region is woefully under-studied in English-speaking countries, and locally-produced vernacular-language sources are almost never used in contemporary scholarship. The creation of a web archive dedicated to the North Caucasus, therefore, would be an important step toward encouraging researchers to make systematic use of an already-existing corpus of primary-source material that is both substantial and readily (if not permanently) available. Presumably, North-Caucasus-related websites whose political or social stances and circumstances render them particularly vulnerable to disruption and content loss are (or should be) of particular interest to scholars, and these sites will be identified as priorities for future web-archiving efforts.
Tackling the problem of preserving web content in the languages of the North Caucasus also has broader implications, raising questions such as how well online content in minority languages is being preserved, the relationship between statehood/sovereignty and the feasibility of comprehensive web preservation efforts, and the role of web archiving in cultural and linguistic preservation.
Sara Elshobaky & Youssef Eldakar
Bibliotheca Alexandrina
A workflow for indexing and searching large-scale web archive content using limited resources
The primary challenge in making web archives searchable is their scale. This problem is even greater for the Bibliotheca Alexandrina (BA), which holds more than 1 PB of archived web content. Currently, the BA archive still lacks full-text search capabilities. As learned from other web archives’ full-text search experiences, a powerful machine is needed to build the search index, in addition to a cluster of machines to host the resulting indices. In this work, we present a workflow that automates the indexing process in a reasonable time frame, while tuning the parameters for machines with limited resources.
At the BA we make use of our older High-Performance Computing (HPC) cluster to build Solr indices for our web archive. It is not powerful by today’s standards, but it has up to 130 compute nodes, each with 8 GB of RAM, and a limited overall storage of 13 TB. Hence, we are able to host on it up to 10 Solr nodes to build the indices before relocating them to their final destination. Meanwhile, the BA is migrating its web archive content to a new 80-node cluster over the network, which considerably delays the whole process.
Manually managing and monitoring all these processes at the same time would be very cumbersome. Hence, we propose an automated workflow that controls the indexing process for each file from the point it is stored on the web archive cluster until it is made searchable through the web interface. The workflow consists of two separate parts that coordinate through a shared database. The first part is the set of processes that control the Solr indices: they manage building the 10 Solr indices in parallel on the HPC cluster. Once an index is finalized, its content is moved to its final destination in the permanent SolrCloud; the Solr index is then reset and registered as a new one. The second part is the set of processes running on the web archive cluster. Each process watches for new files on a specific drive, parses their content using the warc-indexer tool implemented by the UK Web Archive, and queries the database for the least-loaded Solr node to index the file’s content. All processes in the workflow log their status in the database so that they can easily coordinate and so the index can be rebuilt when needed.
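The sketch below is a hedged illustration of one such watcher process: it polls a drive for new WARC files, asks a shared coordination database for the least-loaded Solr build node, and hands the file to the warc-indexer. The table layout, the SQLite database, and the exact warc-indexer command line are illustrative assumptions rather than the BA’s actual implementation.

```python
# Hedged sketch of a single watcher process from the described workflow.
# Schema, paths and the warc-indexer invocation are schematic assumptions.
import glob
import sqlite3
import subprocess
import time

DB = "workflow.db"                       # shared coordination database (assumed schema below)
WATCH_GLOB = "/archive/drive42/*.warc.gz"  # hypothetical drive to watch

def least_loaded_solr_node(conn):
    # assumed table: solr_nodes(url TEXT, assigned_files INTEGER)
    row = conn.execute(
        "SELECT url FROM solr_nodes ORDER BY assigned_files ASC LIMIT 1").fetchone()
    return row[0]

def already_indexed(conn, path):
    # assumed table: files(path TEXT PRIMARY KEY, solr_url TEXT, status TEXT)
    return conn.execute("SELECT 1 FROM files WHERE path = ?", (path,)).fetchone()

def index_file(conn, path):
    solr_url = least_loaded_solr_node(conn)
    conn.execute("INSERT INTO files VALUES (?, ?, 'indexing')", (path, solr_url))
    conn.execute("UPDATE solr_nodes SET assigned_files = assigned_files + 1 WHERE url = ?",
                 (solr_url,))
    conn.commit()
    # schematic warc-indexer call; real flags depend on the warc-indexer version used
    subprocess.run(["java", "-jar", "warc-indexer.jar", "-s", solr_url, path], check=True)
    conn.execute("UPDATE files SET status = 'done' WHERE path = ?", (path,))
    conn.commit()

conn = sqlite3.connect(DB)
while True:
    for path in glob.glob(WATCH_GLOB):
        if not already_indexed(conn, path):
            index_file(conn, path)
    time.sleep(60)  # poll once a minute
```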
Theoretically, building a SolrCloud for the 1 PB of web archive files requires 100 Solr nodes with 1,000 GB of capacity each. Currently, the BA has at its disposal part of that hardware, recycled from older web archive machines. Each machine has only 2 GB of memory. We overcame this limitation by revisiting the Solr schema and studying all the search features that overwhelm the memory. The result is a responsive full-text search over the whole content, with a limited feature set.
João Gomes
Arquivo.pt at FCCN-FCT
Arquivo.pt: taking a web archive to the next level
Arquivo.pt is now 10 years old and holds content from 22 years of the Portuguese web, with a mature technological architecture, a powerful full-text search engine and several innovative services available.
To reach its full potential and applicability, an archive needs to be live and widely used. In order to achieve this objective, Arquivo.pt has put in place several dissemination activities to become known and used by the community, mainly among researchers, students and journalists.
This talk will focus on all the planned and executed activities of dissemination, advertisement and training, such as:
- Training sessions for journalists
- Digital Marketing campaigns
- Grants for researchers whose work uses Arquivo.pt assets
- Events
- Annual Arquivo.pt Prize 2018, with €15K in prizes
- Production of videos about best-practices
- Time travel features marking media sites’ anniversaries
- Press Releases and Media Partnerships
These activities have greatly improved awareness of Arquivo.pt and its usage.
Thib Guicherd-Callin
Stanford Libraries
Sifting needles out of (well-formed) haystacks: using LOCKSS plugins for web archive metadata extraction
As the volume of web archives has grown and web archiving has matured from a supplementary to an increasingly essential mechanism for collection development, there has been growing attention to the challenge of curating that content at scale. National libraries engaged in national domain-scale harvesting are envisioning workflows to identify and meaningfully process the online successors to the offline documents they historically curated. Elsewhere, there is increasing interest in the application of artificial intelligence to making sense of digital collections, including archived web materials. Where automated or semi-automated technologies are not yet adequate, crowd-sourcing also remains a strategy for scaling curation of granular objects within web archives.
The LOCKSS Program has developed significant expertise and tooling for identifying and parsing metadata for web-accessible objects in the domain of scholarly works: electronic journals and books, both subscription and open access, as well as government information. This is enabled by means of a highly flexible plugin architecture in the LOCKSS software, which augments the traditional crawl configuration options of an archival harvester like Heritrix with additional functionality focused on content discovery, content filtering, logical normalization, and metadata extraction. A LOCKSS plugin specified for a given publishing platform or content management system encodes the rules for how a harvester can parse its content, allowing for extraction of bibliographic metadata from, e.g., HTML, PDF, RIS, XML, and other formats.
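As a generic, hedged illustration of what such an extraction rule does conceptually (LOCKSS plugins themselves are defined in the LOCKSS software’s own plugin framework, not in the form shown here), the following sketch pulls Highwire-style citation_* meta tags out of an article landing page:

```python
# Hedged illustration only: extracting bibliographic metadata from
# <meta name="citation_*" content="..."> tags commonly found on
# publishing platforms. The sample HTML is hypothetical.
from html.parser import HTMLParser

class CitationMetaExtractor(HTMLParser):
    """Collects citation_* meta name/content pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name", "")
        if name.startswith("citation_") and "content" in attrs:
            self.metadata.setdefault(name, attrs["content"])

html = """<html><head>
<meta name="citation_title" content="An Example Article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_doi" content="10.1234/example.5678">
</head></html>"""

extractor = CitationMetaExtractor()
extractor.feed(html)
print(extractor.metadata)
# {'citation_title': 'An Example Article', 'citation_author': 'Doe, Jane', ...}
```

A platform-specific rule of this kind is what allows a harvester to turn crawled pages into structured bibliographic records.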
While the LOCKSS software has always fundamentally been a web preservation system, it has largely evolved in parallel with the tools and approaches of the larger web archiving community. However, a major re-architecture effort is currently underway that will bring the two into much closer alignment. The LOCKSS software is incorporating many of the technologies used by the web archiving community as it is re-implemented as a set of modular web services. The increased participation of LOCKSS in the broader community should bolster sustainability, but a more promising possibility is for cross-pollination of technical capabilities, with metadata extraction a component of probable interest to many web archiving initiatives.
This presentation will detail the capabilities of the LOCKSS plugin architecture, with examples of how it has been applied for LOCKSS use cases, how it will work as a standalone web service, and discussion with the audience of where and how such capabilities might be applied for broader web archiving use cases.
Abbie Grotke & Grace Thomas
Library of Congress
Expansion and exploration in 2018: processing the Library of Congress web archive
The Library of Congress Web Archive selects, preserves, and provides access to archived web content selected by subject experts from across the Library, so that it will be available for researchers today and in the future. The Library’s Digital Collecting Plan, produced in February 2017, calls for an expansion of the use of web archiving to acquire digital content. Our talk will focus on how the Library of Congress plans to accomplish this by expanding selective crawling practices and simultaneously scaling up technical infrastructure to better support program operations.
Web archiving efforts at the Library of Congress have, up to now, focused on highly selective collecting of specific, thematic collections proposed and described by subject specialists. An expansion of web archiving will require both enlisting additional subject specialists to engage in web archive collection development, and for those already engaged to broaden their web archiving selection to additional themes and subjects. The Library is also currently tackling the backlog of web archives in need of descriptive records for presentation on the Library’s website.
Expanding web archiving at the Library will also require finding solutions to analyze, process, and manage huge quantities of content. With over 1.3 PB of content and a current growth rate of more than 300 TB per year, the sheer size of the archive has begun to present technical challenges, particularly with rendering content on the public Wayback Machine and delivering research ready data to scholars. In 2018, new cloud infrastructure was made available to the Web Archiving Team for processing the archive. With this new capability, the team is exploring a variety of projects, including experimenting with alternate index models, generating multiple types of derivative files to gauge research engagement with the web archives content, and running analyses over the entire archive to gain deeper understanding of the content held within the collections.
In the coming months, the Library of Congress will ingest the web archive into the cloud and test new processes for managing the web archive at scale, and will be able to share stories of triumphs and challenges from this crucial transition with the greater web archiving community.
Jefferson Bailey
Internet Archive
Your web archives are your everything archives
As the web continues to consume all media — from publishing to video to music — large-scale web harvests are collecting a rich corpus of news, government publications, creative works, scholarly research, and other materials that formerly had their own defined dissemination and consumption frameworks. Large-scale harvesting efforts, however, are intentionally designed for breadth, scale, speed, and “content agnostic” collecting methods that treat different types of works similarly. Relatedly, the URL-centric nature of web collecting and access can limit how we curate and enable discovery of the materials within our web archives. Web archives themselves, as they grow, are ever more an agglomeration of diverse material whose only shared trait is publication via the web. How can we increase our knowledge of what is contained within the web collections we are building? And how can we gain a better understanding of what we have collected in order to inform improved description, curation, access and collection strategies?
This presentation will detail a number of both in-production and research and development projects by Internet Archive and international partners aimed at building strategies, tools, and systems for identifying, improving, and enhancing discovery of specific collections within large-scale web collections. It will outline new methods to situate and enhance the valuable content already collected in general web collections and implement automated systems to ensure future materials are well-collected and, when possible, are associated with the appropriate context and metadata. This work spans search, data mining, identifier association, integration with publishers, registries, and creator communities, machine learning, and technology and partnership development. The talk will outline potential approaches for moving from a concept of “archives of the web” to one of “archives from the web.”
Maria Ryan
National Library of Ireland
Working on a dream: the National Library of Ireland’s Web Archive
At a time when the National Library of Ireland (NLI) is undergoing a physical transformation as part of its re-imagining the library-building programme, it is also changing its approach to online collecting, including developing its web-archiving programme. The NLI has transitioned from a resource-limited model of selective-only web archiving to a larger-scale process that now includes domain web archiving. This presentation will examine the development of the web archive over seven years, from a pilot project in selective web archiving to an established collection within the library, built while struggling with limited resources and inadequate legislation. It will also examine how the change in web archiving strategy has had wider implications for the NLI web archive.
Despite the lack of adequate legal deposit legislation for web archiving or digital publications, the NLI has a mandate to collect and protect the recorded memory of Ireland, a record that is now increasingly online. In 2017, additional resourcing in the form of a full-time web archivist was secured and the NLI launched a domain crawl of the Irish web. Working with the Internet Archive, a crawl of the Irish top-level domain, relevant domains hosted in Ireland and websites in the Irish language was carried out. In total, 39 TB of data was captured and is due for release in 2018. This data is a unique resource in Ireland and, together with a previous 2007 crawl which will also be made available, it offers the researcher a decade-long view of the Irish internet. The NLI now intends, resources permitting, to carry out a domain crawl each year going forward.
The addition of domain crawling has allowed the web archivist to increase the amount of material being preserved and has also given greater freedom to develop the selective web archive. In the past, the web archive policy attempted to gather as much diverse material as possible and to adequately reflect important current events. With limited resources, however, this often meant small collections being archived on several different topics. Domain crawling allows a greater representation of data to be captured, while selective collections can be more focused on key events such as elections and referendums. The change in web archiving strategy has resulted in a revision of the NLI’s web archiving policy. This presentation will examine the development of the web archive and the challenges and consequences of developing web archiving in the National Library of Ireland’s own context of limited resources.
Kees Teszelszky
Koninklijke Bibliotheek
How to harvest born digital conspiracy theories: webarchiving Dutch digital culture in the post-truth era
According to the Oxford Dictionaries, “post-truth” denotes circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief. These circumstances have become especially present on the web in our time, as hate speech, extreme political beliefs and alternative facts flood web fora and social media. Jelle van Buuren (Leiden University) completed PhD research on conspiracy theories on the web (2016), addressing the question of whether conspiracy constructions are boosting hatred against the political system. The online discourse in which these conspiracy theories take shape can be characterized as “post-truth”.
The Koninklijke Bibliotheek – National Library of the Netherlands (KB-NL) has collected born-digital material from the web through web archiving since 2007. It makes a selection of websites with cultural and academic content from the Dutch national web. Most of the sites were harvested because of their value as cultural heritage. Due to these past selection criteria, many of the harvested websites are about history and heritage and, as such, are less suitable as primary sources for future historical research on our own digital age. This has changed with the new selection policy, which also includes conspiracy theories and other typical post-truth phenomena from the Dutch web.
I will describe the methods used and the experience gained in selecting, harvesting and archiving websites of the post-truth era. I will also discuss the characteristics of web materials and archived web materials, and explain the use of these various materials (harvested websites, link clouds, context information) for digital humanities research.
Furthermore, I will describe the challenges of web archiving in a country without legal deposit legislation. I will also argue that the combination of a growing variety of different kinds of digital materials and processes from the web calls for a reinterpretation of primary historical sources, raising the question of what can be regarded as an authentic born-digital source of our time.
Grace Thomas & Trevor Owens
Library of Congress
What can tiny, transparent GIFs from the 1990s teach us about the future of access and use of web archives?
With petabytes of WARC files containing billions of archived resources from the web, it is often difficult to know where to start in researching web archives. For this paper, we started with the simplest kind of resource possible: the single-pixel, transparent GIF. Commonly known as “spacer” GIFs, single-pixel transparent GIFs were used, among other functions, to format web pages before the advent of styling with CSS or JavaScript.
While the size and invisible nature of this particular resource make it seem insignificant, single-pixel, transparent GIFs were an integral element of early web design. Their presence in web archives studied over time lends insight into the history of the web and can inform our future ability to come to understand our digital past.
For this case study, we decided to use a small, curated set of single-pixel GIFs featured in Olia Lialina’s 2013 online art exhibit based on the GeoCities archive. In the exhibit, the ten transparent GIFs, two of which no longer resolve, are wrapped in frames and set against a tropical backdrop to show that, although invisible, they still take up space. We first assigned a digital fingerprint to the ten GIFs by computing an MD5 cryptographic hash for each file, discovering that, out of the original ten, there were only seven distinct files. We then used this unique identifier to search for the earliest appearances of the seven files in the Library of Congress web archive and the Internet Archive, respectively. In turn, Andy Jackson generously worked to trace the seven GIFs throughout the UK Web Archive, which revealed fascinating trends over time.
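The fingerprinting and deduplication step can be illustrated with a minimal sketch along the following lines; the file names are hypothetical placeholders rather than the actual exhibit files.

```python
# Minimal sketch: hash each GIF with MD5 and group files that share a digest.
# File names are hypothetical placeholders.
import hashlib
from collections import defaultdict

paths = ["transparent_01.gif", "transparent_02.gif", "transparent_03.gif"]

by_digest = defaultdict(list)
for path in paths:
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    by_digest[digest].append(path)

print(f"{len(by_digest)} distinct files out of {len(paths)}")
for digest, files in by_digest.items():
    print(digest, files)
```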
Our study is a first step in understanding how single-pixel GIFs were used across the web over time. However, we intend to present the paper not only as a study of one resource, but also as an account of how the study of a singular resource revealed many more aspects of the archiving and usage of born-digital materials. We will share our conclusions about hashing as a method of studying resources in the web archive, as well as our skepticism about whether results are indicative of the “true” live web at the time of archiving or instead reveal collection practices. Additionally, we will call for more comprehensive notes on scoping and crawling decisions, for content processing of an organization’s own web archive by its archivists, and for recognition of the value of multiple web archiving initiatives collecting the same websites.
Emmanuel Cartier, Peter Stirling & Sara Aubry
Emmanuel Cartier, Laboratoire d’Informatique de Paris Nord – Université Paris 13
Peter Stirling, Bibliothèque nationale de France (BnF)
Sara Aubry, Bibliothèque nationale de France (BnF)
Néonaute: mining web archives for linguistic analysis
Néonaute is a project that seeks to study the use of neologisms in French using the web archive collections of the BnF. Initially a one-year project funded by the French Ministry of Culture, it uses a corpus drawn from the daily crawl of around one hundred news sites carried out by the BnF since December 2010. Building on the existing projects Neoveille and Logoscope which seek to detect and track the life-cycle of neologisms, Néonaute aims to use web archives to study the use of neologisms over time.
The objective of the project is to create a search engine, Néonaute, that allows researchers to analyse the occurrence of terms within the collection, with enriched information on the context of use (morphosyntactic analysis) and additional metadata (named entities, themes). Several specific use cases are included in the project:
• analysis of the life-cycle of previously identified neologisms;
• comparative use of terms recommended by the DGLFLF, the body in charge of linguistic policy in France, versus terms already in circulation (especially Anglicisms);
• use of terms in feminine gender over the period.
The search engine interface is complemented with an interactive visualization module that allows users to explore the lifecycle of terms over the period, according to various parameters (themes of articles, journals, named entities implied, etc.).
Néonaute is based on the full-text indexing of the news collection carried out by the BnF, which represents 900 million files and 11TB of data. The presentation will discuss the technical challenges and the solutions adopted.
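As a hedged illustration of the kind of query such an index supports, the sketch below asks a Solr core for the yearly frequency of a candidate term using date-range faceting; the Solr URL, core name, and field names (text, crawl_date) are assumptions for the example, not the project’s actual schema.

```python
# Hedged sketch: charting a candidate term's frequency over time by querying a
# Solr full-text index with date-range faceting. URL and field names are assumed.
import requests

SOLR = "http://localhost:8983/solr/neonaute/select"   # hypothetical core

params = {
    "q": 'text:"infox"',                 # example term; any candidate neologism works
    "rows": 0,                           # we only need the facet counts
    "facet": "true",
    "facet.range": "crawl_date",
    "facet.range.start": "2010-12-01T00:00:00Z",
    "facet.range.end": "2018-01-01T00:00:00Z",
    "facet.range.gap": "+1YEAR",
    "wt": "json",
}

resp = requests.get(SOLR, params=params).json()
counts = resp["facet_counts"]["facet_ranges"]["crawl_date"]["counts"]
# counts is a flat [date1, n1, date2, n2, ...] list
for date, n in zip(counts[::2], counts[1::2]):
    print(date, n)
```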
The project is one of an increasing number of uses of the BnF web archives that seek to use techniques of text and data mining (TDM), and the first to use linguistic analysis. Under legal deposit and intellectual property legislation only certain metadata can be used outside the library premises and this kind of project is carried out under research agreements that fix the conditions of use of the collections while respecting the relevant legislation. The presentation will also discuss the issues that the BnF faces in allowing researchers to use such methods on the web archives.
Russell Latham
National Library of Australia
“Who by fire”: lifespans of websites from a web archive perspective
Web archives provide a valuable resource for researchers by providing them with a contemporary snapshot of original online resources. The value of archived content increases as the original is altered, migrated or taken offline. These changes occur dynamically on the web. Sometimes change occurs quickly, but sometimes it is slow, with the website evolving through iterative updates until it eventually becomes unrecognisable from the archived copy. Some studies state that websites have a lifespan of only 40 to 100 days. If true, this would mean that most websites evolve very quickly after archiving. As any web archivist can tell you, however, some websites have remarkable longevity and can remain live, accessible and little changed for many years. When can a website be considered gone or, to use an anthropomorphised term, ‘dead’? At its simplest, a website is dead when its URI (or domain) vanishes, along with all its hosted content. However, for many websites the story is not this simple and they fall into a grey area. A URI may remain stable while the content on the website changes, or the content may remain but migrate to a new URI. In both these scenarios the website has changed substantially, but is this enough to say the website has ceased to exist and is ‘dead’?
In this presentation, I will look at the key characteristics of the lifespan of a website and its eventual ‘death’. I will also examine whether a typology can be applied that will allow curators to identify the websites most at risk of disappearing. I will do this by first examining a sample of the National Library of Australia’s twenty-two-year-old Pandora archive to find trends that lead to the end of a website’s life. Second, I will apply this quantitatively to the NLA web archives to see whether websites disappear by sudden death or slow decay.
By examining the lifecycle of a website we will better understand the critical junctures of a website’s existence online and, by virtue of this, provide curators with a greater understanding of the best timing when determining harvest schedules. Typologies assist curators to predict the likely future path of a website and allow them to take appropriate preservation actions ahead of time. As curators we are also always trying to determine what within our collection is unique material and what is no longer available; by being able to pinpoint the end point of a website we are better placed to answer this question.
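As one hedged illustration of how ‘slow decay’ versus ‘sudden death’ might be measured quantitatively (this is not the method described in the presentation), the sketch below compares successive captures of a site by the Jaccard similarity of their word sets: a gradual decline suggests iterative drift, while an abrupt drop suggests sudden disappearance or domain takeover.

```python
# Hedged sketch: measure content drift between successive captures of a site.
# Capture texts here are placeholders; in practice they would be extracted from
# archived snapshots (e.g. after an HTML-to-text step).
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two documents."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

captures = [
    ("2015-03", "welcome to the example society news events membership contact"),
    ("2016-03", "welcome to the example society news events membership gallery contact"),
    ("2017-03", "domain for sale enquire now"),   # hypothetical sudden change
]

for (d1, t1), (d2, t2) in zip(captures, captures[1:]):
    print(f"{d1} -> {d2}: similarity {jaccard(t1, t2):.2f}")
```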
Thomas Egense & Anders Klindt Myrvoll
Royal Library, Denmark
Demo of the SolrWayback search interface, tools and playback engine for WARCs
SolrWayback is an open source web-application for searching and viewing Arc/Warc files. It is both a search interface and a viewer for historical webpages. The Arc/Warc files must be indexed using the British Library Webarchive-discovery/Warc-indexer framework.
Features:
- Free-text search across all MIME types
- Image search (similar to Google Images)
- Image search by GPS location using EXIF metadata
- Graph tools (domain link graphs, statistics, etc.)
- Streaming export of a search result to a new Warc file, which can be used to extract a corpus from a collection
- Screenshot previews of a URL over different harvest times
- Harvest times for all resources on a webpage
- Upload a resource (image etc.) to check whether it exists in the corpus
- Built-in SOCKS proxy to prevent leaking to the live web when viewing webpages
- Out-of-the-box solution for researchers to explore Arc/Warc files
- Easy to install and use on Mac, Linux and Windows: contains a web server, Solr and the warc-indexing tool. Just drop Arc/Warcs into a folder and start exploring the corpus.
SolrWayback on Github with screenshots: https://github.com/netarchivesuite/solrwayback
PANELS |
Yan Long, Regan Murphy Kao, Nicholas Taylor & Zhaohui Xue
Collaborative, selective, contemporary: lessons and outcomes from new web archiving forays focused on China and Japan
A reasonable assessment of web archiving efforts focused on China and Japan suggests that the level of collecting is not commensurate with the prominence of Chinese and Japanese web content broadly. Mandarin Chinese is the second-most common language of world internet users; Japanese is the seventh.[1] The distribution of the languages of websites is dominated by English, with other languages in the long-tail, but Mandarin Chinese and Japanese are both in the top nine languages, representing 2% and 5.1% of websites respectively.[2]
Meanwhile, the number of web archiving efforts focused on China and Japan is comparatively modest. The community-maintained list of web archiving initiatives highlights only three (out of 85) efforts focused on China or Japan.[3] A search for “china” or “chinese” on the Archive-It portal yielded 56 collections (out of 4,846); a search for “japan” or “japanese” yielded 43 – 1% or less of Archive-It collections for both.[4]
Recognizing the opportunity for more selective archiving of Chinese and Japanese web content, the Stanford East Asia Library has over the last several years led efforts to curate two major new collections, documenting Chinese civil society and contemporary Japanese affairs, respectively. This panel will discuss the particular motivations and impact of these collecting efforts, as well as address the following questions of more general interest to web archive curation practice:
- How can collaboration with researchers inform web content collecting efforts?
- What role do content creators themselves play in facilitating web content collecting efforts?
- How can coordination with and consideration of other institutions’ web content collecting efforts inform local collecting?
- What challenges – in terms of communications, funding, metadata, policy, quality assurance, staffing, workflow – are entailed in undertaking a new web archiving initiative and how can they be addressed?
- How is web content collecting continuous or discontinuous with the kinds of collecting that libraries have traditionally engaged in?
Apart from these questions of curatorial concern, this panel will also detail technical aspects of the two projects, including quality assurance observations and how Stanford Libraries has managed the collections through a hybrid infrastructure consisting of Archive-It, the Stanford Digital Repository,[5] a local OpenWayback instance,[6] and Blacklight-based discovery[7] and exhibits platforms.[8]
[1] https://www.internetworldstats.com/stats7.htm
[2] https://w3techs.com/technologies/overview/content_language/all
[3] https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives#Web_archiving_initiatives
[4] https://archive-it.org/explore?show=Collections
[5] https://library.stanford.edu/research/stanford-digital-repository
[6] https://swap.stanford.edu/
[7] https://searchworks.stanford.edu/
[8] https://exhibits.stanford.edu/
Gillian Lee
Idealism versus pragmatism: the challenges of collecting social media
The web archivists in the Alexander Turnbull Library within the National Library of New Zealand have been selectively harvesting websites using the Web Curator Tool since 2007. It has worked well for us in many respects, but the challenge of collecting social media has caused the Library to look for alternative ways to collect this kind of material. This has led the Library to rethink some of the workflows around building our web archive collection, and also some of the assumptions about the way such material should be collected and viewed.
Websites are treated as publications in our collecting model, just like any other publication, so we capture the content and try to maintain the look and feel of the site as it appeared online. However, collecting social media this way is challenging, so we are often caught between idealism and pragmatism. What is the best way to capture social media? What kind of access should we provide?
We are aware many researchers are using Twitter APIs to capture content or analyse content for their research. Since the Alexander Turnbull Library is a research library we want to ensure we can provide appropriate access to various kinds of researchers. So we decided to test this approach.
The presentation will cover two pilot studies undertaken to capture Twitter content using Twitter’s API.
1. The 2016 Kaikoura Earthquake Twitter crawl, which was the first time we captured Twitter data relating to a significant event using hashtags and search terms.
2. The 2017 General Election Twitter crawl, for which we crawled the Twitter accounts of every candidate and political party and also ran a ‘hashtag’/search-term crawl of the event. In addition, we requested the zipped files of Twitter accounts from candidates and political parties under Legal Deposit, for comparison.
The presentation will look at the results of the pilot projects and the issues we encountered and resolved along the way: in particular, the need for library staff to work more closely with technical staff to run the crawls and understand the new format we were dealing with; the kinds of access copies we created to enable access for different kinds of users; how we described the material; and what the next stage will be.
Jérôme Thièvre & Géraldine Camille
Jerome Thievre, Institut National de l’Audiovisuel
Géraldine Camille, Bibliothèque Nationale de France
Twitter challenges
The Bibliothèque Nationale de France (BnF) and the Institut National de l’Audiovisuel (INA) have shared the French web legal deposit mission since 2009. Social networks have become a huge phenomenon since the early 2010s and the main entry point to the web for a significant number of users. Our two institutions had to find a way to archive these new platforms. The BnF and INA have tackled this challenge with different strategies and crawl methods, and we believe we now have enough hindsight to share our experiences with other institutions.
The BnF started crawling Twitter in 2011 for specific projects such as elections or the Olympic games, and at the beginning of 2018 a twice-daily crawl was launched to collect, throughout the year, accounts selected according to the BnF’s collection policy. The BnF’s goal is to harvest the maximum number of tweets and their links with other accounts or hashtags, and to render them as they were published on the live web. For this reason the BnF chose to use Heritrix 3 (NetarchiveSuite 5.3) rather than an API service. 40 tweets per day per account or hashtag are available in the BnF web archives (based on OpenWayback). However, this solution has some limits. The quality of the links in the archives depends on the duration of the crawl and the number of accounts selected (during the elections, 3,690 accounts were crawled). Videos and external links (shortened URLs) are missing. A specific quality assurance process was set up based on crawl logs. This crawl process allows the BnF to preserve this collection along with its other crawl data.
INA launched its first Twitter crawl in 2014, based on the various public Twitter APIs and treating tweets as structured data rather than standard web resources. Unfortunately, these public APIs come with restrictions that not only make retrieving live tweets more difficult but also make it harder to retrieve historical data. INA’s crawler combines different APIs to ensure the completeness of the archive as far as possible. In cases where completeness cannot be achieved, the user interface is designed to make the gap visible to users. Today, 12,000 selected accounts and 560 selected hashtags are crawled continuously, as well as the videos and images in tweets. INA’s user interface, presented previously to the IIPC, is still a work in progress and there are many open questions about crawling, access and preservation methods. One of the next objectives is to make the Twitter archive available alongside TV archives; INA is therefore currently working on a Social TV user interface that will mash up audiovisual programmes and related tweet data.
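For illustration only, the sketch below shows the general shape of API-based tweet capture of the kind described: it pages backwards through an account’s timeline via the Twitter REST API v1.1 user_timeline endpoint (as it existed at the time) and stores each tweet as a line of JSON. Credentials are placeholders, pagination and rate-limit handling are simplified, and this is not INA’s actual crawler.

```python
# Hedged sketch of API-based tweet capture; credentials and account name are
# placeholders, and rate-limit handling is omitted for brevity.
import json
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"

def archive_timeline(screen_name, out_path, pages=16):
    """Page backwards through an account's timeline and store tweets as JSON lines."""
    max_id = None
    with open(out_path, "a", encoding="utf-8") as out:
        for _ in range(pages):
            params = {"screen_name": screen_name, "count": 200, "tweet_mode": "extended"}
            if max_id is not None:
                params["max_id"] = max_id
            tweets = requests.get(URL, auth=auth, params=params).json()
            if not tweets:
                break
            for tweet in tweets:
                out.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            max_id = tweets[-1]["id"] - 1   # avoid re-fetching the oldest tweet

archive_timeline("example_account", "example_account.jsonl")
```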
These two approaches are complementary. INA and the BnF would like to share, in a joint presentation, the challenges and opportunities related to this new type of collection and to open the discussion to other experiences. We will detail the different organizational and technical components that have been developed for curation, crawling, access and preservation.
Jasmine Mulliken, Anna Perricci, Sumitra Duncan & Nicole Coleman
Capturing complex websites and publications with Webrecorder
Jasmine Mulliken, Stanford University Press
Anna Perricci, Rhizome
Sumitra Duncan, New York Art Resources Consortium
Nicole Coleman, Stanford University
Recent advancements in web archiving technology are looking especially promising for the preservation of artistic and scholarly work. While the Internet Archive continues to play a crucial role in this endeavor, there are limitations that make it challenging for publishers, art museum libraries, and artists to capture the increasingly dynamic and complex interactive features that often define the work they produce, present and preserve. Fortunately, there is an option in addition to the Internet Archive for organizations working in art and scholarly publishing, two fields that often deal with unique, complex, and bespoke web content.
Webrecorder, a Rhizome project funded by the Andrew W. Mellon Foundation, offers a symmetrical approach to web archiving in that a web browser is used both to capture websites and to access the archived web content. Webrecorder’s tools fill some gaps left by large-scale services and allow a more granular and customizable experience for curators and publishers of digital content. An updated version with improved functionality and features will be released in March 2018. Rhizome’s planning for technical development and financial sustainability is in progress, and by the end of 2018 a robust plan for growth over the next 3-5 years will be established.
Stanford University Press is breaking new ground in scholarly publishing with its Mellon-funded initiative for the publication of online interactive scholarly works. Unlike typical open-access textbooks or ebooks, these works carry all the heft of a traditional monograph but in a format that leverages the potential of web-based digital tools and platforms. In fact, these works could not be published in traditional monograph form because the arguments are embedded in the technology. Included in SUP’s grant is a mandate to archive and preserve these ephemeral works. Webrecorder is proving to be especially compatible with the bespoke projects the Press is publishing.
The New York Art Resources Consortium (NYARC), the research libraries of the Brooklyn Museum, The Frick Collection, and The Museum of Modern Art, has developed a collaborative workflow for building web archive collections, with captures of several thousand websites now publicly available. NYARC’s web archive collections include the consortium’s institutional websites and six thematic collections pertaining to art and art history: art resources, artists’ websites, auction houses, catalogues raisonnés, NYC galleries, and websites related to restitution scholarship for lost or looted art. NYARC has actively contributed to user testing of the Webrecorder tool over the past three years and has now integrated the tool into its workflow as a complement to its use of the Archive-It service. The use of Webrecorder has been especially pertinent to capturing complex museum exhibition sites, scholarly sites devoted to specific artists, and the museums’ social media accounts.
This panel will include an overview of Webrecorder’s most significant new features and plans for sustainability. Co-panelists from Stanford University Press and NYARC will explain and demonstrate their uses of Webrecorder in the context of their unique projects, which represent their fields’ unique web archiving needs.
Amy Joseph, Nicola Bingham, Peter Stirling, Kristinn Sigurðsson & Maria Ryan
Legal deposit in an era of transnational content and global tech titans
Amy Joseph, National Library of New Zealand
Nicola Bingham, The British Library
Peter Stirling, BnF
Kristinn Sigurðsson, National and University Library of Iceland
Maria Ryan, National Library of Ireland
Legal deposit has evolved over time to provide a mandate for many national collecting agencies to collect content from the web, including web archives. The empowering legislation differs from country to country, so as a community we have both shared concerns and unique challenges. This panel will feature several national agencies that collect content from the web, including some who do not have legal deposit legislation covering online content. Participants will discuss issues such as:
- how well their legislation supports web archiving and other content collection from the web
- any grey areas in how their legislation applies to the contemporary global online publishing environment
- approaches to collecting web content when not firmly supported by empowering legislation
- challenges in providing access to material collected under legal deposit, and in communicating access restrictions to users.
The panel will not just be a presentation of issues, but also an exploration of our appetite for a joined-up approach to archiving websites and other content from the web. Could we do collaborative collection development for collecting areas that transcend national boundaries, beyond major event-based collections? Could national deposit agencies make a collaborative commitment to approaching major technology companies (like Google, Amazon or Bandcamp) to ensure that content from global content platforms is collected and preserved?
POSTERS WITH LIGHTNING TALKS |
Alexis Antracoli & Sumitra Duncan
Alexis Antracoli, Princeton University
Sumitra Duncan, Frick Art Reference Library
It’s there, but can you find it? Usability testing the Archive-It public interface
Archive-It is used by governments, universities, and non-profit institutions in 17 countries. It is one of the most widely used web archiving tools in the world, but how easy is it for researchers to find what they are looking for in the discovery interface? Do researchers even know that the tool exists or how it differs from the Internet Archive? Seven archivists and librarians at six institutions in the United States spent the last year asking exactly these questions. Little is currently known about the usability of archival discovery systems, and even less about those devoted specifically to web archiving. This project filled in one of those knowledge gaps, answering questions about the usability of the Archive-It user interface for a range of users.
Representing Princeton, Temple, and Columbia universities, Swarthmore College, the Frick Art Reference Library, and the Delaware Public Archives, the team was constructed to include representatives from the range of institutions that make use of Archive-It as a web archiving tool. The team conducted usability tests with faculty, students, staff, and the general public at their respective institutions. Testing examined the ability of users to find specific archived websites using available search, browse, and sort features; to understand the language used by web archivists to describe websites (e.g. seed, capture); and to find groups of related websites and archived videos. The team also explored whether users knew about or had previously used archived websites in their work. After completing thirteen tests, with at least two users representing each target audience (faculty, staff, students, and the general public) and each institution, the team coded and analyzed the results. This poster and talk will present both the methods and results of these tests, pinpointing areas where the Archive-It interface succeeds as well as specific suggestions for improvement.
Mark Boddington
Scientific Software and Systems Limited; Victoria University of Wellington
Legal deposit legislation for online publications: framing the issues
This lightning talk is about legal deposit legislation and what can be done to make it fit for purpose in the digital age. In many countries, including New Zealand, legal deposit law is struggling to keep up with new methods of publication enabled by the Internet. This is due to the shift from traditional publishing to self-publishing via online platforms, and the more proactive stance taken by libraries towards the acquisition of legal deposit materials. The nature and volume of publications available on the Internet mean librarians do not expect to receive all qualifying material from online sources and instead use web harvesting systems to collect content that is available on the open web. Although national legal deposit laws have been amended to enable this to occur, some problems remain.
When legal deposit libraries harvest blogs and digital commentary they encounter a variety of legal issues that fall outside the scope of legislation. These include jurisdictional barriers to the acquisition of content arising from the global nature of Internet services, and difficulty complying with online platforms’ terms and conditions of access. Legal matters become even more fraught when libraries permit public access to their collections of online publications. This talk will expand on these topics and will identify the underlying legal issues that need to be addressed to strengthen legal deposit laws.
Kathryn Stine, Kris Kasianovitz, Julie Lefevre & Lucia Orlando
Kathryn Stine, California Digital Library
Kris Kasianovitz, Stanford University
Julie Lefevre, University of California, Berkeley
Lucia Orlando, University of California, Santa Cruz
Crowdsourcing descriptive metadata for web archives: the CA.gov Archive
Librarians at the University of California, Stanford University, and the California State Archives and State Library are working to ensure that valuable evidence of California’s state government history is collected and preserved in the Archive of California government documents (CA.gov Archive). California state government publications have nearly ceased being distributed in print; instead, they are now almost exclusively born-digital and available only on agency websites, requiring that this content be captured in a systematic way that ensures its longevity and accessibility. Building the CA.gov Archive entails a great deal of coordinated work, from seed selection, running crawls and performing QA activities to creating metadata for the collection.
Recognizing the need for enhanced seed-level metadata to improve discovery of and access to this significant collection of state government information, the CA.gov project team established a crowdsourcing project to engage other library and archives professionals in working together to describe the archived sites. In December 2017, we leveraged the power of 120 librarians and library staff volunteers from around the state (and beyond!) in a weeklong Metadata Sprint to enhance description of archived websites in the CA.gov collection. This is a good example of what the library community can accomplish by working together and provides a roadmap for others wishing to initiate a similar crowdsourcing project. We look forward to sharing our successes as well as what we’ve learned from the challenges we encountered. This poster and lightning talk will cover the project team’s planning process and sprint organization method (including outreach approaches and the development and deployment of training materials), as well as how we incorporated emerging best practices for web archives metadata and approached getting enhanced metadata into Archive-It and other discovery environments. For more on the CA.gov Archive Metadata Sprint, visit the project website: http://guides.lib.berkeley.edu/ca-gov-sprint
Grace Thomas, Maria Praetzellis, Edward McCain & Matthew Farrell
Grace Thomas, Library of Congress
Maria Praetzellis, Internet Archive
Edward McCain, University of Missouri Libraries and Reynolds Journalism Institute
Matthew Farrell, Duke University
Tracking the evolution of web archiving activity in the United States
From 2011 to 2016, the National Digital Stewardship Alliance of the United States (NDSA) sponsored three surveys of web archiving activity throughout the United States as a longitudinal study of the evolution of web archiving practices and trends. The survey was held again in 2017, and the NDSA Web Archiving Survey Working Group is currently combing through the results in preparation for writing the final report by summer 2018.
In order to track the evolution of web archiving activity, the survey studies similarities and differences in programmatic approaches, types of content being archived, tools and services being used, access modes being provided, and emerging best practices and challenges. The reports have been instrumental in advocating for web archiving resources at various institutions throughout the United States. They also allow for a formation of community among practitioners often scattered by geographical location as well as archiving objectives.
Already in our results, we have seen the continued maturity of the profession, diversification in the types of organizations engaged in web archiving, and some stagnation in the key areas of staffing and digital preservation. We have also seen a greater distribution of tools used, with a surge in WebRecorder usage, often paired with Heritrix. For the first time in the history of the survey, URL search is now the most popular means of access to web archives content, surpassing browse lists and full-text search.
This poster will review additional key findings from the 2017 survey and contextualize the significance of the results within the landscape of United States web archiving activity created from all four surveys. The poster will feature visualizations of the results in order to highlight areas of growth and potentials for further advancement. The survey working group believes the results from the United States survey can launch discussions about similar efforts in other areas of the globe, how the results fit into the international landscape of web archiving activity, and additional areas of inquiry for future surveys.
Zhenxin Wu & Xie Jing
National Science Library, Chinese Academy of Sciences
Poster 1 – iPRES 2020 Introduction and Cooperation
An invitation from the Chinese Academy of Sciences (CAS) and eIFL (Electronic Information for Libraries) in 2003 provided the initial impetus for the iPRES series. Eight European experts in digital preservation contributed to the first iPRES conference in Beijing in July 2004. Since that successful first conference in Beijing, 15 iPRES conferences have been held across Asia, Europe and America.
iPRES 2020, the 17th International Conference on Digital Preservation, will return to Beijing, where iPRES began in 2004. It will take place on September 21–24, 2020, the most beautiful season in Beijing. The National Science Library, Chinese Academy of Sciences (NSLC) and the National Science & Technology Library (NSTL) will co-host iPRES 2020.
The central theme of iPRES 2020 is “Empowering Digital Preservation for the Enriched Digital Ecosystem – meeting the preservation challenges of the evolving types of digital content”.
We now welcome your suggestions, proposals and collaboration to make iPRES 2020 a success. Please keep an eye on our website: http://ipres2020.cn
Poster 2 – Chinese National Digital Preservation Programme for Scientific Literature
NDPP China is funded by the National Science & Technology Library under the Ministry of Science and Technology, China. It is a cooperative system in which more than 200 research and academic libraries participate, operating with multiple preservation nodes at major institutions.
NDPP China aims to preserve, in mainland China, digital scientific publications, including journals, books, patents, proceedings, reference works, and rich-media publications, from major commercial and society publishers inside or outside China.
After four years’ work, we have set up strategy-guided selection principles, formed a cooperative responsibility system, developed OAIS-based trusted technical platforms, implemented trusted archival management, carried out standards-based auditing and certification, and provided triggered service management. Currently the NDPP covers more than 13,000 e-journals from 10 publishers.
NDPP strives to promote research and practice of digital resource preservation in China, and also looks forward to exchanges and cooperation with international experts and institutions.
Website: http://www.ndpp.ac.cn