Tuesday 13 Nov 2018
8:00 am - 9:00 am REGISTRATION & COFFEE
Auditorium Reception Desk & Tiakiwai Foyer
Registration (08:00-18:00): Auditorium Reception Desk
Arrival Tea & Coffee: Tiakiwai Foyer
9:00 am - 9:30 am PLENARY
Steve Knight & Bill Macnaught
Auditorium
9:30 am - 10:30 am PLENARY: KEYNOTE
Te Māwhai – te reo Māori, the Internet, archiving, and trust issues
Dr Rachael Ka’ai-Mahuta
Auditorium
Dr Rachael Ka’ai-Mahuta, Te Ipukarea, Auckland University of Technology introduced by Steve Knight, National Library of New Zealand
The Internet, much like formal education before it, has provided Māori with an opportunity to pick up a tool of English language dominance and re-purpose it to advance te reo Māori (the Māori language). There is now a wealth of knowledge online regarding te reo Māori, including new words and phrases, in-depth discussions about language rules, exemplars, and the development of language trends. In the Māori world, knowledge is not “owned” individually, but collectively. The Māori cultural concept of kaitiakitanga (guardianship) dictates that we have a responsibility to pass on community-held knowledge to the next generation. Therefore, it is imperative that we archive born-digital te reo Māori content, especially as often the digital copy is the only copy. However, the collection and storage of Indigenous knowledge and data raises questions regarding control, self-determination, and the right to free, prior and informed consent. This paper will explore these issues and others, which sit at the intersection of language revitalisation, web archiving, and Indigenous Peoples’ rights.
Dr Rachael Ka‘ai-Mahuta is of Māori (Ngāti Porou, Ngāi Tahu), Native Hawaiian, and Cook Island Māori descent. She is a Senior Researcher in Te Ipukarea, the National Māori Language Institute and Associate Director of the International Centre for Language Revitalisation at the Auckland University of Technology.
After attaining her Doctorate in 2010, Rachael was awarded a Te Wheke-a-Toi Post-Doctoral Fellowship, completed a certificate on Indigenous Peoples’ Rights and Policy at Columbia University, graduated from Te Panekiretanga o Te Reo, the Institute of Excellence in the Māori Language, and was appointed as a Commissioner to the Library and Information Advisory Commission.
Rachael’s research interests include: Indigenous Peoples’ rights (particularly those relating to the politics of identity and place), language revitalisation (specifically the revitalisation of te reo Māori), the Māori oral tradition as expressed through the Māori performing arts, and digital technology for the preservation and dissemination of Indigenous knowledge.
10:30 am - 11:00 am MORNING TEA
11:00 am - 12:30 pm SESSION 1
11:00 am - 12:30 pm NEW CRAWLING APPROACHES Chair: Kia Siang Hock
Continuous, incremental, scalable, higher-quality web crawls with Heritrix 11:00-11:30
Andrew Jackson, The British Library
Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition, as our Heritrix3 setup was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on and harder still to augment the process with automated quality checks. To attempt to address this, we have combined three main tactics: we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix3 itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.
Archiving websites containing streaming media: the Music Composer Project 11:30-12:00
Howard Besser, New York University
Web Crawling software is notoriously deficient at capturing streaming media. For the past two years New York University Libraries has been working with the Internet Archive to replace the ubiquitous Heritrix web crawler with one that can better capture streaming audio and video. With funding from the Andrew W. Mellon Foundation they have created a new crawler (Brozzler) and tested this within the context of archiving the websites of contemporary young composers (showing how early-career composers represent themselves with a web presence).
This presentation will examine the deficiencies in current web crawlers for handling streaming media and presenting it in context, and explain how Brozzler addresses those deficiencies by extending existing web archiving tools and services to not only collect audio and video streams, but also to present the results in proper context. It will also explain the project to archive composer websites, touching on everything from contractual arrangements and working relationships with the composers, to tying together various NYU Library tools with the Internet Archive’s Archive-It, to assessing researcher satisfaction with the result. It will also cover the combinations of automated and manual methods for archiving composer websites.
Building event collections from crawling web archives 12:00-12:30
Martin Klein, Lyudmila Balakireva & Herbert Van De Sompel, Los Alamos National Laboratory, Research Library
Event-centric web collections are frequently built by crawling the live web with services such as Archive-It. This process most commonly starts with human experts such as librarians, archivists, and volunteers nominating seed URIs. The main drawback of this approach is that seed URIs are often collected manually and the notion of their relevance is solely based on human assessment.
Focused web crawling, guided by a set of reference documents that are exemplary of the web resources that should be collected, is an approach that is commonly used to build special-purpose collections. It entails an algorithmic assessment of the relevance of the content of a crawled resource rather than a manual selection of URIs to crawl. For both web crawling and focused web crawling, the time between the occurrence of the event and the start of the crawling process is a concern since stories disappear, links rot, and content drifts.
Web archives around the world routinely collect snapshots of web pages (Mementos) and hence potentially are repositories from which event-specific collections could be gathered some time after the event.
In this presentation, we discuss our framework to build event-specific collections by focused crawling web archives. We build event collections by utilizing the Memento protocol and the associated cross-web-archive infrastructure to crawl Mementos in 22 web archives. We will present our evaluation of the content-wise and temporal relevance of crawled resources and compare the resulting collections with collections created on the basis of live web crawls as well as a manually curated Archive-It crawl. As such, we provide novel contributions showing that:
Event collections can indeed be created by focused crawling web archives.
Collections built from the archived web can score better than those based on live web crawls and manually curated collections in terms of relevance of the crawled resources.
The amount of time passed since the event affects the size of the collection as well as the relevance of collected resources for both the live web and the web archive crawls.
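The cross-web-archive infrastructure described above rests on the Memento protocol (RFC 7089), in which each original URI has a TimeMap: a link-format list of its archived snapshots (Mementos). As a minimal illustration of how a focused crawler might enumerate Mementos to fetch, the sketch below parses a sample TimeMap with the Python standard library. The TimeMap text and archive hostname are invented for illustration; a real crawler would request TimeMaps from an aggregator and then fetch each Memento URI.

```python
import re

# A sample TimeMap in RFC 7089 application/link-format (invented for illustration).
TIMEMAP = (
    '<http://example.org/>;rel="original",\n'
    '<http://archive.example/web/20180101000000/http://example.org/>;'
    'rel="memento";datetime="Mon, 01 Jan 2018 00:00:00 GMT",\n'
    '<http://archive.example/web/20180601000000/http://example.org/>;'
    'rel="memento";datetime="Fri, 01 Jun 2018 00:00:00 GMT"'
)

def parse_timemap(text):
    """Yield (uri, datetime) pairs for every link whose rel includes 'memento'."""
    # Each entry looks like: <URI>;rel="...";datetime="..."
    for match in re.finditer(r'<([^>]+)>;([^<]*)', text):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r'rel="([^"]*)"', params)
        dt = re.search(r'datetime="([^"]*)"', params)
        if rel and 'memento' in rel.group(1).split():
            yield uri, dt.group(1) if dt else None

# Two Mementos of http://example.org/ at different capture times.
mementos = list(parse_timemap(TIMEMAP))
```

A focused crawler would then download each Memento URI, assess its content-wise relevance against the reference documents, and follow in-archive links from relevant pages.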
11:00 am - 12:30 pm INSTITUTIONAL PROGRAM HISTORIES Chair: Alex Thurman
From e-publications librarian to web archivist: a librarian’s perspective on 10 years of web archiving at the National Library of New Zealand 11:00-11:30
Susana Joe, National Library of New Zealand
The National Library of New Zealand has been building thematic and event-based collections of New Zealand websites in an active selective web archiving programme since 2007. This presentation will give an overview of how the Library’s efforts to collect and preserve the nation’s digital documentary memory have grown and developed from the point of view of a librarian taken on in 2005 as one of the Library’s inaugural ‘E-Publications Librarian’ roles which focused on selecting websites and quality reviewing web harvests. It was a time when legislative changes combined with internal technological milestones in web harvesting and digital preservation propelled the advancement of the Library’s web harvesting activities and the growth of the Library’s Web Archive collection. Today the Web Archive is a highly curated collection and web archiving staff still both select and manually quality review all harvests to ensure good quality, comprehensive captures of selected sites.
In recent years the job title has changed to ‘Web Archivist’, but how have we adapted to tackle the changes and complexities in collecting born-digital content in the modern age? What has – or has not – changed in the Library’s curatorial approach, technical developments, and staff to reflect new capabilities, and how successful have we been? There have been many technical, legal, access and financial challenges and barriers along the way and we are still grappling with many of these issues so we will look at what we have learnt and where we are now.
Web archiving Australia’s Sunshine State: from vision to reality 11:30-12:00
Maxine Fisher, State Library of Queensland
Australia’s ‘Sunshine State’ Queensland is known for its warm climate, golden beaches and World Heritage natural assets such as the Great Barrier Reef. However, it is also a state vulnerable to natural disasters, facing economic and infrastructure pressures, and a growing list of environmental concerns. The State Library of Queensland has long been committed to collecting and providing access to resources documenting Queensland’s history, society and culture, and reflecting the Queensland experience. Web archiving has become an increasingly vital aspect of our contemporary collecting. Through participation in Australia’s PANDORA archive, State Library has captured a unique array of Queensland stories and events, as played out on the web, and collaborates with other contributing agencies to capture important Australian web content for current and future generations. This presentation describes how web archiving began at the State Library of Queensland in 2002, and key factors that have enabled growth of our web collecting, with a focus on how State Library of Queensland has collaborated with other PANDORA contributors.
PANDAS – the PANDORA Digital Archiving System – is the vehicle that enabled a national approach to web-archiving by facilitating distributed responsibility and collaborative collection building. Experiences, efficiencies and challenges of PANDAS as the day-to-day tool supporting individual user activities will be reflected upon from the viewpoint of a PANDORA curator at the State Library of Queensland. This includes management of cases when collecting intentions between institutions have overlapped; and how collection building through combined efforts has increased as the archive has matured. In closing, the future of web archiving at State Library is considered in the light of new opportunities and challenges.
Becoming a web archivist: my 10-year journey in the National Library of Estonia 12:00-12:30
Tiiu Daniel, National Library of Estonia
This presentation starts with my personal story of becoming a web archivist and working in this area for the last ten years in the National Library of Estonia, reflecting on how the work has changed over time in aspects including curating, harvesting, preserving, describing and giving access to the web content in our institution. I will also mark out major institutional, legal and other factors that have influenced my work.
I’ll look back on a decade-long journey of ups and downs for the team of web archivists. We are lucky to have curatorial and technical specialists working side by side in one little department, and our roles are often mixed (web curators do harvesting and configure Heritrix when needed). Personally, it has been rather difficult to handle the technical duties as I come from a library background. But just recently, as the member of my team with the longest web archiving experience, I realized that I finally have the big picture and can have a say in almost every aspect of our work. It took me nearly ten years to get there!
In the second part of the talk I’ll present my personal observations about how the international web archiving field has developed over the years I have been involved in it. Looking back, awareness of the importance of preserving cultural heritage on the web has risen remarkably. But there are still countries that don’t have any national web preservation programs, or have only just started. So longer-established players in the field, including Estonia, have had the opportunity to help them out by sharing their experiences. And even longer-established players have helped us become better at our work – we owe a lot to the web archiving community, to the IIPC in particular.
12:30 pm - 1:30 pm LUNCH
1:30 pm - 3:00 pm SESSION 2
1:30 pm - 3:00 pm ARCHIVING COMMUNITIES AND DISSENT Chair: Samantha Abrams
Community Webs: empowering public libraries to create community history web archives 14:00-14:30
Makiba J. Foster & Jefferson Bailey
Auditorium
Maria Praetzellis, Internet Archive
Makiba J. Foster, Schomburg Center for Research in Black Culture, New York Public Library
Many public libraries have active local history collections and have traditionally collected print materials that document their communities. Due to the technical challenges of archiving the web, lack of training and educational opportunities, and lack of an active community of public library-based practitioners, very few public libraries are building web archives. This presentation will review the grant funded Community Webs program working with 27 public library partners to provide education, training, professional networking, and technical services to enable public libraries to fulfill this vital role.
The Schomburg Center for Research in Black Culture (a Community Webs cohort member) will provide an example of a Community Webs project in action and discuss their innovative project to archive social media hashtagged syllabi related to race and social justice. With continued national dialogue around race and gender, The Schomburg Center is collecting and preserving web-based syllabi focused on race and social justice issues. The recent phenomenon of the syllabus movement, hashtagged on social media with crowdsourced Google Docs and blogged syllabi (e.g. #CharlestonSyllabus, #FergusonSyllabus, #KaepernickSyllabus, #TrumpSyllabus), represents an innovative way to create a more learned society regarding race and social justice. Web-based publishing of syllabi extends the traditional classroom and enables participation for those excluded from formal learning opportunities. The Internet Archive will talk about the development, group activities, and outcomes of the full Community Webs program.
There is great potential to apply the Community Webs educational and network model to other professional groups such as museums, historical societies or other community based groups in order to diversify institutions involved in collecting web content. There is an opportunity for IIPC members and their local constituencies to implement similar programs or play a network or leadership role in expanding the universe of web collecting. We will close the session with a call for partnerships to help bring this model to other IIPC member organizations and continue to grow the field of web archiving.
Curating dissent at the State Library of Victoria 14:30-15:00
Peter Jetnikoff, State Library of Victoria
The State Library of Victoria has been collecting web publications through PANDORA for twenty years. There are numerous themes discernible in this collection that express a timeline of web usage, design and behaviour. This paper will address one in particular: protest.
Material of political dissent and social action in the state of Victoria has been collected by the Library from the nineteenth century onwards. The online material is an extension of that in terms of technology and also a continuance of tradition. But perhaps the most important aspect of this material is that, representing as completely as it can one side of the dispute, it emerges as primary historical source material, aligning it with manuscripts, small press and pamphlets in the greater collection. More than a simple record of the times, the look and feel of the technology, this material is witness to the timeline of dissent as a series of modes along with shifts in content.
This paper will discuss some of the more significant items collected by the Library over the past two decades such as Residents Against McDonalds, Occupy Melbourne and other protest publishing as well as dissenting material that appears at election time (with particular attention given to the 1999 Victorian state poll). The collection of this material offers its own peculiar issues and challenges, sometimes involving the Library itself being perceived as partisan coupled with the ongoing need to convince online publishers that they are, in fact, publishing. The issue of the need to secure publisher permission will continue but recent developments within the PANDORA partnership have provided new options. The intersection of political activity and the increasing utility of emerging technology has seen a steady shift from websites to social media which, in turn, offers new challenges to collect moments of dissent for permanent curation.
1:30 pm - 3:00 pm ACCESS PORTALS AND APIs Chair: Martin Klein
Nation wide webs 13:30-14:00
Jefferson Bailey, Internet Archive
This talk will outline efforts to build new nation-specific web archive access portals with enhanced aggregation, discovery, and capture methods. Many national libraries have been conducting web harvests of their ccTLD for years. These collections are often composed solely of materials collected from internally-managed crawling activities and have access endpoints that are highly restricted to reading-room-only viewing. These local-access portals largely adhere to the “known-URL” lookup and replay paradigm of traditional web archive access tools.
Working with partners, and as part of advancing R&D on improving access to web collections, the Internet Archive has been developing new portals to national web domains in concert with the work of national libraries with the mandate to archive their websphere. These collections are “sourced” from a variety of past and scheduled crawling activities — historical collections, specific domain harvests, relevant content from global crawling, in-scope donated and contributed web data, curatorial web collecting, user-submitted URL contributions, and other acquisition methods. In addition, these portals leverage new search tools including both full-text search, non-text item (image, audio, etc.) search, linkback from embedded resources, relevant content identified by geoIP matching or PageRank-style scoring, and categorization such as “highly visited” or “no longer on the live web.” While giving new life to the discovery and use of ccTLD-specific web access portals, the project is also exploring how new features, functionality, profiling, and enhanced discovery and reporting methods can advance how we think of access to web archives.
Cultivating open-access through innovative services 14:00-14:30
Fernando Melo & Daniel Gomes, Arquivo.pt at FCCN-FCT
Arquivo.pt preserves more than 4 billion files in several languages, collected from the web since 1996, and provides a public search service that enables open access to this information. The service provides user interfaces for textual, URL and advanced search. It also provides Application Programming Interfaces (APIs) to enable fast development of value-added applications over the preserved information by third-parties.
However, the constant evolution of the web and of society demands constant development of a web archive to follow its pace of evolution and maintain the accessibility of the preserved content. Thus, creating a new mobile version and improving our APIs were mandatory steps to support broad open access to our collections.
In November 2017, Arquivo.pt launched a new mobile version. The main novelty was the adaptation of user interfaces to mobile devices and preservation of the mobile web. In order to achieve responsive and mobile-friendly user interfaces for Arquivo.pt we had to address questions such as:
Who wants to view archived web pages on such a small device?
Should we replay archived web pages on mobile web pages inside an iframe or full screen?
What additional services/functionalities can we add to a mobile version?
Should we privilege the replay of older archived web pages, or newer responsive ones?
How can we show an extensive list of archived versions of a given URL on such a small device?
In order to facilitate automatic access to our full-text search capabilities we decided to release a new text search API in JSON. One can perform automatic queries from a simple word search, to more complex ones such as finding all archived web pages from a given site that contain a text expression in a certain time range.
We would like to present this new API, and show how anyone can easily integrate their work with our preserved information.
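As a rough sketch of how a third party might use such a full-text search API, the snippet below builds a query URL and parses a trimmed sample response with the Python standard library. The endpoint path, parameter names (`q`, `siteSearch`, `from`, `to`, `maxItems`) and response field names are written from memory of the Arquivo.pt API and should be verified against the current API reference; the JSON payload itself is invented for illustration.

```python
import json
from urllib.parse import urlencode

# Build a query URL for the Arquivo.pt full-text search API.
# Parameter names are assumptions to check against the official API docs.
params = {
    "q": "eleições",             # text expression to search for
    "siteSearch": "publico.pt",  # restrict results to one site
    "from": "19960101000000",    # start of time range (YYYYMMDDHHMMSS)
    "to": "20171231235959",      # end of time range
    "maxItems": 10,
}
url = "https://arquivo.pt/textsearch?" + urlencode(params)

# A trimmed, invented JSON response in the general shape the API returns.
sample = json.loads("""
{"response_items": [
  {"title": "Resultados das eleições",
   "originalURL": "http://www.publico.pt/eleicoes",
   "tstamp": "20051001120000",
   "linkToArchive": "https://arquivo.pt/wayback/20051001120000/http://www.publico.pt/eleicoes"}
]}
""")

for item in sample["response_items"]:
    print(item["tstamp"], item["originalURL"])
```

A real client would fetch `url` over HTTP and page through results, following each `linkToArchive` to the replayed page.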
Despite the challenges, it is very gratifying to develop an open-access web archive, which can reach a large audience of users and researchers. Arquivo.pt reached 100 000 users for the first time in 2017, and there was a significant increase of research projects using Arquivo.pt as a source of information.
Creating a new user interface for the UK Web Archive 14:30-15:00
Jason Webber, The British Library
The UK Web Archive (UKWA) started collecting selected websites (with owners’ permission) in 2005. In the subsequent 13 years this ‘Open UK Web Archive’ has grown to approximately 15,000 websites, all of which are available publicly through www.webarchive.org.uk.
In 2013 UK law changed to allow the collection of all websites that can be identified as owned or produced in the UK. Since then the ‘Legal Deposit Web Archive’, through an annual domain crawl, has added millions of websites (and billions of individual items). This collection, however, can only be viewed in the reading rooms of UK Legal Deposit Libraries (seven locations in the UK and the Republic of Ireland).
It is a key aim of the UKWA to be practically useful to researchers and to give the best access possible within the legal restrictions. Up to now this has been a considerable challenge, and to meet it UKWA have worked for two years on a new user interface.
This talk aims to highlight the challenges of using a large national collection for research and how UKWA have resolved or mitigated these difficulties, including:
The UKWA service has multiple collections (‘Open’ and ‘Legal Deposit’) that offer different content and have different access conditions. How best to communicate these differences to researchers?
The ‘Open’ and ‘Legal Deposit’ collections are viewed through two different interfaces. Can or should there be a single interface for both collections?
The UKWA service has fully indexed both ‘Open’ and ‘Legal Deposit’ collections, which gives enormous potential for researchers to search by keyword or phrase. Any search, however, returns thousands or even millions of results. Without Google-style relevance ranking, how do researchers find meaningful results?
UKWA has over 100 curated collections on a wide scope of subject areas. How should these collections be highlighted and presented to researchers?
3:00 pm - 3:30 pm AFTERNOON TEA
3:30 pm - 5:30 pm SESSION 3
3:30 pm - 5:30 pm PANELS Chair: Flora Feltham
Collaborative, selective, contemporary: lessons and outcomes from new web archiving forays focused on China and Japan 15:30-16:30
Nicholas Taylor & Zhaohui Xue
Auditorium
Yan Long, Regan Murphy Kao, Nicholas Taylor & Zhaohui Xue, Stanford University Libraries
A reasonable assessment of web archiving efforts focused on China and Japan suggests that the level of collecting is not commensurate with the prominence of Chinese and Japanese web content broadly. Mandarin Chinese is the second-most common language of world internet users; Japanese is the seventh. The distribution of the languages of websites is dominated by English, with other languages in the long-tail, but Mandarin Chinese and Japanese are both in the top nine languages, representing 2% and 5.1% of websites respectively.
Meanwhile, the number of web archiving efforts focused on China and Japan is comparatively modest. The community-maintained list of web archiving initiatives highlights only three (out of 85) efforts focused on China or Japan. A search for “china” or “chinese” on the Archive-It portal yielded 56 collections (out of 4,846); a search for “japan” or “japanese” yielded 43 – 1% or less of Archive-It collections for both.
Recognizing the opportunity for more selective archiving of Chinese and Japanese web content, the Stanford East Asia Library has over the last several years led efforts to curate two major new collections, documenting Chinese civil society and contemporary Japanese affairs, respectively. This panel will discuss the particular motivations and impact of these collecting efforts, as well as address the following questions of more general interest to web archive curation practice:
- How can collaboration with researchers inform web content collecting efforts?
- What role do content creators themselves play in facilitating web content collecting efforts?
- How can coordination with and consideration of other institutions’ web content collecting efforts inform local collecting?
- What challenges – in terms of communications, funding, metadata, policy, quality assurance, staffing, workflow – are entailed in undertaking a new web archiving initiative and how can they be addressed?
- How is web content collecting continuous or discontinuous with the kinds of collecting that libraries have traditionally engaged in?
Apart from these questions of curatorial concern, this panel will also detail technical aspects of the two projects, including quality assurance observations and how Stanford Libraries has managed the collections through a hybrid infrastructure consisting of Archive-It, the Stanford Digital Repository, a local OpenWayback instance, and Blacklight-based discovery and exhibits platforms.
3:30 pm - 5:30 pm ARCHIVES UNLEASHED TOOLKIT Chair: Olga Holownia
Opening up WARCs: The Archives Unleashed Cloud and Toolkit project 15:30-16:00
Ian Milligan, University of Waterloo
Since 2013, our research team has been exploring web archives analytics through the Warcbase project, an open-source platform that we have developed in conjunction with students, librarians, and contributors. Through in-person presentations, workshops, and GitHub issues and tickets, we identified several barriers to scholarly engagement with web archives: the complexity of tools themselves and the complexity of deployment.
Our Archives Unleashed Project, funded by the Andrew W. Mellon Foundation, aims to tackle tool complexity and deployment through two main components, the Archives Unleashed Toolkit and the Archives Unleashed Cloud. This presentation introduces these two projects both through a conceptual introduction, as well as a running in-depth live demo of what the Toolkit and Cloud can do. Our approach presents one model of how institutions can facilitate the scholarly use of ARC and WARC files.
The Archives Unleashed Toolkit is the new, cleaner, and more coherent version of Warcbase. Starting with a clean slate in our redesign, we are adopting Python as the primary analytics language. This offers advantages in that it can reach out to digital humanists and social scientists, and also allow us to tap into a broad ecosystem of Python tools for linguistic analysis, machine learning, visualization, etc. It supports a combination of content-based analysis (i.e. selecting pages with certain keywords or sentiment) and metadata-based analysis (particular date ranges or hyperlinking behaviour).
Yet we realized that the command-line based Archives Unleashed Toolkit presents difficulties for many users in that it requires technical knowledge, developer overhead, and a knowledge of how to run and deploy a system.
The Archives Unleashed Cloud thus bridges the gap between easy-to-use curatorial tools and developer-focused analytics platforms like our Toolkit. Archivists can collect their data using GUI interfaces like that of Archive-It. Our vision is that the Archives Unleashed Cloud brings that ease to analytics – taking over where existing collection and curatorial dashboards end.
While the Cloud is an open-source project – anybody can clone, build, and run it on their own laptop, desktop, server, or cluster – we are also developing a canonical version that anybody can use. We will note our sustainability discussions around how we can make this project viable beyond the funded life of the project.
What can you do with WARCs? 16:00-17:30
Andrew N. Jackson, Ian Milligan & Olga Holownia
Tiakiwai
This workshop will introduce a range of tools for full-text indexing and analysis of web archived material. For full-text search and visualisation, this will be based on the webarchive-discovery indexing system, and the Shine and Warclight user interfaces that enable the exploration of the archived data.
For general analysis, we will look at the Archives Unleashed Toolkit and its front end, the Archives Unleashed Cloud. In this workshop, we will go through the following process on sample data (or a selection of attendees’ own WARCs if they bring a few):
- Discovering the frequency of domains within a collection;
- Extracting plain text of HTML pages from a web archive based on:
- Particular domains (e.g. all pages from archive.org);
- Date (e.g. all pages from 2009); and
- Language (e.g. French or English-language pages as detected by Tika)
- Extracting and visualizing a hyperlink network.
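The filtering steps above can be sketched in miniature with plain Python. This toy version operates on an in-memory list of records rather than real WARCs, and stands in conceptually for the Spark-based operations the Toolkit actually provides; the record fields and values are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# A toy stand-in for records extracted from a WARC:
# (crawl_date, url, detected_language, extracted_text).
# In the real Toolkit these would be loaded from WARC files via Spark.
records = [
    ("20090315", "https://archive.org/about", "en", "About the Internet Archive"),
    ("20090317", "https://archive.org/web", "en", "Wayback Machine"),
    ("20091104", "https://example.org/fr/accueil", "fr", "Page d'accueil"),
]

# 1. Frequency of domains within the collection.
domains = Counter(urlparse(url).netloc for _, url, _, _ in records)

# 2. Plain text of pages filtered by domain, by date, and by language.
archive_org_text = [t for _, u, _, t in records if urlparse(u).netloc == "archive.org"]
pages_2009 = [t for d, _, _, t in records if d.startswith("2009")]
french_text = [t for _, _, lang, t in records if lang == "fr"]
```

The hyperlink-network step would similarly build (source domain, target domain) pairs from extracted links and count them, producing an edge list suitable for visualization tools such as Gephi.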
6:30 pm - 8:00 pm PUBLIC EVENT
Truth, Justice and the Internet
Jefferson Bailey, Vint Cerf, Dr Rachael Ka’ai-Mahuta, Wendy Seltzer & Andrew Cushen
Auditorium
More information and registration at InternetNZ.
This public event, hosted by InternetNZ and the National Library of New Zealand, and sponsored by GOVIS, brings together an international panel to explore integrity in the online world. What has happened to truth and should we care? To whom does truth belong?
- Andrew Cushen: InternetNZ
- Jefferson Bailey: Director, Web Archiving, Internet Archive
- Vint Cerf: Chief Internet Evangelist, Google
- Dr Rachael Ka’ai-Mahuta: Senior Researcher, Te Ipukarea, the National Māori Language Institute, Auckland University of Technology
- Wendy Seltzer: Strategy Lead and Counsel, World Wide Web Consortium
18:30–19:00: Introduction and 5-minute pitch from panellists
19:00–20:00: Questions and discussion
(nibbles and drinks provided from 6pm)
The event will be held at the National Library of New Zealand, 100 Molesworth Street, Wellington
Truth and understanding are not such wares as to be monopoliz’d and traded in by tickets and statutes and standards. We must not think to make a staple commodity of all the knowledge in the land, to mark and license it like our broadcloth and our woolpacks.
John Milton, Areopagitica. A Speech For The Liberty Of Unlicensed Printing To The Parliament Of England 1644.
The internet has always been a place grounded (ironically) in the postmodern. It should be of little surprise therefore that current debate around the globe is focussed on the facts and fictions that people and communities are promulgating online.
From the most powerful to the most marginalised, the internet is where people can shape their own image and reality. They do this to subvert notions of fact and truth, to challenge dominant voices, and for every other reason on the scale from nefarious to righteous.
During the second week of November, many of the institutions attempting to collect and preserve the Internet are meeting in Wellington at the International Internet Preservation Consortium Web Archiving Conference. One of the major tensions for these institutions is how best to traverse the multi-faceted and often “monopoliz’d and traded” notion of truth on the Internet. What are the responsibilities of these organisations? Are they mere collectors, or do they have a role to surface truth, and privilege it over other, deliberately created, fictions?
Wednesday 14 Nov 2018
8:45 am - 9:00 am WELCOME TO DAY 2
Registration: Auditorium Reception Desk
Tea & Coffee: Programmes Room
9:00 am - 10:30 am PLENARY TALKS
Platinum sponsor: Hitachi Vantara and Revera: Evolving solutions for cultural preservation 9:00-9:45
Jon Chinitz, Product Manager, Cloud and Data Intelligence Group (Hitachi Vantara)
Digital record keeping and archiving must transcend generations and last for many decades to come. Long-term digital preservation challenges go beyond backup & restore or disaster recovery and extend to principles such as context and provenance. The National Library of New Zealand is well recognized for its pioneering work in ensuring the ongoing preservation, protection and accessibility of the documentary heritage of New Zealand. This talk will highlight its collaboration with Hitachi Vantara and Revera Cloud Services, and the development of a bespoke digital storage solution drawing on cutting edge technology for highly secure object storage. The Library’s growing digital heritage collections are now supported with an agile, scalable solution that will enable it to connect national and international researchers with the digital taonga (treasures) of New Zealand, and ensure the ongoing use and re-use of this material.
Employed with Hitachi since 2005, Jonathan has over 30 years of industry expertise in the areas of distributed computing, network security, storage systems and cloud technologies. He is currently managing Hitachi’s new Content Intelligence offering, a cloud-ready search and discovery application.
A veteran of several software startups, Jonathan has held technical and management positions with:
- Hitachi Vantara (Product Manager, Cloud and Data Intelligence Group)
- Hitachi Data Systems (Technical Evangelist, File & Content Solutions)
- Archivas (Business Development Manager)
- Secured Services (Director Product Management)
- Vasco Data Security (Vice President and General Manager, Software Group)
- IntelliSoft (Founder and President)
- Open Software Foundation (Consultant)
In his spare time Jonathan loves all things Boston Red Sox and New England Patriots, watching movies and spending time with his family including two English Springer Spaniels.
Concrete steps for the long-term preservation of digital Information 9:45-10:30
Vint Cerf, Vice President and Chief Internet Evangelist, Google introduced by Steve Knight, National Library of New Zealand
It is becoming more widely understood that bits are not impervious to degradation and loss. If we are to develop methods to cope with this fact, we will need to establish a set of desirable properties of a digital preservation ecosystem that will deal with technical issues, business models and legal frameworks friendly to a preservation regime. In this talk, I hope to lay out at least the properties that I believe will be conducive to successful creation of such a regime and perhaps some specifics that could be pursued in the near term.
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS. Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur, the Franklin Medal and 29 honorary degrees.
11:00 am - 12:30 pm SESSION 4
11:00 am - 12:30 pm NATIONAL COLLABORATION Chair: Paul Koerbin
The End of Term Archive: collaboratively preserving the United States government web 11:00-11:30
Abbie Grotke, Library of Congress
Mark Phillips, University of North Texas Libraries
In the fall of 2016 a group of IIPC members in the United States organized to preserve a snapshot of the United States federal government web (.gov). This is the third time the End of Term (EOT) project members have come together with the goals of identifying, harvesting, preserving and providing access to a snapshot of the federal government web presence. The project is a way of documenting the changes caused by the transition of elected officials in the executive branch of the government, and provides a broad snapshot of the federal domain once every four years that is ultimately replicated among a number of organizations for long-term preservation.
Presenters from lead institutions on the project will discuss its methods for identifying and selecting in-scope content (including using registries, indices, and crowdsourcing URL nominations through a web application called the URL Nomination Tool), new strategies for capturing web content (including crawling, browser rendering, and social media tools), and preservation data replication between partners using new export APIs and experimental tools developed as part of the IMLS-funded WASAPI project.
The breadth and size of the End of Term Web Archive has informed new models for data-driven access and analysis by researchers. Access models that have included an online portal, research datasets for use in computational analysis, and integration with library discovery layers will be discussed.
Presenters will speak to how the project illuminates the challenges and opportunities of large-scale, distributed, multi-institutional, born-digital collecting and preservation efforts. A core component has also been how the project activities align with participant institutions’ collection mandates, as well as with other similar efforts in 2016-2017, such as the Data Refuge project, to preserve government web content and datasets. The EOT, along with related projects, has dramatically raised awareness of the importance of archiving historically valuable but highly ephemeral web content that lacks a clear steward, particularly during times of transition of government.
Addressing our many solitudes: building a web archives community of practice in Canada 11:30-12:00
Nich Worby & Jeremy Heil (Auditorium)
Corey Davis, Council of Prairie and Pacific University Libraries (COPPUL)
Nich Worby, University of Toronto
Jeremy Heil, Queen’s University
Canada is a large country with many stakeholders involved in web archiving, from city archives to national libraries. Until 2017, most of these efforts took place in relative isolation, which resulted in needless duplication of efforts and significant collection gaps. This session will provide an overview of the establishment of the Canadian Web Archiving Coalition (CWAC), a national effort to formalize collaboration and coordination for web archiving across the country.
Under the auspices of the Canadian Association of Research Libraries Digital Preservation Working Group and Advancing Research Committee (CARL DPWG and CARL ARC), the Canadian Web Archiving Coalition (CWAC) was established in 2017 to develop an inclusive community of practice within Canadian libraries, archives, and other memory institutions engaged or otherwise interested in web archiving, in an effort to identify gaps and opportunities best addressed by nationally coordinated strategies, actions, and services, including collaborative collection development, training, infrastructure development, and support for practitioners.
This session will provide an overview of the CWAC in an effort to help our international colleagues understand and connect with web archiving efforts in Canada, but also to serve as an example for other jurisdictions attempting to develop an effective national community of practice and coordinating mechanism where before there was only haphazard and informal collaboration and coordination.
National Mini-IIPC: setting up collaboration in web archiving in The Netherlands 12:00-12:30
Arnoud Goos, Netherlands Institute for Sound and Vision
Of all the Dutch websites, only one percent is actually archived by one of the national web archives. Much is lost, or already gone. In The Netherlands there are many organisations with relatively small web archives, so collaboration between them is important. Each of these national or regional archives has its own reasons for archiving websites, and its own collection scope and selection criteria. The Royal Library is by far the largest web archive in The Netherlands, but besides it there is the Netherlands Institute for Sound and Vision, which collects media-related websites, the University of Groningen, which collects political websites, and quite a few regional archives that collect websites from local schools, sports clubs, local festivals, or other websites about life in the city or region. Until recently these web archives were acting on their own, sometimes without even knowing of each other’s existence.
This has changed over the past two years. The Digital Heritage Network and the Netherlands Institute for Sound and Vision have worked on setting up collaboration between these different initiatives. By starting a web archiving expert group, a sort of national mini-IIPC has been created. Besides organizing conferences, the Digital Heritage Network has produced videos on the importance of web archiving and developed the National Register for Archived Websites. This register is an overview of all the websites that have been archived in the Netherlands. It shows the archived URL, the period in which it was crawled, the tools that were used, how accessible the archived website is, and the reason for archiving the website. The overview is initially accessible only to web archiving professionals (to see what colleagues are, and are not, archiving), but once all the data is up to date it will be opened to the public. Because of copyright issues, the register for now contains only metadata and links to the web archives’ live websites rather than the archived pages. Hopefully, this can change in the near future.
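A register entry of the kind described above can be modelled with a handful of fields. The following is a minimal sketch; the field names are illustrative assumptions based on the description, not the register's actual schema:

```python
from dataclasses import dataclass, asdict, field

@dataclass
class RegisterEntry:
    # Field names mirror the description above and are assumptions.
    archived_url: str
    crawl_period: str            # e.g. "2016-2018"
    tools: list                  # e.g. ["Heritrix 3"]
    access: str                  # how accessible the archived site is
    reason: str                  # why the site was selected for archiving
    public: bool = False         # metadata only, until opened to the public

entry = RegisterEntry(
    archived_url="https://example.nl/",
    crawl_period="2016-2018",
    tools=["Heritrix 3"],
    access="on-site only",
    reason="media-related website",
)
```

Keeping entries structured like this would let the register be exported (e.g. via `asdict`) for professional use now and opened to the public later by flipping the `public` flag.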
11:00 am - 12:30 pm DIGITAL PRESERVATION Chair: Barbara Sierman
A digital preservation paradigm shift for academic publishers and libraries 11:00-11:30
Catherine Nicole Coleman (Tiakiwai)
Catherine Nicole Coleman, Stanford University
The Stanford University Press and Stanford University Libraries are engaged in a grant-funded partnership to pave the way for publishing book-length peer reviewed online academic projects that we are calling interactive scholarly works. University presses and libraries have well established protocols and processes for print publication, many of which are rooted in our assumptions about the durability and longevity of the printed word. With the advent of electronic books, we have had to find ways to preserve not bound paper, but the bits. Now interactive scholarly works present an entirely new set of challenges for preservation because the scholarship is embedded in the digital form. It is not possible to have a print version of the original to fall back to since the online interactive presentation of the work—its unique format—is an essential part of the argument.
University presses and university libraries are close collaborators in scholarly publishing. Libraries acquire the books, then provide access, discovery, preservation and conservation. New processes and workflows, such as web archiving, are required if we are to provide those same library services for these new interactive scholarly works. Since Stanford Libraries has an existing web archiving service, we had hoped that our solution would start with a web archived version of each project as the published output. But what about the many projects that resist web archiving? A productive tension arose between nudging authors to produce works that fit our current preservation strategies and giving authors the freedom to produce innovative works that require new preservation strategies.
We do not yet know how researchers will want to explore interactive scholarly works five, ten, fifty, or a hundred years from now. If we follow the example of print, we can assume that some will be interested in the original format while others will be interested in the underlying code; some will be interested in the author’s argument and intent, but see the vehicle of expression as outdated and irrelevant. In anticipation of this uncertain future, we are exploring an approach to publication that anticipates multi-modal access and preservation strategies, with attention to the perceptual and conceptual aspects of the work as well as the constituent content elements. This paper will present perspectives from authors, the publisher, and the library (including the web archiving team, digital forensics, and operations) that have driven our design of a preservation strategy for these innovative works. The paper will also address the conflicting assumptions about what we are preserving and why.
Utilising the Internet Archive while retiring legacy websites and establishing a digital preservation system 11:30-12:00
Michael Parry, Max Sullivan & Stuart Yeates, Victoria University of Wellington Library
In 2017 Victoria University of Wellington Library implemented Wairētō, an installation of the Rosetta Digital Preservation System from Ex Libris. Two of the core collections to be migrated into the new system are The New Zealand Electronic Text Collection and the ResearchArchive. Both of these collections have legacy websites that need to be decommissioned.
In this presentation we will discuss using the Internet Archive and Wairētō to archive these websites, ensuring ongoing access and long-term preservation. The process has four stages: the first is complete, the second and third are to be completed before the end of 2018, and the fourth is planned for 2019.
First stage: subscribe to the Internet Archive and ensure the websites are archived by creating collections and crawling each site.
Second stage: download the (W)ARC file for each site from the Internet Archive into Wairētō.
Third stage: ensure that links to the original sites are redirected either to the new equivalent within Wairētō or to the Internet Archive.
Fourth stage: integrate the open-source Wayback Machine from the Internet Archive into the Wairētō infrastructure.
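The third stage's redirect logic might look like the following sketch: map a legacy URL to its new Wairētō equivalent where one exists, otherwise fall back to the Internet Archive's Wayback Machine. The mapping entries and URL patterns here are illustrative assumptions, not the Library's actual configuration:

```python
# Hypothetical mapping of legacy URLs to their new homes in Wairētō.
# Both the entries and the target URL pattern are illustrative assumptions.
WAIRETO_MAP = {
    "http://nzetc.victoria.ac.nz/tm/scholarly/name-123.html":
        "https://waireto.example.ac.nz/object/123",
}

WAYBACK_PREFIX = "https://web.archive.org/web/"

def resolve_redirect(legacy_url):
    """Return the preferred redirect target for a decommissioned URL."""
    target = WAIRETO_MAP.get(legacy_url)
    if target:
        return target
    # No internal equivalent exists: fall back to the archived copy.
    return WAYBACK_PREFIX + legacy_url
```

In practice a table like this would be generated from the migration itself, so that every URL with a migrated equivalent stays inside the institution and only the remainder is served from the Internet Archive.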
This four stage process will also act as a pilot for the Library to potentially establish a web archiving service for the wider University.
The presenters will share how this process has been implemented, issues and solutions raised, and where to next. We will be discussing the use of third party tools for web archiving and how to link them into internal tools and workflows.
Digital resources – the national project of web harvesting and web archiving in Slovakia 12:00-12:30
Peter Hausleitner & Jana Matúšková (Tiakiwai)
Andrej Bizík, Peter Hausleitner & Jana Matúšková, University Library in Bratislava
In April 2015 the University Library in Bratislava (ULB) was charged with the national project ‘Digital Resources – Web Harvesting and E-Born Content Archiving.’ The goals of the project were acquisition, processing, trusted storage and usage of original Slovak digital resources. Its ambition was to establish a comprehensive information system for harvesting, identification, management and long-term preservation of web resources and e-born documents (a platform for controlled web harvesting and e-born archiving). The Digital Resources Information System consists of specialised, mainly open-source software modules in a modular architecture with a high level of resource virtualization. Its basis is a server cluster consisting of dedicated public and internal portal servers and a “farm” of work servers for running the system processes. The system management is optimized for parallel web harvesting, which enables the system to carry out a full domain harvest with the required politeness in acceptable time.
At present, the ULB web archiving system has 800 TB of storage. The application is supported by a powerful hardware infrastructure: a farm of 21 blade servers providing a virtual environment for multiple harvesting processes, and 3 standalone database servers, interconnected via high-speed channels. The system also includes support modules for communication, monitoring, backup and reporting. A very useful feature is a functionally identical parallel testing environment, which enables preventive harvests and problem analysis without interfering with production processes.
A substantial part of the system is the catalogue of websites, which is regularly updated during the automated survey of the national domain .sk. Domains that match our policy criteria are added to the catalogue manually (e.g. .org, .net, .com, .eu).
The Deposit of Digital Resources department of ULB handles the operation, management and development of the Digital Resources Information System, with one head, three specialised digital curators and one part-time person for born-digital titles.
The project finished in the fall of 2015, and routine practice continues. Since 2015 ULB has performed three full-domain harvests (of the national .sk domain) and multiple selective and thematic crawls.
12:30 pm - 1:30 pm LUNCH
1:30 pm - 3:00 pm SESSION 5
1:30 pm - 3:00 pm WEBRECORDER Chair: Jan Hutař
Pywb 2.0: technical overview and Q&A, or everything you wanted to know about high-fidelity web archiving but were afraid to ask 13:30-14:00
Ilya Kreymer, Rhizome
Webrecorder pywb (python wayback) is a fully open-source Python package designed to provide state-of-the-art, high-fidelity web archive replay. Version 2.0 was released at the beginning of 2018 with an extensive list of new features.
Originally developed as a replacement for the classic Wayback Machine, the latest release includes several new features going beyond that original scope, including a built-in capture mode for on-the-fly WARC capture and patching, full HTTP/S proxy mode, a Memento aggregation and fallback framework, an access control system, and a customizable rewriting system.
The presentation will briefly discuss the new features in pywb and how they can help institutions provide high-fidelity web archive replay and capture. However, the purpose of the talk is not to be a tutorial on how to use pywb, but rather to share knowledge of the many difficult problems facing web archive capture and replay for an ever-evolving web, and to present the solutions that have worked in pywb. Topics covered will include the mechanics of fuzzy matching in pywb’s rewriting system, client-side rewriting, video stream rewriting, and domain-specific rules. Ongoing work and remaining unsolved technical challenges facing web archives will be discussed as well.
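As an illustration of the fuzzy-matching idea, replay can succeed even when query strings differ between capture and playback by canonicalising away volatile parameters before lookup. This is a simplified sketch of the concept, not pywb's actual implementation or rule set:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Query parameters that are commonly volatile between capture and replay
# (cache-busters, session ids). The list is an illustrative assumption,
# not pywb's shipped rules.
VOLATILE = {"_", "cb", "timestamp", "sessionid", "callback"}

def fuzzy_key(url):
    """Canonicalise a URL so near-identical requests map to one archive key."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in VOLATILE]
    kept.sort()  # make matching insensitive to parameter order
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

With this key, a request captured as `...?cb=123&id=5` and replayed as `...?id=5&_=999` resolves to the same archived resource.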
The talk will end with a Q&A portion that will help inform future pywb development and help the project become more useful to the IIPC community.
PANEL: Capturing complex websites and publications with Webrecorder 14:00-15:00
Jasmine Mulliken, Anna Perricci, Sumitra Duncan & Nicole Coleman (Auditorium)
Jasmine Mulliken, Stanford University Press
Anna Perricci, Rhizome
Sumitra Duncan, New York Art Resources Consortium
Nicole Coleman, Stanford University
Recent advancements in web archiving technology are looking especially promising for the preservation of artistic and scholarly work. While the Internet Archive continues to play a crucial role in this endeavor, there are limitations that make it challenging for publishers, art museum libraries, and artists to capture the increasingly dynamic and complex interactive features that often define the work they produce, present and preserve. Fortunately, there is an option in addition to the Internet Archive for organizations working in art and scholarly publishing, two fields that often deal with unique, complex, and bespoke web content.
Webrecorder, a Rhizome project funded by the Andrew W. Mellon Foundation, offers a symmetrical approach to web archiving in that a web browser is used both to capture websites and to access the archived web content. Webrecorder’s tools fill some gaps left by large-scale services and allow a more granular and customizable experience for curators and publishers of digital content. An updated version with improved functionality and features will be released in March 2018. Rhizome’s planning for technical development and financial sustainability is in progress, and by the end of 2018 a robust plan for growth over the next 3-5 years will be established.
Stanford University Press is breaking new ground in scholarly publishing with its Mellon-funded initiative for the publication of online interactive scholarly works. Unlike typical open-access textbooks or ebooks, these works carry all the heft of a traditional monograph but in a format that leverages the potential of web-based digital tools and platforms. In fact, these works could not be published in traditional monograph form because the arguments are embedded in the technology. Included in SUP’s grant is a mandate to archive and preserve these ephemeral works. Webrecorder is proving to be especially compatible with the bespoke projects the Press is publishing.
The New York Art Resources Consortium (NYARC), the research libraries of the Brooklyn Museum, The Frick Collection, and The Museum of Modern Art, has developed a collaborative workflow for building web archive collections, with captures of several thousand websites now publicly available. NYARC’s web archive collections include the consortium’s institutional websites and six thematic collections pertaining to art and art history: art resources, artists’ websites, auction houses, catalogues raisonnés, NYC galleries, and websites related to restitution scholarship for lost or looted art. NYARC has actively contributed to user testing of the Webrecorder tool over the past three years and has now integrated it into their workflow in complement to their use of the Archive-It service. The use of Webrecorder has been especially pertinent to capturing complex museum exhibition sites, scholarly sites devoted to specific artists, and the museums’ social media accounts.
This panel will include an overview of Webrecorder’s most significant new features and plans for sustainability. Co-panelists from Stanford University Press and NYARC will explain and demonstrate their uses of Webrecorder in the context of their unique projects, which represent their fields’ unique web archiving needs.
1:30 pm - 3:00 pm LEGAL CONSIDERATIONS Chair: Gillian Lee, Moderator: Wendy Seltzer
Preserving the public record vs the ‘right to be forgotten’: policies for dealing with notice & takedown requests 13:30-14:00
Nicola Bingham, The British Library
The mission of the UK Web Archive is to build web collections that are as comprehensive and as widely accessible as possible. However we must achieve this responsibly, lawfully and ethically. Increasingly, the public are concerned about their data privacy and the risk of exposure of sensitive personal data online.
The EU General Data Protection Regulation, and the new UK-only Data Protection Act which will align GDPR with UK law, have implications for web archiving. Most significantly, “a right [for the data subject] in certain circumstances to have inaccurate personal data rectified, blocked, erased or destroyed”.
Public bodies will likely have derogation under “performance of a task carried out in the public interest”. But the data subject has pre-eminence, and can request that information is removed if they claim “significant harm or distress”.
In light of this new legislation, we have been looking at tensions around the archival principles of preserving the public record vs the individual’s expectation of the right to be forgotten, i.e. withdrawing their content from the archive on request. Under what circumstances should we honour such requests?
The presentation will explore how we minimise the risk of crawling and exposing personal data in the first place and how we deal with requests for take down of material. What policies and procedures are in place? What criteria do we use to evaluate individual cases? Are we transparent and consistent in our take down policies?
Legal deposit in an era of transnational content and global tech titans 14:00-15:00
Amy Joseph, Nicola Bingham, Peter Stirling, Kristinn Sigurðsson, Tom Smyth & Maria RyanTiakiwai
Amy Joseph, National Library of New Zealand
Nicola Bingham, The British Library
Peter Stirling, BnF
Kristinn Sigurðsson, National and University Library of Iceland
Maria Ryan, National Library of Ireland
Legal deposit has evolved over time to provide a mandate for many national collecting agencies to collect content from the web, including web archives. The empowering legislation differs from country to country, so as a community we have both shared concerns and unique challenges. This panel will feature several national agencies that collect content from the web, including some who do not have legal deposit legislation covering online content. Participants will discuss issues such as:
- how well their legislation supports web archiving and other content collection from the web
- any grey areas in how their legislation applies to the contemporary global online publishing environment
- approaches to collecting web content when not firmly supported by empowering legislation
- challenges in providing access to material collected under legal deposit, and in communicating access restrictions to users.
The panel will not just be a presentation of issues, but also an exploration of our appetite for a joined-up approach to archiving websites and other content from the web. Could we do collaborative collection development for collecting areas that transcend national boundaries, beyond major event-based collections? Could national deposit agencies make a collaborative commitment to approaching major technology companies (like Google, Amazon or Bandcamp) to ensure that content from global content platforms is collected and preserved?
3:00 pm - 3:30 pm AFTERNOON TEA
3:30 pm - 5:30 pm SESSION 6: WORKSHOPS
3:30 pm - 5:30 pm WEBRECORDER TUTORIAL
Human scale web collecting for individuals and institutions 15:30-17:30
Anna Perricci (Tiakiwai II)
Anna Perricci, Rhizome
Over the past 15 years web archiving tools, standards and practices have changed significantly in some ways but in other respects they have remained fairly consistent. One example of consistency is the use of crawlers or other automated software to harvest websites. While most of the content brought into web archives to date has been harvested using crawlers there is a new approach that can lower the barrier to entry for collecting materials from the web. This approach, ‘symmetrical web archiving’, allows users to both collect and access web archives via web browsers.
In this session we will talk about ways to make high-fidelity interactive captures of web content using Webrecorder.io, free, open-source software that runs in a web browser. How to manage the collected materials, download them (as a WARC file) and open a local copy offline using Webrecorder Player will also be covered. With a model of human scale collecting (one page at a time), Webrecorder makes it possible to acquire and save web content as you browse it on the live web. Webrecorder’s high-fidelity capture has been developed to be precise enough to capture internet art and serve as a ‘museum quality’ tool for collecting, which benefits all users who wish to have a high level of similarity between the original and the archived copy. Tutorial attendees will be given a high-level overview of Webrecorder’s features, then engage in hands-on activities and discussions.
Human scale web collecting with Webrecorder is not expected to meet all the requirements of a large web archiving program but can satisfy many needs of researchers, smaller web collecting initiatives and be used in personal digital archiving projects. Larger collecting programs, such as those at national libraries, can still use Webrecorder to capture dynamic content and user triggered behaviors on websites. The WARC files created in Webrecorder then can be downloaded and ingested to join WARCs that have been created using crawler based systems. With a tool like Webrecorder anyone can get started with web archiving quickly, which is empowering both to web archivists/information professionals/librarians and their stakeholders (especially if stakeholders are given information about using Webrecorder themselves).
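Joining Webrecorder WARCs with crawler-generated WARCs typically comes down to merging their capture indexes. The sketch below uses simplified (url, timestamp, digest) tuples rather than real CDX(J) lines, which carry more fields; it deduplicates identical captures by content digest:

```python
def merge_indexes(crawler_idx, recorder_idx):
    """Merge two CDX-like capture indexes, dropping duplicate captures.

    Entries are simplified (url, timestamp, digest) tuples. A matching
    url+digest pair means the same content was captured twice, so only
    the first occurrence is kept.
    """
    seen = set()
    merged = []
    for entry in sorted(crawler_idx + recorder_idx):
        url, _, digest = entry
        if (url, digest) in seen:
            continue
        seen.add((url, digest))
        merged.append(entry)
    return merged

# Hypothetical example: one crawler capture, two Webrecorder captures,
# one of which duplicates the crawler's content.
crawler = [("http://example.com/", "20180101", "sha1:AAA")]
recorder = [("http://example.com/", "20180202", "sha1:AAA"),
            ("http://example.com/page", "20180202", "sha1:BBB")]
```

A production workflow would instead record the duplicate as a revisit record rather than discarding it, but the merge-and-dedupe shape is the same.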
Features due for release in April 2018 for describing and managing one’s collections in Webrecorder will be covered in addition to Webrecorder’s current core tool set. If the automated tools projected for development over the next six months are stable, their usage can also be addressed in this workshop.
This tutorial will be a mix of demos and hands-on activities accompanied by discussions. Materials can be delivered in units customized for audiences (e.g. those with experience in web archiving or participants who are new to web archiving).
3:30 pm - 5:30 pm WEB CURATOR TOOL & WARC
The Web Curator Tool relaunch 15:30-16:30
Ben O'Brien & Hanna Koppelaar (Tiakiwai I)
Ben O’Brien, National Library of New Zealand
Hanna Koppelaar, Koninklijke Bibliotheek - National Library of the Netherlands
This tutorial will highlight the new features of the Web Curator Tool (WCT), added from January 2018 onwards through collaboration between the National Library of New Zealand (NLNZ) and the Koninklijke Bibliotheek - National Library of the Netherlands (KB-NL). One of the themes from the collaboration has been to future proof the WCT. This involves learning the lessons from the previous development and recognising the advancements and trends occurring in the web archiving community. The objective is to get the WCT to a platform where it can keep pace with the requirements of archiving the modern web. The first step in that process was decoupling the integration with the old Heritrix 1.x web crawler, and allowing the WCT to harvest using the more modern Heritrix 3.x version. A proof of concept for this change was successfully developed and deployed by the NLNZ, and has been the basis for a joint development work plan. While it will primarily be a demonstration, the tutorial is intended to be an interactive session with the audience and a showcase of how to work collaboratively on opposite sides of the world.
The target audience for this tutorial is existing WCT users and entry-level organisations/institutions wanting to start web archiving. Some interest from the wider web archiving community would also be expected. Through the WCT support channels we regularly encounter people and institutions new to web archiving who want to try the WCT. It is often viewed as having a low technical barrier to general use, which we believe is important in bringing new participants to web archiving. Even after 10-plus years, the WCT remains one of the most common open-source enterprise solutions for web archiving.
Participant numbers estimated at 10-15.
In 2006 the NLNZ and the British Library developed the WCT, a collaborative open-source software project conducted under the auspices of the IIPC. The WCT manages the web harvesting workflow, from selecting, scoping and scheduling crawls, through to harvesting, quality assurance and archiving to a preservation repository. The NLNZ has used the WCT for its selective web archiving programme since January 2007. However, the software fell into a period of neglect, with mounting technical debt: most notably its tight integration with an outdated version of the Heritrix web crawler. While the WCT is still used day-to-day in various institutions, it had essentially reached its end-of-life, having fallen further and further behind the requirements for harvesting the modern web. The community of users has echoed these sentiments over the last few years.
During 2016/17 the NLNZ conducted a review of the WCT and how it fulfils business requirements, and compared the WCT to alternative software/services. The NLNZ concluded that the WCT was still the closest solution to meeting its requirements, provided the necessary upgrades could be made to it, namely a change to use the modern Heritrix 3 web crawler. Through a series of fortunate conversations the NLNZ discovered that another WCT user, the National Library of the Netherlands, was going through a similar review process and had reached the same conclusions. This led to collaborative development between the two institutions to uplift the WCT technically and functionally into a fit-for-purpose tool within these institutions’ respective web archiving programmes. The tutorial will cover:
• An overview of the WCT.
• A brief re-cap of the state of the WCT prior to development.
• Motivations for keeping the WCT and developing it further.
• Run through a basic setup of the WCT, demonstrating the improvements we have made to this process. This is important because, while the application itself was reasonably user-friendly, the installation and setup were less than straightforward, even for a system administrator. This was often a pain point for new users, who wanted to get started with web archiving but lacked the technical knowledge to set up the WCT.
• Demonstrate new WCT functionality and other improvements, with a particular focus on crawling with Heritrix 3 (and any other crawl tools - pending development).
• Discuss the future direction of WCT development and the current NLNZ/KB-NL work plan.
• Any further questions.
The WARC file format: preparing next steps 16:30-17:30
Sara Aubry (Tiakiwai I)
Sara Aubry, Bibliothèque nationale de France (BnF)
For more than 20 years, memory institutions have been collecting and keeping track of World Wide Web material using web-scale tools such as web crawlers. At the same time, these same organizations archive large numbers of born-digital and digitized files. The WARC (Web ARChive) file format was defined to support these activities: it is a container format that permits one file simply and safely to carry a very large number of constituent data objects of unrestricted type for the purposes of storage, management, and exchange. Today, the WARC file format is extensively used within the web archiving community to support applications for harvesting web resources and accessing web archives in a variety of ways.
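To make the container structure concrete, here is a minimal sketch of a single WARC "response" record assembled by hand; the URI, date, and record ID are invented placeholders, and a real crawler would of course write records rather than build them this way.

```python
# A WARC record is a block of named headers, a blank line, and a payload
# (here, a captured HTTP response), terminated by two CRLFs.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"

headers = "\r\n".join([
    "WARC/1.1",
    "WARC-Type: response",
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>",
    "WARC-Date: 2018-11-13T09:00:00Z",
    "WARC-Target-URI: http://example.com/",
    "Content-Type: application/http; msgtype=response",
    f"Content-Length: {len(payload)}",  # length of the payload in bytes
])

record = headers.encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"
print(record.decode("utf-8").splitlines()[0])  # → WARC/1.1
```

Because each record declares its own type and length, many records of unrestricted type can be concatenated into one file and still be read back safely.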
The WARC file format was initially released as an ISO international standard in May 2009 as ISO 28500:2009 (also known as WARC 1.0). As with all ISO standards, the WARC standard is periodically reviewed to ensure that it continues to meet the changing needs that emerge from practice. The first revision, supported by an IIPC task force and the subcommittee in charge of technical interoperability within the ISO information and documentation technical committee (ISO/TC46/SC4), was published in August 2017 as ISO 28500:2017 (also known as WARC 1.1). The next regular ISO vote to start another revision process is currently scheduled for 2020.
This discussion aims at gathering IIPC members interested in and working with the WARC format to inventory needs for either small or larger evolutions, share them within the group to identify common interests and start shaping the scope of the upcoming revision. Exchanges on IIPC Github and Slack channels will be used to prepare and structure the discussion before the face-to-face meeting.
3:30 pm - 5:30 pm COBWEB
Cobweb: collaborative collection development for web archives 15:30-16:00
Kathryn Stine & Peter Broadwell (Pipitea Street 1.14 & 1.15)
Kathryn Stine, California Digital Library
Peter Broadwell, University of California, Los Angeles
The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the capacity of any single institution. As such, collaborative approaches are necessary for the future of curating web archives, and their success relies on curators understanding what has already been, or is intended to be, archived, by whom, when, how often, and how. Collaborative web archiving projects such as the US End-of-Presidential-Term, IIPC CDG Olympics, and CA.gov (California state government) collecting endeavors demonstrate how curators working across multiple organizations in either coordinated efforts or direct partnerships, and with ad hoc collaboration methods, can accomplish much more than they might alone. With funding from the US Institute of Museum and Library Services, Cobweb (www.cdlib.org/services/cobweb/), a joint project of the California Digital Library, UCLA, and Harvard University, is a platform for supporting distributed web archive collecting projects, with an emphasis on complementary, coordinated, and collaborative collecting activities. Cobweb supports three key functions of collaborative collection development: suggesting nominations, asserting claims, and reporting holdings. This holdings information also supports a fourth Cobweb function, collection-level thematic search.
Curators establish thematic collecting projects in Cobweb and encourage nominators to suggest relevant seed websites as candidates for archiving. For any given collecting project, archival programs can claim their intention to capture a subset of nominated seeds. Once they have successfully captured seeds included in a given collecting project, descriptions of these holdings will become part of the Cobweb holdings registry. Cobweb interacts with external data sources to populate this registry, aggregating metadata about existing collections and crawled sites to support curators in planning future collecting activity and researchers in exploring descriptions of archived web resources useful to their research. Note that Cobweb is a metadata registry, rather than a repository or a collecting system; it aggregates and provides transparency regarding the independent web archiving activities of diverse and distributed archival programs and systems.
This presentation will include a walkthrough of the Cobweb platform, which is scheduled for production launch just prior to the IIPC General Assembly and Web Archiving Conference.
TUTORIAL: Using Cobweb to manage collaborative or complementary web archive collecting projects 16:30-17:30
Kathryn Stine (Pipitea Street 1.14 & 1.15)
Kathryn Stine, California Digital Library
With funding from the US Institute of Museum and Library Services, Cobweb, a joint project of the California Digital Library, UCLA, and Harvard University, is a platform for supporting thematic web archive collecting projects, with an emphasis on complementary, coordinated, and collaborative collecting activities. Cobweb is an open source, web-based community tool that is improved with community involvement and input. This tutorial therefore will provide participants with opportunities to explore and familiarize themselves with the Cobweb platform, establish sample collecting projects, and navigate the Cobweb registry of aggregated metadata about existing collections and crawled seeds held by archival programs across the world.
Cobweb supports three key functions of collaborative collection development: suggesting nominations, asserting claims, and reporting holdings. Curators establish thematic collecting projects in Cobweb and encourage nominators to suggest relevant seed web sites as candidates for archiving. For any given collecting project, archival programs can claim their intention to capture a subset of nominated seeds. Once they have successfully captured seeds included in a given collecting project, descriptions of these holdings will become part of the Cobweb holdings registry. Cobweb interacts with external data sources to populate this registry, which curators can then search and browse to inform their planning for future collecting activity and which researchers can use to explore descriptions of archived web resources useful to their research.
Participants can expect orientations to setting up Cobweb accounts; establishing and updating collecting projects; determining and setting approaches for soliciting nominations to their projects; assigning descriptive metadata to projects, nominations, and holdings; understanding metadata flows into and out of Cobweb; and advanced searching within and across the Cobweb registry. Some time will also be spent on exploring how Cobweb supports multi-participant communication within and across the activities involved in establishing and managing collecting projects. The tutorial facilitator will provide overviews of Cobweb documentation, how Cobweb relates to or interacts with complementary web archiving systems and tools, and the roadmap for continued maintenance and enhancement of the Cobweb platform.
6:00 pm - 11:00 pm CONFERENCE DINNER
Dinner will be hosted at Pencarrow Lodge. Dinner transport departs National Library at 6:00pm promptly.
Thursday 15 Nov 2018
8:45 am - 9:00 am WELCOME TO DAY 3
9:00 am - 10:00 am PLENARY: KEYNOTE
Archiving the future: law, technology, and practice at the web’s edge
Wendy Seltzer, W3C
Do parties have the right to be forgotten from a smart contract’s ledger? Will taxidermy preserve a Pokémon or a Cryptokitty outside its native habitat? How do we maintain historical records when “truth isn’t truth” and social networks claim sovereignty over the platforms from which they irregularly block users? Can we inoculate before copyright takedown-bots go viral?
By examining some of the challenges to archiving the near and possible futures, this talk aims to help us prepare. We’ll look at some bugs and features of current legal code, and consider law and policy recommendations to help keep the web open to archiving even as it changes.
Wendy Seltzer is Strategy Lead and Counsel to the World Wide Web Consortium (W3C) at MIT, improving the Web’s security, availability, and interoperability through standards. As a Fellow with Harvard’s Berkman Klein Center for Internet & Society, Wendy founded the Lumen Project (formerly Chilling Effects Clearinghouse), helping to measure the impact of legal takedown demands on the Internet. She seeks to improve technology policy in support of user-driven innovation and secure communication.
She serves on the Advisory Board of Simply Secure; served on the founding boards of the Tor Project and the Open Source Hardware Association, and on the boards of ICANN and the World Wide Web Foundation.
Wendy has been researching openness in intellectual property, innovation, privacy, and free expression online as a Fellow with Harvard’s Berkman Klein Center for Internet & Society, Yale Law School’s Information Society Project, Princeton University’s Center for Information Technology Policy and the University of Colorado’s Silicon Flatirons Center for Law, Technology, and Entrepreneurship in Boulder. She has taught Intellectual Property, Internet Law, Antitrust, Copyright, and Information Privacy at American University Washington College of Law, Northeastern Law School, and Brooklyn Law School and was a Visiting Fellow with the Oxford Internet Institute, teaching a joint course with the Said Business School, Media Strategies for a Networked World. Previously, she was a staff attorney with online civil liberties group Electronic Frontier Foundation, specializing in intellectual property and First Amendment issues, and a litigator with Kramer Levin Naftalis & Frankel.
Wendy speaks and writes on copyright, trademark, patent, open source, privacy and the public interest online. She has an A.B. from Harvard College and J.D. from Harvard Law School, and occasionally takes a break from legal code to program (Perl and MythTV).
10:00 am - 10:30 am LIGHTNING TALKS Chair: Martin Klein
It's There, But Can You Find It?: Usability Testing the Archive-It Public Interface
Alexis Antracoli, Princeton University
Sumitra Duncan, Frick Art Reference Library
Archive-It is used by governments, universities, and non-profit institutions in 17 countries. It is one of the most widely used web archiving tools in the world, but how easy is it for researchers to find what they are looking for in the discovery interface? Do researchers even know that the tool exists or how it differs from the Internet Archive? Seven archivists and librarians at six institutions in the United States spent the last year asking exactly these questions. Little is currently known about the usability of archival discovery systems, and even less about those devoted specifically to web archiving. This project filled in one of those knowledge gaps, answering questions about the usability of the Archive-It user interface for a range of users.
Representing Princeton, Temple, and Columbia universities, Swarthmore College, the Frick Art Reference Library, and the Delaware Public Archives, the team was constructed to include representatives from the range of institutions that make use of Archive-It as a web archiving tool. The team conducted usability tests with faculty, students, staff, and the general public at their respective institutions. Testing examined the ability of users to find specific archived websites using available search, browse, and sort features; to understand the language used by web archivists to describe websites (i.e. seed, capture); and to find groups of related websites and archived videos. The team also explored whether users knew about or had previously used archived websites in their work. After completing thirteen tests with at least two users representing each target audience (faculty, staff, students, and the general public) and each institution, the team coded and analyzed the results. This poster and talk will present both the methods and results of these tests, pinpointing areas where the Archive-It interface succeeds as well as specific suggestions for improvement.
Legal Deposit Legislation for Online Publications: Framing the Issues
Mark Boddington, Scientific Software and Systems Limited; Victoria University of Wellington
This lightning talk is about legal deposit legislation and what can be done to make it fit for purpose in the digital age. In many countries, including New Zealand, legal deposit law is struggling to keep up with new methods of publication enabled by the Internet. This is due to the shift from traditional publishing to self-publishing via online platforms, and to the more proactive stance taken by libraries towards the acquisition of legal deposit materials. The nature and volume of publications available on the Internet mean librarians do not expect to receive all qualifying material from online sources, and instead use web harvesting systems to collect content that is available on the open web. Although national legal deposit laws have been amended to enable this to occur, some problems remain.
When legal deposit libraries harvest blogs and digital commentary they encounter a variety of legal issues that fall outside the scope of legislation. These include jurisdictional barriers to the acquisition of content arising from the global nature of Internet services, and difficulty complying with online platforms' terms and conditions of access. Legal matters become even more fraught when libraries permit public access to their collections of online publications. This talk will expand on these topics and identify the underlying legal issues that need to be addressed to strengthen legal deposit laws.
Tracking the Evolution of Web Archiving Activity in the United States
Grace Thomas, Library of Congress
Maria Praetzellis, Internet Archive
Edward McCain, University of Missouri Libraries and Reynolds Journalism Institute
Matthew Farrell, Duke University
From 2011 to 2016, the National Digital Stewardship Alliance of the United States (NDSA) sponsored three surveys of web archiving activity throughout the United States as a longitudinal study of the evolution of web archiving practices and trends. The survey was conducted again in 2017, and the NDSA Web Archiving Survey Working Group is currently combing through the results in preparation for writing the final report by summer 2018.
In order to track the evolution of web archiving activity, the survey studies similarities and differences in programmatic approaches, types of content being archived, tools and services being used, access modes being provided, and emerging best practices and challenges. The reports have been instrumental in advocating for web archiving resources at various institutions throughout the United States. They also allow for a formation of community among practitioners often scattered by geographical location as well as archiving objectives.
Already in our results, we have seen the continued maturation of the profession, diversification in the types of organizations engaged in web archiving, and some stagnation in the key areas of staffing and digital preservation. We have also seen a greater distribution of tools used, with a surge in Webrecorder usage, often paired with Heritrix. For the first time in the history of the survey, URL search is now the most popular means of access to web archive content, surpassing browse lists and full-text search.
This poster will review additional key findings from the 2017 survey and contextualize the significance of the results within the landscape of United States web archiving activity created from all four surveys. The poster will feature visualizations of the results in order to highlight areas of growth and potentials for further advancement. The survey working group believes the results from the United States survey can launch discussions about similar efforts in other areas of the globe, how the results fit into the international landscape of web archiving activity, and additional areas of inquiry for future surveys.
iPRES 2020 introduction and cooperation & Chinese National Digital Preservation Programme for Scientific Literature
Zhenxin Wu & Xie Jing (Auditorium)
Zhenxin Wu & Xie Jing, National Science Library, Chinese Academy of Sciences
Poster 1- iPRES 2020 Introduction and Cooperation
An invitation from the Chinese Academy of Sciences (CAS) and eIFL (Electronic Information for Libraries) in 2003 provided the initial impetus for the iPRES series. Eight European experts in digital preservation contributed to the first iPRES conference in Beijing in July 2004. Since that first successful conference in Beijing, 15 iPRES conferences have been held across Asia, Europe, and America.
iPRES 2020, the 17th International Conference on Digital Preservation, will return to Beijing, where iPRES began in 2004. It will be held on 21–24 September 2020, the most beautiful season in Beijing. The National Science Library, Chinese Academy of Sciences (NSLC) and the National Science & Technology Library (NSTL) will co-host iPRES 2020.
The central theme of iPRES 2020 is “Empowering Digital Preservation for the Enriched Digital Ecosystem –meeting the preservation challenges of the evolving types of digital content”.
We welcome your suggestions, proposals, and collaboration to make iPRES 2020 a success. Please keep an eye on our website: http://ipres2020.cn
Poster 2 - Chinese National Digital Preservation Programme for Scientific Literature
NDPP China is funded by the National Science & Technology Library under the Ministry of Science and Technology, China. It is a cooperative system in which more than 200 research and academic libraries participate, operating with multiple preservation nodes at major institutions.
NDPP China aims to preserve, in mainland China, digital scientific publications, including journals, books, patents, proceedings, reference works, and rich-media publications, from major commercial and society publishers inside and outside China.
After four years' work, we have set up strategy-guided selection principles, formed a cooperative responsibility system, developed OAIS-based trusted technical platforms, implemented trusted archival management, carried out standards-based auditing and certification, and provided trigger-based service management. Currently the NDPP covers more than 13,000 e-journals from 10 publishers.
NDPP strives to promote research and practice of digital resource preservation in China, and also looks forward to exchanges and cooperation with international experts and institutions.
10:30 am - 11:00 am MORNING TEA
11:00 am - 12:30 pm SESSION 7
11:00 am - 12:30 pm THEMATIC COLLECTING INITIATIVES Chair: Alex Thurman
One year down: taking the Ivy Plus Libraries web resources collection program from pilot to permanent 11:00-11:30
Samantha Abrams, Ivy Plus Libraries
First established as a Mellon-funded project in 2013, the Web Resources Collection Program within Ivy Plus Libraries now finds itself at the end of its inaugural year as a permanent program. Founded and funded to explore the bounds of collaboration and web archiving, the Ivy Plus Libraries Web Collection Program bills itself as a collaborative collection development effort established to build curated, thematic collections of freely available, but at-risk, web content in order to support research at participating Libraries and beyond. Part of a partnership that stretches across the United States, Ivy Plus Libraries includes: Brown University, the University of Chicago, Columbia University, Cornell University, Dartmouth College, Duke University, Harvard University, Johns Hopkins University, the Massachusetts Institute of Technology, the University of Pennsylvania, Princeton University, Stanford University, and Yale University.
In order to successfully transition from its uncertain, grant-funded state, Ivy Plus Libraries focused on growth: in its first year, the Program hired both full-time and part-time staff members dedicated to web archiving, documented and created collection policies and communication strategies, and both expanded two pilot collections and built brand-new, selector-curated collections. This presentation, delivered by the Program’s web archivist, focuses on these efforts and discusses the painstaking process of creating a workable and centralized collaborative web archiving program. It shares the Program’s successes and the areas in which it seeks to improve, touching on topics including outreach and securing program buy-in; working with and educating subject specialists in order to create new and evolving collections; and shaping the Program’s reach and objectives, in addition to what the day-to-day work of web archiving — crawling, quality assurance, and metadata creation, to name a few — looks like when carried out on behalf of stakeholders at thirteen prestigious institutions.
No longer a pilot program, where has Ivy Plus Libraries succeeded and where can it continue to improve? What does web archiving look like in this collaborative state, and where might it take the partnership — and similar collaborative projects around the globe — as the Program embarks upon its second year?
Web Content in Anthropology: Lessons Learned for a Step in the Right Direction 11:30-12:00
Wachiraporn Klungthanaboon & Sittisak Rungcharoensuksri (Auditorium)
Chindarat Berpan, Chulalongkorn University
Wachiraporn Klungthanaboon, Chulalongkorn University
Sittisak Rungcharoensuksri, The Princess Maha Chakri Sirindhorn Anthropology Centre
The Princess Maha Chakri Sirindhorn Anthropology Centre (SAC), a leading research centre in Thailand in the disciplines of anthropology, history, archaeology and the arts, understands the significance of having information in hand for further research and development. Recognizing the vast amount of web content provided by key organizations at the local and global levels, the dispersal of online anthropological information, and the disadvantages of unavailable and inaccessible information, the SAC is attempting to archive online anthropological information in order to ensure long-term access. To assess the feasibility of a web archiving initiative, the SAC started with a four-stage preliminary study:
1) explore the websites of selected organizations in the discipline of anthropology in Thailand and Southeast Asia;
2) analyze web content on the selected websites in terms of genres, file formats, and subjects;
3) compare and contrast web archiving tools and metadata schema by investigating key web archiving initiatives in Asia-Pacific region;
4) identify considerations and challenges of web archiving in general.
The results of this preliminary study will provide useful information for the SAC to proceed to the next stage – policymaking and collaboration seeking. This presentation will review the project and deliver some lessons learned from the preliminary study.
Peeling back the onion (domes) in the North Caucasus: multi-layered obstacles to effective web-based research on marginalized ethnic groups 12:00-12:30
Kit Condill, University of Illinois at Urbana-Champaign
The North Caucasus is a remarkably multilingual and multi-ethnic region located on the southern borders of the Russian Federation. The predominantly-Muslim peoples of the region (which, for the purposes of this paper, includes Chechnya, the breakaway Georgian republics of South Ossetia and Abkhazia, and many other neighboring territories) have a fascinating and diverse online presence, generated by communities both within Russia and in emigration. Building on previous research into the vernacular-language online media environment of this strategic and conflict-prone region, my paper will argue for the importance of preserving online content in languages such as Chechen, Ossetian, Abkhaz, Avar, Kumyk, and Karachai-Balkar. The many-layered difficulties associated with establishing a web archive for the North Caucasus — and with making its contents easily and permanently accessible for scholars — will be outlined and discussed, along with possible solutions to these problems.
It will also be demonstrated that, despite the North Caucasus’ inherent intellectual appeal as an ancient melting pot of cultures, languages and religions (and its current significance in, among other arenas, the global battle against Islamic extremism), the region is woefully under-studied in English-speaking countries, and locally-produced vernacular-language sources are almost never used in contemporary scholarship. The creation of a web archive dedicated to the North Caucasus, therefore, would be an important step toward encouraging researchers to make systematic use of an already-existing corpus of primary-source material that is both substantial and readily (if not permanently) available. Presumably, North-Caucasus-related websites whose political or social stances and circumstances render them particularly vulnerable to disruption and content loss are (or should be) of particular interest to scholars, and these sites will be identified as priorities for future web-archiving efforts.
Tackling the problem of preserving web content in the languages of the North Caucasus also has broader implications, raising questions such as how well online content in minority languages is being preserved, the relationship between statehood/sovereignty and the feasibility of comprehensive web preservation efforts, and the role of web archiving in cultural and linguistic preservation.
11:00 am - 12:30 pm OPERATIONAL WORKFLOWS Chair: Nicholas Taylor
Web archiving guide for governmental agencies 11:00-11:30
Suzi Szabo, Nationaal Archief
At the Dutch National Archives we are aware of the risks of losing web information due to a lack of proper guidelines and best practices. In light of this, we have recently published a Web Archiving Guide, providing Dutch governmental organizations with a better understanding of the records management requirements for archiving websites, and the tools and advice to do so in practice. This helps to ensure the sustainable accessibility of web information.
Because of the way responsibilities are shared between Dutch governmental organizations and public archives under the Public Records Act, governmental agencies – and not archival institutions – are responsible for archiving the records they produce, which includes websites. To ensure that the information on Dutch governmental websites is and stays accessible for the purposes of government accountability, business management, law and evidence, and future research, the National Archives has developed a guide with requirements for website archiving. The guide provides a set of requirements focusing on the responsibilities, process and result of website archiving for governmental agencies. It is based on the premise that harvesting is outsourced to a commercial party and that governmental websites are retained permanently. It also provides a roadmap describing all the steps in the process, from preparation for harvesting, to the actual harvesting by a third party, and eventually the transfer of the archived website to a public archive. Emphasis is put on the relevant stakeholders and their roles and responsibilities throughout the entire process. We will expand the scope of the guide in the future to eventually cover interactive websites and even social media.
In order to ensure usability and adoption, the guide was created in co-operation with the intended users, who took part in a large-scale public review of the draft version. Simultaneously, a nationwide implementation project has been started for Dutch central governmental agencies.
Smart Routing of Memento Requests 11:30-12:00
Martin Klein, Lyudmila Balakireva, Harihar Shankar, James Powell & Herbert Van De Sompel
Los Alamos National Laboratory, Research Library
The Memento protocol provides a uniform approach to querying individual web archives. We have introduced the Memento Aggregator infrastructure to support distributed search and discovery of archived web resources (Mementos) across multiple web archives simultaneously. Given a request with an original URI and a preferred datetime, the Aggregator issues one request to each of the currently 22 Memento-compliant archives and determines the best result from the individual responses. As the number of web archives grows, this distributed search approach is increasingly challenged. Varying network speeds and computational resources on the archives' end make delivering such aggregate results with consistently low response times more and more difficult.
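The datetime negotiation underlying such a query can be sketched in a few lines: a client asks a TimeGate for an archived copy of an original URI near a preferred datetime via the Accept-Datetime header defined in the Memento RFC (7089). The endpoint URL pattern below is illustrative, and the request is only constructed here, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request

original_uri = "http://example.com/"
preferred = datetime(2018, 11, 13, tzinfo=timezone.utc)

# Build (but do not send) a TimeGate request carrying Accept-Datetime.
req = urllib.request.Request(
    "http://timetravel.mementoweb.org/timegate/" + original_uri,
    headers={"Accept-Datetime": format_datetime(preferred, usegmt=True)},
)
print(req.get_header("Accept-datetime"))  # → Tue, 13 Nov 2018 00:00:00 GMT
```

A compliant TimeGate would answer with a redirect to the best Memento, carrying a Memento-Datetime header and Link relations to the original resource and its TimeMap.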
In order to optimize query routing and thereby lower the burden on archives that are unlikely to hold a suitable Memento, we previously conducted research, in part supported by the IIPC, to profile web archives and their holdings. This work was based on the premise that if we knew which archive holds which URIs, we could make the Aggregator smarter. However, the sheer scale of archive holdings makes profiling a very time- and resource-intensive endeavor. In addition, the constantly changing indexes of web archives require frequent re-profiling to reflect the archives' current holdings. These insights led to the conclusion that our profiling approaches were impractical and unsuited for deployment.
In this presentation we report on a lightweight, scalable, and efficient alternative to achieve smart request routing. This method is based on binary, archive-specific classifiers generated from log files of our Aggregator. We therefore get our “profiling” information from real usage data and apply the classifiers to determine whether or not to query an archive for a given URI. Our approach has been in production at the Memento TimeTravel service and related APIs for an extended period of time, which enables us to report on long-term performance evaluations. In addition, we report on further explorations using neural networks for smart request routing.
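The routing idea can be illustrated with a deliberately simplistic sketch: the "classifier" below merely remembers which top-level domains each archive has previously answered positively for in hypothetical aggregator logs, standing in for the real trained classifiers, and only archives predicted as likely holders are queried.

```python
from urllib.parse import urlparse

# Hypothetical log entries: (archive, requested URI, did the archive hold it?)
log = [
    ("archive-a", "http://example.nz/page", True),
    ("archive-a", "http://example.com/page", False),
    ("archive-b", "http://example.com/news", True),
]

def tld(uri):
    """Extract the URI's top-level domain, the only 'feature' used here."""
    return urlparse(uri).hostname.rsplit(".", 1)[-1]

# "Training": remember the TLDs each archive has answered positively for.
positive_tlds = {}
for archive, uri, hit in log:
    if hit:
        positive_tlds.setdefault(archive, set()).add(tld(uri))

def route(uri):
    """Return only the archives predicted likely to hold this URI."""
    return [a for a, tlds in positive_tlds.items() if tld(uri) in tlds]

print(route("http://beehive.govt.nz/"))  # → ['archive-a']
```

The real system replaces this lookup with per-archive binary classifiers learned from usage data, but the routing decision has the same shape: skip the archives whose classifier says "probably not held".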
Our approach works to the benefit of both sides: the provider of the Aggregator infrastructure benefits as unnecessary requests are held to a minimum and responses can be provided more rapidly, and web archives benefit as they are not burdened with requests for Mementos they likely do not have. Given these advantages, we consider our method an essential contribution to the web archiving community.
A workflow for indexing and searching large-scale web archive content using limited resources 12:30-13:00
Sara Elshobaky & Youssef Eldakar, Bibliotheca Alexandrina
The primary challenge in making web archives searchable is their scale. The problem is especially acute for the Bibliotheca Alexandrina (BA), which holds more than 1 PB of archived web content. Currently, the BA archive still lacks full-text search capabilities. As learned from other web archives' full-text search experiences, a powerful machine is needed to build the search index, in addition to a cluster of machines to host the resulting indices. In this work, we present a workflow that automates the indexing process within a reasonable time frame, while tuning the parameters for machines with limited resources.
At the BA we make use of our older High-Performance Computing (HPC) cluster to build Solr indices for our web archive. It is not powerful by today's standards, but it has up to 130 compute nodes, each with 8 GB of RAM, and a total of only 13 TB of storage. Hence, we are able to host on it up to 10 Solr nodes to build the indices before relocating them to their final destination. Meanwhile, the BA is migrating its web archive content to a new 80-node cluster over the network, which significantly delays the whole process.
Manually managing and monitoring all of these processes at the same time would be very cumbersome. Hence, we propose an automated workflow that controls the indexing of each file from the point it is stored on the web archive cluster until it is made searchable through the web interface. The workflow consists of two separate parts that coordinate through a shared database. The first part comprises the processes that control the Solr indices: they manage building the 10 Solr indices in parallel on the HPC cluster. Once an index is finalized, its content is moved to its final destination in the permanent SolrCloud; the index is then reset and registered as a new one. The second part comprises the processes running on the web archive cluster. Each process watches for new files on a specific drive and parses their content using the warc-indexer tool implemented by the UK Web Archive. It then queries the database for the least-loaded Solr node to index the file's content. All processes in the workflow log their status in the database, both to coordinate with each other and to allow rebuilding the index when needed.
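The coordination step can be sketched as follows. This is an illustrative toy, not the BA's actual code: the schema and function names are invented, and an in-memory SQLite database stands in for the shared database that the real workflow processes use.

```python
import sqlite3

def setup_db():
    """Create the shared coordination database (in-memory for this sketch)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (node TEXT PRIMARY KEY, load INTEGER)")
    db.execute("CREATE TABLE log (warc TEXT, node TEXT, status TEXT)")
    # Ten Solr build nodes, as on the BA's HPC cluster
    db.executemany("INSERT INTO nodes VALUES (?, 0)",
                   [(f"solr{i}",) for i in range(10)])
    return db

def assign_warc(db, warc_path):
    """Assign a newly discovered (W)ARC file to the least-loaded Solr node
    and log the assignment so the other workflow processes can coordinate."""
    (node,) = db.execute(
        "SELECT node FROM nodes ORDER BY load ASC, node ASC LIMIT 1"
    ).fetchone()
    db.execute("UPDATE nodes SET load = load + 1 WHERE node = ?", (node,))
    db.execute("INSERT INTO log VALUES (?, ?, 'queued')", (warc_path, node))
    db.commit()
    return node
```

In the real workflow, "load" would track bytes or documents indexed rather than file counts, and the status log also supports rebuilding the index later.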
Theoretically, building a SolrCloud for the 1 PB of web archive files requires 100 Solr nodes with 1000 GB of capacity each. Currently, the BA has at its disposal part of that hardware, recycled from older web archive machines. Each machine has only 2 GB of memory. We overcame this limitation by revisiting the Solr schema and identifying the search features that overwhelm the memory. The result is a responsive full-text search over the whole content, with a limited feature set.
12:30 pm - 1:30 pm LUNCH
1:30 pm - 3:00 pm SESSION 8
1:30 pm - 3:00 pm LEVERAGING ARCHIVED CONTENT Chair: Samantha Abrams
Sifting needles out of (well-formed) haystacks: using LOCKSS plugins for web archive metadata extraction 13:30-14:00
Thib Guicherd-Callin, Stanford Libraries
As the volume of web archives has grown and web archiving has matured from a supplementary to an increasingly essential mechanism for collection development, there has been growing attention to the challenge of curating that content at scale. National libraries engaged in national domain-scale harvesting are envisioning workflows to identify and meaningfully process the online successors to the offline documents they historically curated. Elsewhere, there is increasing interest in applying artificial intelligence to making sense of digital collections, including archived web materials. Where automated or semi-automated technologies are not yet adequate, crowd-sourcing also remains a strategy for scaling curation of granular objects within web archives.
The LOCKSS Program has developed significant expertise and tooling for identifying and parsing metadata for web-accessible objects in the domain of scholarly works: electronic journals and books, both subscription and open access, as well as government information. This is enabled by means of a highly flexible plugin architecture in the LOCKSS software, which augments the traditional crawl configuration options of an archival harvester like Heritrix with additional functionality focused on content discovery, content filtering, logical normalization, and metadata extraction. A LOCKSS plugin specified for a given publishing platform or content management system encodes the rules for how a harvester can parse its content, allowing for extraction of bibliographic metadata from, e.g., HTML, PDF, RIS, XML, and other formats.
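A hypothetical, greatly simplified analogue of such plugin rules can be sketched in a few lines. Real LOCKSS plugins are XML specifications interpreted by the LOCKSS daemon, not Python; the field names and regex here are illustrative only, showing how per-platform rules map page markup to bibliographic fields.

```python
import re

# Invented mapping from HTML <meta> tag names to bibliographic fields,
# standing in for the per-platform rules a LOCKSS plugin encodes.
FIELD_MAP = {
    "citation_title": "title",
    "citation_author": "author",
    "citation_doi": "doi",
}

META_RE = re.compile(
    r'<meta\s+name="([^"]+)"\s+content="([^"]+)"', re.IGNORECASE)

def extract_metadata(html: str) -> dict:
    """Pull bibliographic metadata out of a page's <meta> tags,
    keeping only the fields the plugin's rules know about."""
    record = {}
    for name, content in META_RE.findall(html):
        if name in FIELD_MAP:
            record[FIELD_MAP[name]] = content
    return record
```

A plugin for a different publishing platform would simply swap in a different mapping and match rules, which is the flexibility the architecture provides.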
While the LOCKSS software has always fundamentally been a web preservation system, it has largely evolved in parallel with the tools and approaches of the larger web archiving community. However, a major re-architecture effort is currently underway that will bring the two into much closer alignment. The LOCKSS software is incorporating many of the technologies used by the web archiving community as it is re-implemented as a set of modular web services. The increased participation of LOCKSS in the broader community should bolster sustainability, but a more promising possibility is for cross-pollination of technical capabilities, with metadata extraction a component of probable interest to many web archiving initiatives.
This presentation will detail the capabilities of the LOCKSS plugin architecture, with examples of how it has been applied for LOCKSS use cases, how it will work as a standalone web service, and discussion with the audience of where and how such capabilities might be applied for broader web archiving use cases.
Your web archives are your everything archives 14:00-14:30
Jefferson Bailey, Internet Archive
As the web continues to consume all media — from publishing to video to music — large-scale web harvests are collecting a rich corpus of news, government publications, creative works, scholarly research, and other materials that formerly had their own defined dissemination and consumption frameworks. Large-scale harvesting efforts, however, are intentionally designed for breadth, scale, speed, and “content agnostic” collecting methods that treat different types of works similarly. Relatedly, the URL-centric nature of web collecting and access can limit how we curate and enable discovery of the materials within our web archives. Web archives themselves, as they grow, are evermore an agglomeration of diverse material whose only shared trait is publication via the web. How can we increase our knowledge of what is contained within the web collections we are building? And how can we gain a better understanding of what we have collected in order to inform improved description, curation, access and collection strategies?
This presentation will detail a number of both in-production and research and development projects by Internet Archive and international partners aimed at building strategies, tools, and systems for identifying, improving, and enhancing discovery of specific collections within large-scale web collections. It will outline new methods to situate and enhance the valuable content already collected in general web collections and implement automated systems to ensure future materials are well-collected and, when possible, are associated with the appropriate context and metadata. This work spans search, data mining, identifier association, integration with publishers, registries, and creator communities, machine learning, and technology and partnership development. The talk will outline potential approaches for moving from a concept of “archives of the web” to one of “archives from the web.”
How to harvest born digital conspiracy theories: webarchiving Dutch digital culture in the Post-truth era 14:30-15:00
Kees Teszelszky, Koninklijke Bibliotheek
According to the Oxford Dictionaries, "post-truth" denotes circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief. These circumstances have become especially visible on the web in our time, as hate speech, extreme political beliefs and alternative facts flood web fora and social media. Jelle van Buuren (Leiden University) completed PhD research on conspiracy theories on the web (2016), addressing the question of whether conspiracy constructions fuel hatred against the political system. The online discourse in which these conspiracy theories take shape can be characterized as "post-truth".
The Koninklijke Bibliotheek – National Library of the Netherlands (KB-NL) has collected born-digital material from the web since 2007 through web archiving. It makes a selection of websites with cultural and academic content from the Dutch national web. Most of the sites were harvested because of their value as cultural heritage. Due to these past selection criteria, many of the harvested websites are about history and heritage. As such, they are less suitable as primary sources for future historical research on our own digital age. This all changed with the new selection policy, which also includes conspiracy theories and other typically post-truth phenomena from the Dutch web.
I will describe the methods used and our experience with selecting, harvesting and archiving websites of the post-truth era. I will also discuss the characteristics of web materials and archived web materials, and explain the use of these various materials (harvested websites, link clouds, context information) for digital humanities research.
Furthermore, I will describe the challenges of web archiving in a country without legal deposit. I will also argue that the growing variety of digital materials and processes from the web calls for a reinterpretation of primary historical sources, raising the question of what can be regarded as an authentic born-digital source of our time.
1:30 pm - 3:00 pm INSTITUTIONAL PROGRAM TRANSITIONS Chair: Nicholas Taylor
Arquivo.pt: taking a web archive to the next level 13:30-14:00
João Gomes, Arquivo.pt at FCCN-FCT
Arquivo.pt is now 10 years old and holds content from 22 years of the Portuguese web, with a mature technological architecture, a powerful full-text search engine and several innovative services.
To reach its full potential and applicability, an archive needs to be live and widely used. To achieve this, Arquivo.pt has put in place several dissemination activities so that it becomes known and used by the community, mainly among researchers, students and journalists.
This talk will focus on the planned and executed dissemination, advertising and training activities, such as:
- Training sessions for journalists
- Digital Marketing campaigns
- Grants for researchers working with Arquivo.pt assets
- The annual Arquivo.pt Prize 2018, with €15K in prizes
- Production of videos about best-practices
- Time travel on Media Sites anniversary
- Press Releases and Media Partnerships
These activities have greatly improved awareness and usage of Arquivo.pt.
Expansion and exploration in 2018: processing the Library of Congress web archive 14:00-14:30
Abbie Grotke & Grace Thomas, Library of Congress
The Library of Congress Web Archive selects, preserves, and provides access to archived web content selected by subject experts from across the Library, so that it will be available for researchers today and in the future. The Library’s Digital Collecting Plan, produced in February 2017, calls for an expansion of the use of web archiving to acquire digital content. Our talk will focus on how the Library of Congress plans to accomplish this by expanding selective crawling practices and simultaneously scaling up technical infrastructure to better support program operations.
Web archiving efforts at the Library of Congress have, up to now, focused on highly selective collecting of specific, thematic collections proposed and described by subject specialists. An expansion of web archiving will require both enlisting additional subject specialists to engage in web archive collection development, and for those already engaged to broaden their web archiving selection to additional themes and subjects. The Library is also currently tackling the backlog of web archives in need of descriptive records for presentation on the Library’s website.
Expanding web archiving at the Library will also require finding solutions to analyze, process, and manage huge quantities of content. With over 1.3 PB of content and a current growth rate of more than 300 TB per year, the sheer size of the archive has begun to present technical challenges, particularly with rendering content on the public Wayback Machine and delivering research ready data to scholars. In 2018, new cloud infrastructure was made available to the Web Archiving Team for processing the archive. With this new capability, the team is exploring a variety of projects, including experimenting with alternate index models, generating multiple types of derivative files to gauge research engagement with the web archives content, and running analyses over the entire archive to gain deeper understanding of the content held within the collections.
In the coming months, the Library of Congress will ingest the web archive into the cloud and test new processes for managing the web archive at scale, and will be able to share stories of triumphs and challenges from this crucial transition with the greater web archiving community.
Working on a dream: the National Library of Ireland’s Web Archive 14:30-15:00
Maria Ryan, National Library of Ireland
At a time when the National Library of Ireland (NLI) is undergoing a physical transformation as part of its re-imagining the library building programme, it is also changing its approach to online collecting, including developing its web archiving programme. The NLI has transitioned from a resource-limited, selective-only model of web archiving to a larger-scale process that now includes domain web archiving. This presentation will examine the development of the web archive over seven years, from a pilot project in selective web archiving to an established collection within the library, all while struggling with limited resources and inadequate legislation. It will also examine the wider implications of this change in web archiving strategy for the NLI web archive.
Despite the lack of adequate legal deposit legislation for web archiving or digital publications, the NLI has a mandate to collect and protect the recorded memory of Ireland, a record that is now increasingly online. In 2017, additional resourcing in the form of a full-time web archivist was secured and the NLI launched a domain crawl of the Irish web. Working with the Internet Archive, it carried out a crawl of the Irish top-level domain, relevant domains hosted in Ireland and websites in the Irish language. In total, 39 TB of data was captured, due for release in 2018. This data is a unique resource in Ireland and, together with a previous 2007 crawl that will also be made available, it offers researchers a decade-long view of the Irish internet. The NLI now intends, resources permitting, to carry out a domain crawl each year.
The addition of domain crawling has not only allowed the NLI to increase the amount of material it preserves but has also given it greater freedom to develop its selective web archive. The web archive policy in the past attempted to gather material as diverse as possible while adequately reflecting important current events. With limited resources, however, this often meant archiving small collections on several different topics. Domain crawling allows a more representative body of data to be captured, while selective collections can focus on key events such as elections and referendums. The change in web archiving strategy has prompted a revision of the NLI's web archiving policy. This presentation will examine the development of the web archive and the challenges and consequences of developing web archiving in the NLI's own context of limited resources.
3:00 pm - 3:30 pm AFTERNOON TEA
No workshops in this session.
3:30 pm - 4:30 pm SESSION 9
3:30 pm - 4:30 pm TEXT MINING & INDEXING Chair: Kia Siang Hock
Néonaute: mining web archives for linguistic analysis 15:30-16:00
Peter Stirling & Sara Aubry – Tiakiwai
Emmanuel Cartier, Laboratoire d’Informatique de Paris Nord – Université Paris 13
Peter Stirling, Bibliothèque nationale de France (BnF)
Sara Aubry, Bibliothèque nationale de France (BnF)
Néonaute is a project that seeks to study the use of neologisms in French using the web archive collections of the BnF. Initially a one-year project funded by the French Ministry of Culture, it uses a corpus drawn from the daily crawl of around one hundred news sites carried out by the BnF since December 2010. Building on the existing projects Neoveille and Logoscope which seek to detect and track the life-cycle of neologisms, Néonaute aims to use web archives to study the use of neologisms over time.
The objective of the project is to create a search engine, Néonaute, that allows researchers to analyse the occurrence of terms within the collection, with enriched information on the context of use (morphosyntactic analysis) and additional metadata (named entities, themes). Several specific use cases are included in the project:
• analysis of the life-cycle of previously identified neologisms;
• comparative use of terms recommended by the DGLFLF, the body in charge of linguistic policy in France, versus terms already in circulation (especially Anglicisms);
• use of feminine-gender terms over the period.
The search engine interface is complemented with an interactive visualization module that allows users to explore the lifecycle of terms over the period, according to various parameters (themes of articles, journals, named entities implied, etc.).
Néonaute is based on the full-text indexing of the news collection carried out by the BnF, which represents 900 million files and 11TB of data. The presentation will discuss the technical challenges and the solutions adopted.
The project is one of an increasing number of uses of the BnF web archives that seek to use techniques of text and data mining (TDM), and the first to use linguistic analysis. Under legal deposit and intellectual property legislation only certain metadata can be used outside the library premises and this kind of project is carried out under research agreements that fix the conditions of use of the collections while respecting the relevant legislation. The presentation will also discuss the issues that the BnF faces in allowing researchers to use such methods on the web archives.
Demo of the SolrWayback search interface, tools and playback engine for WARCs 16:00-16:30
Anders Klindt Myrvoll – Tiakiwai
Thomas Egense & Anders Klindt Myrvoll, Royal Library, Denmark
SolrWayback is an open-source web application for searching and viewing ARC/WARC files. It is both a search interface and a viewer for historical webpages. The ARC/WARC files must first be indexed using the British Library's webarchive-discovery/warc-indexer framework. Features include:
- Free-text search across all MIME types
- Image search (similar to Google Images)
- Image search by GPS location using EXIF metadata.
- Graph tools (domain link graphs, statistics etc.)
- Streaming export of search results to a new WARC file. Can be used to extract a corpus from a collection.
- Screenshot previews of a URL across different harvest times.
- See harvest times for all resources on a webpage.
- Upload a resource (image etc.) to see if it exists in the corpus.
- Built-in SOCKS proxy to prevent leakage to the live web when viewing webpages.
- Out-of-the-box solution for researchers to explore ARC/WARC files.
- Easy to install and use on Mac, Linux and Windows. Contains webserver, Solr and warc-indexing tool. Just drop ARC/WARCs into a folder and start exploring the corpus.
SolrWayback on Github with screenshots: https://github.com/netarchivesuite/solrwayback
3:30 pm - 4:30 pm STUDYING WEB OBJECTS Chair: Paul Koerbin
What can tiny, transparent GIFs from the 1990s teach us about the future of access and use of web archives? 15:30-16:00
Grace Thomas & Trevor Owens, Library of Congress
While the size and invisible nature of this particular resource make it seem insignificant, single-pixel, transparent GIFs were an integral element of early web design. Their presence in web archives studied over time lends insight into the history of the web and can inform our future ability to come to understand our digital past.
For this case study, we used a small, curated set of single-pixel GIFs featured in Olia Lialina's 2013 online art exhibit based on the GeoCities archive. In the exhibit, the ten transparent GIFs, two of which no longer resolve, are wrapped in frames and set against a tropical backdrop to show that, although invisible, they still take up space. We first assigned a digital fingerprint to the ten GIFs by computing an MD5 cryptographic hash for each file, discovering that, out of the original ten, there were only seven distinct files. We then used this unique identifier to search for the earliest appearances of the seven files in the Library of Congress web archive and the Internet Archive, respectively. In turn, Andy Jackson generously traced the seven GIFs throughout the UK Web Archive, which revealed fascinating trends over time.
Our study is a first step in understanding how single-pixel GIFs were used across the web over time. However, we intend to present the paper not only as a study of one resource, but also as an account of how the study of a single resource revealed many more aspects of the archiving and usage of born-digital materials. We will share our conclusions about hashing as a method of studying resources in web archives, as well as our skepticism about whether results are indicative of the "true" live web at the time of archiving, rather than revealing collection practices. Additionally, we will call for more comprehensive notes on scoping and crawling decisions, for content processing of an organization's own web archive by its archivists, and for recognition of the value of multiple web archiving initiatives collecting the same websites.
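The fingerprinting step described above amounts to hashing each file's bytes and grouping identical digests. A minimal sketch (function names are our own, not from the study) looks like this:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """MD5 digest used as a digital fingerprint for a file's content."""
    return hashlib.md5(data).hexdigest()

def distinct_groups(files):
    """Group files, given as (name, content-bytes) pairs, by content hash.

    Returns {digest: [names]}, so byte-identical files collapse into
    one group regardless of filename or origin URL.
    """
    groups = {}
    for name, data in files:
        groups.setdefault(fingerprint(data), []).append(name)
    return groups
```

The resulting digests can then be used as stable lookup keys when searching different web archives' indexes for the earliest capture of each distinct file.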
“Who by fire”: lifespans of websites from a web archive perspective 16:00-16:30
Russell Latham, National Library of Australia
Web archives provide a valuable resource for researchers by giving them a contemporary snapshot of original online resources. The value of archived content increases as the original is altered, migrated or taken offline. These changes occur dynamically on the web: sometimes change is quick, but sometimes it is slow, with the website evolving through iterative updates until it eventually becomes unrecognisable from the archived copy. Some studies state that websites have a lifespan of only 40 to 100 days. If true, this would mean that most websites evolve very quickly after archiving. As any web archivist can tell you, however, some websites have remarkable longevity and can remain live, accessible and little changed for many years. When can a website be considered gone, or, to use an anthropomorphised term, 'dead'? At its simplest, a website is dead when its URI (or domain) vanishes, along with all its hosted content. For many websites, however, the story is not so simple and they fall into a grey area. A URI may remain stable while the content on the website changes, or the content may remain but migrate to a new URI. In both these scenarios the website has changed substantially, but is this enough to say the website has ceased to exist and is 'dead'?
In this presentation, I will look at the key characteristics of the lifespan of a website and its eventual 'death'. I will also examine whether a typology can be applied that allows curators to identify the websites most at risk of disappearing. I will do this by first examining a sample of the National Library of Australia's twenty-two-year-old Pandora archive to find trends that lead to the end of a website's life. Second, I will apply this quantitatively to the NLA web archives to see whether websites disappear by sudden death or slow decay.
By examining the lifecycle of a website we will better understand the critical junctures of a website's existence online and, by virtue of this, give curators a better understanding of timing when determining harvest schedules. Typologies help curators predict the likely future path of a website and allow them to take appropriate preservation actions ahead of time. As curators we are also always trying to determine what within our collection is unique and what is no longer available elsewhere; by being able to pinpoint the end point of a website, we are better placed to answer this question.
4:30 pm - 4:45 pm PLENARY: CLOSING REMARKS
Steve Knight, National Library of New Zealand