The IIPC funds technical and educational projects based on the goals outlined in an annual Request for Proposals and the Strategic Action Plan. The consortium also collaborates on research and development projects by sharing data and testing tools. Task forces are formed to study and make recommendations on specific issues or problems. Working Groups and Portfolios also sponsor their own projects and work packages.
CURRENT PROJECTS
COMPLETED PROJECTS
Discretionary Funding Programme 2021-2022
Game Walkthroughs and Web Archiving
Project lead: Michael L. Nelson, Old Dominion University Department of Computer Science
Project partners: Martin Klein, Los Alamos National Laboratory (LANL) Research Library
Funding: 10,000 USD
The goal of this project is to explore possible synergy between gaming concepts, platforms, and technologies and those of web archiving.
Discretionary Funding Programme 2020-2021
Developing Bloom Filters for Web Archives’ Holdings
Project lead: Martin Klein, Los Alamos National Laboratory (LANL) Research Library
Project partners: National and University Library in Zagreb (NSK)
Funding: 24,741 USD
The aim of the project is to develop a framework for web archives to create Bloom filters based on their holdings of archived web resources. A Bloom filter can be thought of as a sitemap for web archives, listing all (or a subset of) URLs of which an archive has one or more archival copies.
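As a rough illustration of the idea, the following minimal Python sketch shows how a Bloom filter can record a set of URLs and answer membership queries. It is not the project's framework: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; all parameters are hypothetical defaults."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly held"; False means "definitely not held".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Example holdings (placeholder URLs, not real archive data).
holdings = BloomFilter()
for url in ["http://example.com/", "http://example.org/about"]:
    holdings.add(url)

print("http://example.com/" in holdings)
```

A Bloom filter never reports a stored URL as missing, but it can occasionally report a URL as present when it is not, so a positive answer would still need to be confirmed against the archive itself.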
Improving the Dark and Stormy Archives Framework by Summarizing the Collections of the National Library of Australia
Project lead: Michael L. Nelson, Old Dominion University Department of Computer Science
Project partners: Los Alamos National Laboratory Research Library & National Library of Australia
Funding: 50,000 USD
The Dark and Stormy Archives (DSA) project provides storytelling solutions to improve the understanding of web archive collections. Our goal is to provide a summary of a collection in the form of social media storytelling that describes a web archive collection sufficiently for a user to decide if that collection will likely contain pages of interest.
Discretionary Funding Programme 2019-2020
Archives Unleashed Datathon at the BnF – Cancelled
Lead Institution: Bibliothèque nationale de France (BnF)
Project partners (IIPC): KBR / The Royal Library of Belgium and National Library of Luxembourg (BnL)
Project partner: Archives Unleashed Project
Funding: 6,000 USD
The aim of the project is to promote the use of web archive collections among researchers. To achieve this goal, the partner institutions will organise a datathon on web archive collections coming from francophone national libraries with a legal deposit mission. The datathon will be led by Archives Unleashed and will use the datasets from BnF, KBR and BnL.
Asking questions with web archives – introductory notebooks for historians
Project lead (IIPC member institution): Andrew Jackson, British Library
Project co-lead & developer: Tim Sherratt, University of Canberra
Project partners: National Library of Australia & National Library of New Zealand
Funding: 3,500 USD
This project aims to create a set of Jupyter notebooks that will demonstrate how specific historical research questions can be explored by analysing data from web archives. The notebooks will be targeted at researchers who have limited understanding of, or interest in, the technology of web archives, but want to do more than simply browse snapshots.
LinkGate: Core Functionality and Future Use Cases
Project lead: Youssef Eldakar, Bibliotheca Alexandrina
Project partner: National Library of New Zealand
Funding: 24,439 USD
This project aims to develop the core functionality of a scalable link visualization environment and to document potential research use cases within the domain of web archiving for future development. While tools such as Gephi exist for visualizing linked data, they lack the ability to operate on data that goes beyond the typical capacity of a standalone computing device. This new link visualization environment would operate on data kept in a remote data store, enabling it to scale up to the magnitude of a web archive with tens of billions of web resources.
IIPC Tools Development Projects
BROWSER-BASED CRAWLING SYSTEM FOR ALL (2022-2023)
Project coordinators: Tools Development Portfolio Leads & IIPC SPO
Project leads: Anders Klindt Myrvoll, Royal Danish Library, Andrew Jackson, British Library, Ben O’Brien, National Library of New Zealand & Lauren Ko, University of North Texas
Project developer: Ilya Kreymer, Webrecorder.net
Funding: 30,000 USD (2022)
Development of the “User-Friendly High Fidelity Browser-Based Crawling System for All”, a flexible, browser-based high fidelity crawling system driven by a full-featured user interface and accessible to curators and web archivists at any institution. The crawling system will focus on enabling the capture of complex, dynamic websites.
SUPPORT FOR TRANSITIONING TO PYWB (2020-2021) – Completed
Project coordinators: Tools Development Portfolio Leads & IIPC SPO
Project lead and developer: Ilya Kreymer, Webrecorder.net
Funding: 30,000 EUR
This project supports transitioning to next generation web archive replay tools with Webrecorder pywb. Many institutions have long relied on OpenWayback as a replay solution; however, as it has aged, it has not kept pace with the ability to faithfully render all aspects of the evolving web. This supported work on pywb, intended to ease a transition to the higher fidelity replay tool, aims to develop features that reflect functionality institutions have in their existing replay systems and to provide documentation to facilitate a migration. Work will be in three phases. The first phase provides support for migrating from common replay scenarios in OpenWayback to pywb. The second phase develops APIs to support modularity around index and WARC store solutions, implements various access controls, and ensures multilingual support. The final phase focuses on the user interface, providing guides on styling and embedding pywb, in addition to enhancing banner navigation and providing a calendar display.
Ongoing Projects
COLLABORATIVE COLLECTIONS
Project coordinators: Nicola Bingham, The British Library & Alex Thurman, Columbia University Libraries, Content Development Group Co-chairs
The IIPC Content Development Group will continue collaborative collections in 2023 via the IIPC Archive-It account. Current collections include new crawling for the ongoing Intergovernmental Organizations and Novel Coronavirus (COVID-19) collections as well as the War in Ukraine collection.
To subscribe to the CDG mailing list, please email communications@iipc.simplelists.com
TRAINING CURRICULUM DEVELOPMENT
Project coordinators: Lauren Baker, Library of Congress, Claire Newing, National Archives, UK and Maria Ryan, National Library of Ireland, Training Working Group (TWG) Co-chairs
In 2022 the following activities will be supported by the IIPC:
- promoting the beginner module;
- organising a workshop;
- planning next module.
To subscribe to the TWG mailing list, please email communications@iipc.simplelists.com
IIPC TECHNICAL SPEAKER SERIES
Project coordinators: Jefferson Bailey, Internet Archive and Olga Holownia, IIPC SPO
The IIPC invites members (or member organisations) to present 30-60 minute online webinars on new, recent, or innovative technical projects within their organisations. The series is not intended to be training or workshop oriented, but instead to provide an opportunity for members to disseminate information and showcase their work on internal technical projects that have relevance to the broader IIPC community. Speakers are selected through direct recruitment and a forthcoming open call for proposals. Small stipends are available for speakers, if needed.
IIPC RESEARCH SPEAKER SERIES
Project coordinator: IIPC SPO
The Research Speaker Series focuses on the research use of web archives. The webinars feature presentations of use cases, current collaborative projects and new tools for researchers.
Past Projects
MEMBERSHIP SURVEY
Project coordinators: Barbara Sierman, National Library of the Netherlands (Membership Engagement Portfolio Lead), Emmanuelle Bermès, National Library of France, Abbie Grotke, Library of Congress, Aija Vahtola, National Library of Finland and Peter Webster, Webster Research & Consulting
The Membership Engagement Survey, “Where can I find my IIPC friends”, was intended to foster collaboration between IIPC members, based on information related to their web archiving activities, staff and techniques. The results were presented at and used as input into the General Assembly in Wellington and in Zagreb. The survey was designed by Barbara Sierman, KB, and Birgit Nordsmark Henriksen, the Royal Danish Library, with inputs from the IIPC Steering Committee members and PCO.
PRESERVATION WORKING GROUP’S DATABASES
Project coordinators: Tobias Steinke, German National Library and Grace Thomas, Library of Congress, Preservation Working Group Co-chairs
The Preservation Working Group maintained a database of work packages on formats, software, web environmental scans and relevant bibliographies.
CROWDSOURCING WORKSHOP & USE CASES
The project aimed to investigate how crowdsourcing web archiving activities may begin to redress that balance and increase the amount of manpower available throughout all stages of the web archiving workflow in member institutions.
The IIPC Harvesting Practices Survey was developed in order to understand, analyze, and collate the Internet archiving processes and experiences amongst IIPC members. The objective was to encourage and support memory institutions everywhere to address archiving and preservation of web resources by providing a benchmark and giving an overview of current web archiving practices.
The primary goal of the project was to evaluate the Twittervane – a prototype application, which is capable of analyzing Twitter feeds and determining which websites are shared most frequently around a given theme over a given time period.
HOW TO FIT IN? INTEGRATE A WEB ARCHIVING PROGRAM IN YOUR ORGANIZATION
This IIPC-sponsored workshop was held at the Bibliothèque nationale de France (26-30 Nov. 2012). The aim was to investigate the challenges and methods involved in implementing web archiving in all mainstream activities of a heritage institution: general institution strategy, acquisition practices, IT operations, preservation, access.
The overall goal of the project was to enhance existing tools in order to ease the adoption of WARC as the preferred archiving format for digital preservation. To accomplish this, two applications were chosen that would cover the entire digital preservation workflow: JHove2 and NetarchiveSuite.
The Live Archiving Proxy (LAP) project was a collaboration between Ina and Netarkivet.dk to build an HTTP proxy that would be able to capture the traffic that flows through it, and delegate the handling of the captured data to a writer using a simple network protocol. The goal was to be able to write the captured traffic into any kind of archive format using any computer language.
The goal of the project was to aggregate the metadata of the distributed archives of the IIPC, and to provide 1) Memento-based access to the holdings of open archives, 2) knowledge of the holdings of restricted archives and 3) knowledge to IIPC members of the holdings of totally closed archives.
The University of North Texas College of Information sponsored a 3-year award to support doctoral studies in its Interdisciplinary Information Science Ph.D. Program.
The purpose of the project was to gather expert advice, assistance and guidance in the processes of migration from Heritrix 1 to Heritrix 3 and setting up distributed crawls with Heritrix 3.
STATISTICS AND QUALITY INDICATORS FOR WEB ARCHIVING
In 2009, the ISO Technical Committee 46 (Information and Documentation) decided to set up a working group on “Statistics and Quality Indicators for Web Archiving”. The group delivered a Draft Technical Report (PDF) in 2013.
Prototype/Investigatory project by the British Library to use Twitter to build a web archive collection.
The main goal of the WARC Tools project was to facilitate and promote the adoption of the WARC file format for storing web archives by the mainstream web development community by providing an open source software library, a set of command line tools, web server plug-ins and technical documentation for manipulation and management of web archive files, or WARC files.
WAYBACK, HERITRIX AND NUTCHWAX DOCUMENTATION
2009 project led by the Internet Archive that documented NutchWAX, Heritrix, and Wayback.
WEB ARCHIVE PROFILING VIA SAMPLING
Research project looking at how archives respond to queries for archived content and over time build up a profile of the top-level domains (TLDs), Uniform Resource Identifiers (URIs), content language, and temporal spread of the archive’s holdings.
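One part of such a profile, the distribution of top-level domains, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's method: the function name `profile_tlds` and the sample URIs are hypothetical, and the TLD is taken naively as the last dot-separated label of the hostname.

```python
from collections import Counter
from urllib.parse import urlsplit


def profile_tlds(uris):
    """Count top-level domains across a sample of URIs (hypothetical helper)."""
    counts = Counter()
    for uri in uris:
        host = urlsplit(uri).hostname
        if host and "." in host:
            # Naive TLD extraction: the last label of the hostname.
            counts[host.rsplit(".", 1)[1]] += 1
    return counts


# Placeholder sample standing in for URIs observed in archive responses.
sample = [
    "http://example.com/a",
    "https://www.example.org/",
    "http://example.com/b",
]
profile = profile_tlds(sample)
print(profile.most_common())
```

The same counting approach could be extended to other profile dimensions, such as content language or capture year, given the corresponding metadata for each sampled URI.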