Working groups

IIPC members join working groups that engage in short- and long-term projects to advance the practice of web archiving.

Active working groups

CONTENT DEVELOPMENT WORKING GROUP

Co-chairs: Nicola Bingham, The British Library & Alex Thurman, Columbia University Libraries

RESEARCH WORKING GROUP

Co-chairs: Ben O’Brien, National Library of New Zealand; Olga Holownia, IIPC & Grace Bicho, Library of Congress

TRAINING WORKING GROUP

Co-chairs: Lauren Baker, Library of Congress; Claire Newing, National Archives, UK & Kody Willis, Internet Archive

Past working groups

PRESERVATION WORKING GROUP

Chair: Tobias Steinke, Deutsche Nationalbibliothek (German National Library)

The Preservation Working Group (PWG) focused on policy, practices and resources in support of preserving the content and accessibility of web archives. The PWG aimed to understand and report on how approaches used for other kinds of digital resources might be applied to web archives, and on the special characteristics of web archives that might require new approaches. It provided recommendations for additions or enhancements to tools, standards and practice guidelines, and for possible further studies and research.

The preservation working group mandate

Characterize large scale web archives in order to
- Identify relevant approaches, standards and practices already used for preservation of other digital assets
- Report on how they might be used with archived web resources and/or
- Identify the gaps and promote new approaches.

Make recommendations for enhancements or additions to tools, standards, practices, guidelines, testing, and possible further studies/research. These recommendations may be intended for IIPC members, other working groups, institutions and members of the digital preservation community, or tool developers and vendors.
Design projects related to web archive preservation and propose them to the Steering Committee for IIPC funding.
Promote recognition of the unique requirements for preserving archived web resources that are not met by other preservation programs for digital assets.

Collaboration was key to the working group’s activities: it worked closely with other IIPC working groups and the community at large.

Documentation

Authors: Gina Jones, Clément Oury

PRESERVATION WORKING GROUP TERMS OF REFERENCE, JAN 2010

FORMAT IDENTIFICATION FOR WEB ARCHIVES, POSTER PRESENTATION, IPRES 2010

HARVESTING WORKING GROUP

Chair: Kristinn Sigurðsson, Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland)

The Harvesting Working Group’s primary focus was the development of web harvesting technologies, particularly around the Internet Archive’s Heritrix web crawler. One major area of work was a smart crawler. Areas of focus included:

supporting the open source Heritrix crawler,
development of a smart crawler and improving harvesting performance,
development and support of the WARC file format,
best practices and databases for sharing crawl information in bulk or selective harvesting,
feature requests for the crawler,
harvesting the deep web,
harvesting video and streaming media.

ACCESS WORKING GROUP

Co-chairs: Nicholas Taylor (Stanford University), Daniel Gomes (Portuguese Web Archive at FCCN-FCT)

The Access Working Group (AWG) focused on issues relevant to providing access to web archives. The group consisted of individuals from IIPC member institutions who worked together towards solutions to common problems. The group also aimed to provide a forum in which IIPC members could share their experiences, establish common goals and inform their own development. In addition to technical research and development, the group recognised the legal, ethical and economic aspects of access. End-user, administrative and curatorial access to web archives all formed part of the group’s considerations.

The Group generally met twice a year, once at the IIPC General Assembly (May) and once during the International Conference on Preservation of Digital Objects (iPRES) (fall).

FOCUS AREAS

Understanding and defining user requirements for access
Resource discovery, including full-text search and innovative ways of searching web archives
Access to multimedia content within archived websites
Tools for analysis of structure and content of web archives
Identification and documentation of web archive use cases
Technology watch

PROJECTS

Collaborative collection on 2014 Winter Olympics
Following the success of previous efforts to preserve web content relating to the 2010 and 2012 Olympic Games, members of the IIPC are again working together to archive content relating to the 2014 Winter Olympics. The Games will be held in Sochi, Russia in February 2014 and the Paralympic Games in March 2014. The Internet Archive and the University of North Texas will support the project, by crawling the selected seeds and supporting the use of an online nomination tool. It is hoped that the project will enable institutions to continue to experiment with tools and processes that facilitate collaborative definition, collection and accessibility of web data and to create a data set that can be shared/accessed by all IIPC members.

COMPLETED PROJECTS

Collaborative gather and access to 2010 Winter Olympics. Lead: Kris Carpenter, Internet Archive
QA Assessment, Workflow optimization for Manual QA of Web collections. Lead: National Library and Archives of the Netherlands
Backwards compatibility for NutchWAX with .10 release. Lead: Aaron Binns, Internet Archive
Multi-lingual support in Nutch/NutchWAX for Japanese. Lead: Masayuki Asahara, National Diet Library, Japan
Collaborative gather and access to 2012 Summer Olympics. Lead: Helen Hockx-Yu, British Library
Memento experimentation and service integration (Beta). Lead: Abbie Grotke, Library of Congress
Evaluation of alternative full-text search platforms. Lead: Kris Carpenter Negulescu, Internet Archive
Research use case of web archives. Lead: Claude Mussou, Ina