Working groups

IIPC members join working groups that engage in short- and long-term projects to advance the practice of web archiving.

Active working groups

 CONTENT DEVELOPMENT WORKING GROUP

Co-chairs: Nicola Bingham, The British Library & Shereen Tay, National Library Board Singapore

 RESEARCH WORKING GROUP

Co-chairs: Ben O’Brien, National Library of New Zealand & Olga Holownia, IIPC

 TRAINING WORKING GROUP

Co-chairs: Lauren Baker, Library of Congress; Claire Newing, National Archives, UK; Kody Willis, Internet Archive


Past working groups

PRESERVATION WORKING GROUP

Chair: Tobias Steinke, Deutsche Nationalbibliothek (German National Library)

The Preservation Working Group (PWG) focused on policy, practices and resources in support of preserving the content and accessibility of web archives. The PWG aimed to understand and report on how approaches used for other kinds of digital resources might be used with web archives, as well as on the special characteristics of web archives that might require new approaches. It provided recommendations for additions or enhancements to tools, standards, practice guidelines, and possible further studies/research.

The preservation working group mandate

  • Characterize large-scale web archives in order to:
    • Identify relevant approaches, standards and practices already used for preservation of other digital assets
    • Report on how they might be used with archived web resources and/or
    • Identify the gaps and promote new approaches.
  • Make recommendations for enhancements or additions to tools, standards, practices, guidelines, testing, and possible further studies/research. These recommendations may be intended for IIPC members, other working groups, institutions and members of the digital preservation community, or tool developers and vendors.
  • Design projects related to web archive preservation and propose them to the Steering Committee for IIPC funding.
  • Promote recognition of the unique requirements for preserving archived web resources that are not met by other preservation programs for digital assets.

Collaboration was key to the working group’s activities: it worked closely with other IIPC working groups and the community at large.

Documentation

Authors: Gina Jones, Clément Oury

 PRESERVATION WORKING GROUP TERMS OF REFERENCE, JAN 2010

 FORMAT IDENTIFICATION FOR WEB ARCHIVES, POSTER PRESENTATION, IPRES 2010

HARVESTING WORKING GROUP

Chair: Kristinn Sigurðsson, Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland)

The Harvesting Working Group’s primary focus was the development of web harvesting technologies, particularly around the Internet Archive’s Heritrix web crawler. A major area of work was the development of a smart crawler. Areas of focus included:

  • supporting the open source Heritrix crawler,
  • development of a smart crawler and improving harvesting performance,
  • development and support of the WARC file format,
  • best practices and databases for sharing crawl information in bulk or selective harvesting,
  • feature requests for the crawler,
  • harvesting the deep web,
  • harvesting video and streaming media.

ACCESS WORKING GROUP

Co-chairs: Nicholas Taylor (Stanford University), Daniel Gomes (Portuguese Web Archive at FCCN-FCT)

The Access Working Group (AWG) focused on issues relevant to providing access to web archives. The group consisted of individuals from IIPC member institutions who worked together towards solutions to common problems. The group also aimed to provide a forum in which IIPC members could share their experiences, establish common goals and inform their own development. In addition to technical research and development, the group recognised the legal, ethical and economic aspects of access. Furthermore, end-user, administrative and curatorial access to web archives all formed part of the group’s considerations.

The group generally met twice a year: once at the IIPC General Assembly (May) and once during the International Conference on Preservation of Digital Objects (iPRES) (fall).

FOCUS AREAS
  • Understanding and defining user requirements for access
  • Resource discovery, including full-text search and innovative ways of searching web archives
  • Access to multimedia content within archived websites
  • Tools for analysis of structure and content of web archives
  • Identification and documentation of web archive use cases
  • Technology watch
PROJECTS
  • Collaborative collection on 2014 Winter Olympics
    Following the success of previous efforts to preserve web content relating to the 2010 and 2012 Olympic Games, members of the IIPC are again working together to archive content relating to the 2014 Winter Olympics. The Games will be held in Sochi, Russia in February 2014 and the Paralympic Games in March 2014. The Internet Archive and the University of North Texas will support the project by crawling the selected seeds and supporting the use of an online nomination tool. It is hoped that the project will enable institutions to continue to experiment with tools and processes that facilitate collaborative definition, collection and accessibility of web data, and to create a data set that can be shared and accessed by all IIPC members.
COMPLETED PROJECTS
  • Collaborative gather and access to 2010 Winter Olympics. Lead: Kris Carpenter, Internet Archive
  • QA Assessment, Workflow optimization for Manual QA of Web collections. Lead: National Library and Archives of the Netherlands
  • Backwards compatibility for NutchWAX with .10 release. Lead: Aaron Binns, Internet Archive
  • Multi-lingual support in Nutch/NutchWAX for Japanese. Lead: Masayuki Asahara, National Diet Library, Japan
  • Collaborative gather and access to 2012 Summer Olympics. Lead: Helen Hockx-Yu, British Library
  • Memento experimentation and service integration (Beta). Lead: Abbie Grotke, Library of Congress
  • Evaluation of alternative full-text search platforms. Lead: Kris Carpenter Negulescu, Internet Archive
  • Research use case of web archives. Lead: Claude Mussou, Ina