IIPC members join working groups that engage in short- and long-term projects to advance the practice of web archiving.
Active working groups
Co-chairs: Abbie Grotke, Library of Congress; Alex Thurman, Columbia University Libraries
Co-chairs: Tobias Steinke, Deutsche Nationalbibliothek (German National Library); Grace Thomas, Library of Congress
Past working groups
HARVESTING WORKING GROUP
Chair: Kristinn Sigurðsson, Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland)
The Harvesting Working Group’s primary focus was the development of web harvesting technologies, particularly around the Internet Archive’s Heritrix web crawler. The development of a smart crawler was a major area of work. The group’s areas of focus included:
- supporting the open source Heritrix crawler,
- development of a smart crawler and improving harvesting performance,
- development and support of the WARC file format,
- best practices and databases for sharing crawl information in bulk or selective harvesting,
- feature requests for the crawler,
- harvesting the deep web,
- harvesting video and streaming media.
ACCESS WORKING GROUP
Co-chairs: Nicholas Taylor (Stanford University), Daniel Gomes (Portuguese Web Archive at FCCN-FCT)
The Access Working Group (AWG) focused on issues relevant to providing access to web archives. The group consisted of individuals from IIPC member institutions who worked together towards solutions to common problems. The group also aimed to provide a forum in which IIPC members could share their experiences, establish common goals and inform their own development. In addition to technical research and development, the group recognised the legal, ethical and economic aspects of access. End-user, administrative and curatorial access to web archives all formed part of the group’s considerations.
The Group generally met twice a year, once at the IIPC General Assembly (May) and once during the International Conference on Preservation of Digital Objects (iPRES) (fall).
- Understanding and defining user requirements for access
- Resources discovery including full-text and innovative ways of searching web archives
- Access to multimedia content within archived websites
- Tools for analysis of structure and content of web archives
- Identification and documentation of web archive use cases
- Technology watch
- Collaborative collection on 2014 Winter Olympics
Following the success of previous efforts to preserve web content relating to the 2010 and 2012 Olympic Games, members of the IIPC are again working together to archive content relating to the 2014 Winter Olympics. The Games will be held in Sochi, Russia in February 2014 and the Paralympic Games in March 2014. The Internet Archive and the University of North Texas will support the project by crawling the selected seeds and supporting the use of an online nomination tool. It is hoped that the project will enable institutions to continue to experiment with tools and processes that facilitate collaborative definition, collection and accessibility of web data, and to create a data set that can be shared and accessed by all IIPC members.
- Collaborative gather and access to 2010 Winter Olympics. Lead: Kris Carpenter, Internet Archive
- QA Assessment, Workflow optimization for Manual QA of Web collections. Lead: National Library and Archives of the Netherlands
- Backwards compatibility for NutchWAX with .10 release. Lead: Aaron Binns, Internet Archive
- Multi-lingual support in Nutch/NutchWAX for Japanese. Lead: Masayuki Asahara, National Diet Library, Japan
- Collaborative gather and access to 2012 Summer Olympics. Lead: Helen Hockx-Yu, British Library
- Memento experimentation and service integration (Beta). Lead: Abbie Grotke, Library of Congress
- Evaluation of alternative full-text search platforms. Lead: Kris Carpenter Negulescu, Internet Archive
- Research use case of web archives. Lead: Claude Mussou, Ina