Preservation Working Group
About the Preservation Working Group
The Preservation Working Group (PWG) focuses on policy, practices and resources in support of preserving the content and accessibility of web archives. The PWG aims to understand and report on how approaches used for other kinds of digital resources might be applied to web archives, and on the special characteristics of web archives that might require new approaches. It will provide recommendations for additions or enhancements to tools, standards, practice guidelines, and possible further studies/research.
The Preservation Working Group Mandate
- Characterize large-scale web archives in order to:
  - Identify relevant approaches, standards and practices already used for preservation of other digital assets.
  - Report on how they might be used with archived web resources, and/or
  - Identify the gaps and promote new approaches.
- Make recommendations for enhancements or additions to tools, standards, practices, guidelines, testing, and possible further studies/research. These recommendations may be intended for IIPC members, other working groups, institutions and members of the digital preservation community, or tool developers/vendors.
- Design projects related to web archive preservation and propose them to the Steering Committee for IIPC funding.
- Promote recognition of the unique requirements for preserving archived web resources that are not met by other preservation programs for digital assets.
The PWG will continue its work until standards and best practices for the preservation of archived web resources are developed and implemented across institutions.
2009-2010 Work Packages
- WP1: Tools gap analysis for formats/Study Scalability. Goal:
  - List the main formats available in web archives.
  - Test the ability of the identification/characterization tools to handle them.
  - Recommend tool enhancements for the most important formats.
- WP2: Preservation Strategies. Goal:
  - Analyze and compare different preservation strategies for web archives.
  - Provide metrics and costs (time, machines, workforce…) and analyze results.
- WP3: Browsers/Dependencies. Goal:
  - List and describe the main browsers and plug-ins over the course of time.
  - Analyze their dependencies (OS, hardware).
- WP4: Software Documentation Harvesting. Goal:
  - Harvest main software vendors’ websites to preserve information on how software should be installed and used (e.g. user manuals…).
  - Test whether software (if freely available) may be archived as well.
  - Analyze whether this can be done collaboratively.
- WP5: Crawler Documentation. Goal:
  - List and describe the crawlers that were and are used to build web archives.
  - Identify possible idiosyncratic features of the files they produce (e.g. Heritrix website…).
- WP6: Viruses. Goal:
  - Assess the risk of keeping viruses in web archives.
  - Provide scenarios to identify and discard viruses.
  - If deemed necessary, set up a project to encourage one or several AV tools to manage WARC files.
- WP7: Information Packages. Goal:
  - Build scenarios to design information packages (in the sense of the OAIS model) according to institutions’ content policies and preservation goals. This will notably cover what kind of metadata to use, their level of granularity, and their location.
  - The goal is to collect different perspectives and create a generic model, similar to the preservation workflows work package.
- WP8: Risk Assessment Review. Goal:
  - Update the PWG “table of threats”. Identify and evaluate specific risks for web archive preservation.
PWG Documents/Documentation