Foster WARC usage in scalable Web Archiving workflows using JHOVE2 and NetarchiveSuite
Context and baseline
Since May 2009, memory institutions and other digital archiving organizations have been able to use the WARC (Web ARChive) file format, officially released as an international standard (ISO 28500:2009), to store and preserve documents harvested on the web. WARC is an extension of the ARC format, which has been used extensively since 1996 by the Internet Archive and by most members of the IIPC. These institutions recognized the need to extend the ARC format with new capabilities, notably the recording of HTTP requests, the recording of local metadata, the allocation of a unique identifier to every contained file, the management of duplicates and migrated records, and the segmentation of records.
International standardization was a critical step towards the wide adoption of the WARC format. As part of this effort, the IIPC also set up a WARC usage task force in November 2009 to write implementation guidelines, which were delivered and approved by the Preservation Working Group the following year. Today, however, because production and preservation workflows have only recently been settled and are extensively used, many members still use the ARC format for production purposes while acknowledging the need to transition to WARC.
The NetarchiveSuite community now proposes to develop the usage of the WARC format in two directions: 1) give the ability to ingest WARC files into digital preservation workflows using JHOVE2, 2) study and implement WARC in a scalable production workflow using NetarchiveSuite as an example.
Part 1: WARC files into digital preservation workflows
JHOVE2 is open source software for format-aware characterization of digital objects. It enables format identification, feature extraction, validation and assessment. The JHOVE2 project is a collaborative undertaking of the California Digital Library, Portico, and Stanford University, and the software is made freely available under the terms of the BSD open source license. This part of the proposal aims to add JHOVE2 support for the following functions in order to make it a more useful tool for web archiving.
Module for the WARC format: Characterization performed at the record level, including both record headers and blocks (warcinfo, response, resource, request, metadata, revisit, conversion, continuation). The proposal includes a significant amount of resources for developing this module. This will leave enough time to develop the baseline WARC module and also to add advanced functionality based on input from the IIPC community.
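As an illustration of what record-level characterization involves, the following is a minimal Python sketch, not the proposed JHOVE2 module (which would be written against JHOVE2's own Java framework). It reads one uncompressed WARC record, checks the four fields that ISO 28500 requires in every record, and compares the record type with the eight types listed above. The names `read_record`, `WARC_RECORD_TYPES` and `MANDATORY_FIELDS` are hypothetical, and the sketch assumes exact field-name spelling and no folded header lines.

```python
import io

# The eight record types defined by ISO 28500:2009.
WARC_RECORD_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}

# Named fields that every WARC record must carry.
MANDATORY_FIELDS = ("WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length")


def read_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (headers, block, problems). Hypothetical helper: it assumes
    exact field-name case and does not handle folded header lines.
    """
    problems = []
    version = stream.readline().rstrip(b"\r\n").decode("utf-8")
    if not version.startswith("WARC/"):
        problems.append("missing WARC version line")

    # Named fields run until the first empty line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    for field in MANDATORY_FIELDS:
        if field not in headers:
            problems.append("missing mandatory field " + field)

    record_type = headers.get("WARC-Type")
    if record_type is not None and record_type not in WARC_RECORD_TYPES:
        problems.append("unrecognized record type " + record_type)

    # The block is exactly Content-Length bytes long.
    raw_length = headers.get("Content-Length", "")
    length = int(raw_length) if raw_length.isdigit() else 0
    block = stream.read(length)
    if len(block) != length:
        problems.append("block shorter than Content-Length")
    return headers, block, problems


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Date: 2011-01-01T00:00:00Z\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
headers, block, problems = read_record(io.BytesIO(sample))
```

A real module would go further, for example by validating field values, handling per-record GZIP compression, and characterizing the payload inside each block.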
Integration of the ARC and GZIP modules developed by BnF into the core of JHOVE2. This project complements and continues the effort launched in 2010 to develop modules for the JHOVE2 project and software led by the California Digital Library. BnF, one of the stakeholders of the present proposal, has been actively involved in this project, for which it has spent a dedicated budget outsourced to a private company, which is in charge of building BnF's digital repository archiving and preservation system. ATOS and BnF have developed ARC and GZIP modules for JHOVE2. This development took place in cooperation with CDL and with the support of the IIPC Program Officer.
Part 2: Study and implement WARC in a scalable production workflow: the NetarchiveSuite environment
The NetarchiveSuite is a complete open source web archiving software package. It gives the ability to prepare, schedule, run and monitor harvests of websites. It also supports quality assurance and the preservation of harvested content. NetarchiveSuite is used for production purposes, and is developed and maintained by the NetarchiveSuite community, which currently includes the State and University Library, Aarhus, Denmark; the Royal Library, Copenhagen, Denmark; the National Library of France; and the National Library of Austria. The community hopes to extend to new partners in the future. This part of the proposal aims at:
- studying the implementation of the WARC format into the Heritrix web crawler in the light of the WARC standard and IIPC WARC implementation guidelines written by the WARC Usage task force,
- as the format may be revised within the ISO in May 2012, gathering possible fixes or evolutions needed by IIPC members and updating the guidelines if necessary (BnF, as convenor of the ad hoc standardization group at ISO and co-lead of the PWG, could help with this),
- studying and documenting the impact of WARC in harvesting and post-harvesting processes (such as indexing and feeding metadata into a curator tool), which would benefit all local curator tools,
- implementing WARC into NetarchiveSuite, while keeping ARC compatibility alive,
- delivering a report based on the experience of the 3 partner institutions.
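
Keeping ARC compatibility alive while introducing WARC implies that post-harvesting tools must tell the two container formats apart. The Python sketch below shows one possible approach, assuming detection by leading bytes: WARC records begin with a `WARC/` version line, ARC files begin with a `filedesc://` version block, and GZIP-compressed files begin with the magic bytes `1f 8b`. The function name `sniff_container` is hypothetical and this is not NetarchiveSuite code.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"


def sniff_container(data):
    """Guess the container format from the start of an archive file.

    Returns (format, gzipped), where format is "warc", "arc" or "unknown".
    Hypothetical helper: `data` must hold at least the first complete
    gzip member when the file is compressed.
    """
    gzipped = data[:2] == GZIP_MAGIC
    if gzipped:
        # Decompress only the first few bytes of the first member.
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as member:
            head = member.read(16)
    else:
        head = data[:16]

    if head.startswith(b"WARC/"):
        return "warc", gzipped
    if head.startswith(b"filedesc://"):
        return "arc", gzipped
    return "unknown", gzipped


plain_warc = sniff_container(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n")
packed_arc = sniff_container(gzip.compress(b"filedesc://example.arc 0.0.0.0\n"))
```

Sniffing the content rather than trusting the file extension lets the same indexing and access code route each file to the matching reader during a mixed ARC and WARC transition period.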