WARC Tools project

Status: Past Project

Project documentation

Authors: Hanzo

 WARC TOOLS

SEARCH TOOLS

 

Background

WARC TOOLS PHASE I AND PHASE II

The main goal of the WARC Tools project is to facilitate and promote adoption of the WARC file format for storing web archives among the mainstream web development community. To that end, the project provides an open source software library, a set of command line tools, web server plug-ins, and technical documentation for manipulating and managing web archive files, or WARC files.

This project has delivered a core software library called “libwarc”, together with a set of end user command line tools, extensions to existing tools, and simple web applications for accessing WARC content. In addition, all the libraries have APIs and dynamic language bindings. The library and tools are scriptable (command lines in shell scripts, dynamic language bindings to the library) and programmable (dynamic language bindings, Java packages, and the C library itself).

Together, these deliverables are known as “WARC Tools” and are available as free software from the WARC Tools Project Homepage: http://code.google.com/p/warc-tools/

In parallel, Hanzo have developed an extension to WARC Tools, known as “Search Tools”, that provides full-text and metadata search of WARC files. It is also available as free software from the Search Tools Project Homepage: http://code.google.com/p/search-tools/

Together, these projects offer a compelling implementation of the WARC standard and a robust engineering foundation for further development of tools and applications centred on WARC files and their usage.

WARC TOOLS PHASE III

Following the development of libwarc and the associated WARC Tools in phases I and II, Hanzo will implement a third development phase, WARC Tools Phase III. This phase will build upon the original libwarc, extending the collection of WARC Tools and implementing a full migration application. Phase III will include community participation in both the specification and the testing of the tools and applications, with contributions coming from a number of International Internet Preservation Consortium (IIPC) member institutions.

The Phase III implementation will follow the original philosophy of providing powerful tools that enable crawl engineers, web archivists, researchers and other WARC users to easily manipulate and explore collections of web archive content without needing to write complex low-level code.
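
To give a sense of the low-level record handling that such tools are meant to spare users, the following is a minimal sketch in plain Python. It is not libwarc and does not use any WARC Tools API. The function name read_warc_records and the sample record are hypothetical and exist only for illustration. The sketch assumes an uncompressed stream (real WARC files are often gzip-compressed record by record) and ignores folded header lines.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Hypothetical helper for illustration only; it is not part of libwarc.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines that follow each record
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)

        # Named header fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Content-Length gives the size of the record's content block in bytes.
        length = int(headers["Content-Length"])
        payload = stream.read(length)
        yield headers, payload


# A single made-up record, used only to exercise the sketch.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Date: 2009-01-01T00:00:00Z\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers, payload in read_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Even this small reader has to deal with header parsing, byte counts and record separators by hand, which is the kind of detail a library and command line tools can hide from crawl engineers and archivists.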