Staff exchange

Status: Past Project




Migration to and distributed crawling with Heritrix 3

Proposing institution: National Library of the Czech Republic (NLCR)

Purpose of the project: To gather expert advice, assistance and guidance on migrating from
Heritrix 1 to Heritrix 3 and on setting up distributed crawls with Heritrix 3.

Description of the project: Onsite training of the NLCR’s crawl engineer at one of the IIPC member institutions that use Heritrix 3, for a duration of one to two weeks. The engineer will talk to experts from the host institution, observe their practices, and get hands‐on experience as well as assistance and guidance in setting up and configuring Heritrix 3.

The goal of the visit is two‐fold:
(a) to migrate settings and profiles currently used by NLCR in Heritrix 1 to Heritrix 3, and (b) to prepare an optimal strategy and Heritrix set‐up for running distributed crawls over several machines.

Benefits to IIPC: The project will be helpful for those members who are evaluating migration to
Heritrix 3, giving them better insight into the migration process, its duration and the difficulties
they may encounter. Both the migration and the distributed crawl set‐up will be thoroughly documented
and shared for the benefit of the whole IIPC membership and the archiving community in general.

Motivation: Although Heritrix 3 has been available for some time now, many IIPC members are still using Heritrix 1. A possible explanation is that migrating settings and configurations from Heritrix 1 to Heritrix 3 is neither easy nor straightforward. NLCR has also been considering migrating from its established archiving workflow based on Heritrix 1 to an approach based on Heritrix 3. Any institution migrating to Heritrix 3 needs to transform the crawl configurations it uses with Heritrix 1 into the form required by Heritrix 3, which is based on quite a different configuration paradigm. Critical evaluation of good practices in crawl configuration is therefore crucial. It is also important to gain an understanding of, and skills in, operating Heritrix 3, which is more code‐oriented than the UI‐based Heritrix 1.

Several IIPC institutions are also engaged, or are considering engaging, in whole‐domain (large‐scale) harvesting. Many of the institutions that already carry out broad crawls, especially smaller or newer IIPC members (including NLCR), do so using just a single machine, although parallel harvesting across multiple crawling machines is recommended. The reason is, again, a lack of skills in setting up distributed crawling. Such skills would help them accomplish whole‐domain crawling in a shorter time and with less data redundancy.

Expected outcome: The output will be documented as a case study describing, step by step, the processes of migration and of setting up distributed crawls, as well as pitfalls and potential problems. It will be shared with the IIPC community as a document published on the members’ area forum and as a presentation at the GA in Washington, D.C. We also suggest making it available to the broader web archiving community and other interested parties as a journal article or conference presentation, to be discussed and agreed upon with the EdCom and the PO.

Evaluation process: NLCR will write an evaluation report to the EdCom on the success of the migration to Heritrix 3 in its production crawls and on the results of the distributed crawling tests.