Status: Past Project
Project: Migration to and distributed crawling with Heritrix 3
Proposing institution: National Library of the Czech Republic (NLCR)
Purpose of the project: To gather expert advice, assistance and guidance on migrating from Heritrix 1 to Heritrix 3 and on setting up distributed crawls with Heritrix 3.
Description of the project: Onsite training of NLCR's crawl engineer at one of the IIPC member institutions that use Heritrix 3, lasting one to two weeks. The engineer will talk to experts from the host institution, observe their practices, and get hands-on experience as well as assistance and guidance in setting up and configuring Heritrix 3.
The goal of the visit is twofold:
(a) to migrate settings and profiles currently used by NLCR in Heritrix 1 to Heritrix 3, and (b) to prepare an optimal strategy and Heritrix setup for running distributed crawls over several machines.
Benefits to IIPC: The project will help those members who are evaluating migration to Heritrix 3 by giving them better insight into the migration process, its duration, and the difficulties they may encounter. Both processes (migration and distributed crawl setup) will be thoroughly documented and shared for the benefit of the whole IIPC membership and the archiving community in general.
Motivation: Although Heritrix 3 has been available for some time, many IIPC members still use Heritrix 1. One possible explanation is that migrating settings and configurations from Heritrix 1 to Heritrix 3 is not an easy or straightforward task. NLCR has also been considering migrating from its established archiving workflow based on Heritrix 1 to an approach based on Heritrix 3. Any institution migrating to Heritrix 3 needs to transform the crawl configurations it uses with Heritrix 1 into Heritrix 3 configurations, which follow quite a different configuration paradigm. Critical evaluation of good practices in crawl configuration is therefore crucial. It is also important to gain an understanding of, and skills in, operating Heritrix 3, which is more code-oriented than the UI-based Heritrix 1.
Several IIPC institutions are also engaged, or are considering engaging, in whole-domain (large-scale) harvesting. Many of the institutions that already carry out broad crawls, especially smaller or newer IIPC members (including NLCR), do so using just a single machine, although parallel harvesting across multiple crawling machines is recommended. The reason is, again, a lack of skills in setting up distributed crawling. Such skills would help them accomplish whole-domain crawling in a shorter time and with less data redundancy.
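The idea of parallel harvesting with less redundancy can be illustrated with a minimal sketch. It assumes one common way of dividing work between crawlers: each host is assigned to exactly one machine by hashing its hostname, so no two machines fetch the same site. The proposal itself does not prescribe this scheme, and the names `assign_crawler` and `partition_seeds` are hypothetical, not part of Heritrix.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, crawler_count: int) -> int:
    """Return the index of the crawler machine responsible for the URL's host.

    Hypothetical helper: a stable hash of the hostname is reduced modulo the
    number of machines, so every URL on a given host maps to the same machine.
    """
    if crawler_count < 1:
        raise ValueError("crawler_count must be at least 1")
    host = urlsplit(url).hostname or ""
    # hashlib gives the same digest on every machine and every run,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % crawler_count


def partition_seeds(seeds, crawler_count):
    """Split a seed list into one bucket per crawler machine."""
    buckets = [[] for _ in range(crawler_count)]
    for seed in seeds:
        buckets[assign_crawler(seed, crawler_count)].append(seed)
    return buckets
```

Because the assignment depends only on the hostname and the number of machines, each machine can decide independently whether a discovered URL belongs to it. Changing the number of machines reshuffles most hosts, which is one reason a real set-up would need more careful planning than this sketch suggests.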
Expected outcome: The output will be documented as a case study describing the processes of migration and setting up distributed crawls step by step, as well as pitfalls and potential problems. It will be shared with the IIPC community as a document published on the member's area forum and as a presentation at the GA in Washington, D.C. We also suggest making it available to the broader web archiving community and other interested parties as a journal article or conference presentation, to be discussed and agreed upon with the EdCom and the PO.
Evaluation process: NLCR will submit an evaluation report to the EdCom on the success of the migration to Heritrix 3 in its production crawls and on the results of distributed crawling tests.