Developing Bloom Filters for Web Archives’ Holdings

Project lead: Martin Klein, Los Alamos National Laboratory (LANL) Research Library

Project partners: National and University Library in Zagreb (NSK)

Funding:
24,741 USD

Brief description of the project
Goals, outcomes and, deliverables
How the project furthers the IIPC strategic plan
Detailed description of the project
Project schedule of completion
Project updates
Final report

Brief description of the project

We will develop a framework for web archives to create Bloom filters based on their holdings of archived web resources. A Bloom filter can be thought of as a sitemap for web archives, listing all (or a subset of) URLs of which an archive has one or more archival copies. Unlike sitemaps, however, URL strings are hashed before ingested into the Bloom filter, which means URLs are not shared in plain text. The filter does, however, allow queries for a URL to confirm if an archive indeed has one or more archival copies of that URL. Standard protocols such as Memento and various different implementations of CDX APIs provide related services but the key differences are that Bloom filters provide a way for archives to passively share information about their archival holdings and the filters are, compared to CDX indexes, small in size and can be transferred to an interested and trusted 3rd party. When, for example, implemented in a federated search service across multiple web archives, this solution has the potential to significantly reduce network traffic for unnecessary lookup requests against archives.

We propose a collaborative project between LANL and HAW (Croatian Web Archive (HAW) of the National and University Library in Zagreb (NSK), developed by NSK and University Computing Centre University of Zagreb SRCE) to develop software to create Bloom filters, evaluate the scalability of the approach, pilot a search service based on Bloom filters, and design a framework for archives to share their filters with trusted parties.

Goals, outcomes and, deliverables

The ultimate goal is to provide web archives with a software framework to generate Bloom Filters. These filters enable archives to passively share the URLs of their archival holdings. Having this information available makes a variety of 3rd party applications possible such as federated search services across archives, archive-agnostic cataloguing services of archived web resources, and synchronized crawling efforts avoiding unwanted duplication.
Specifically, the project deliverables are:

A software package that web archives can run on their end to create Bloom Filters,
A pilot federated search system for archived web resources based on Bloom Filters,
An investigation of scalability and optimization parameters for the creation of Bloom Filters,
A feasibility analysis of incremental Bloom Filters, and
The conceptual design of a framework for web archives to share Bloom Filters and their incremental updates.

How the project furthers the IIPC strategic plan

The project primarily supports the first and second of the three portfolio goals of the strategic plan. Specifically, it supports the goal of membership engagement and both short term actions as it helps to facilitate collecting information on member institutions’ holdings and shapes this information to be shared in a reusable form.
The project further supports the goal of tools development by fostering “…a rich and interoperable tool environment based on modular pieces of software and a consensus set of APIs…”. To the extent that the first deliverable mentioned above will be released to the community as open source and at this first stage be based on CDX indices, the project also supports the long term goal to build on standards and traditional core tools of the IIPC community and therefore allowing organizations to leverage their previous investment in their archival holdings and infrastructure. Part of the goal for future development is to expand the software to be usable for archives that do not maintain CDX files.

Detailed description of the project

Making web archives’ holdings public can lead to increased use, demonstrated value and relevance in the community and towards funding sources. It can further lead to painting an overall picture of what is (not) archived and therefore help optimize crawling efforts by avoiding undesired duplication. Previous IIPC-supported projects such as “Web archive profiling via sampling” underscore the potential of efforts in this realm.
However, from experience we know that, for various reasons, not all web archives are keen to proactively share information (URLs, seeds, or even metadata) of their index. In contrast, the success and wide adoption of interoperability standards such as the Memento protocol prove that web archives are willing to offer search capabilities of their index. As such, a user is able to locate archived copies of a web resource if she submits a URL of interest. This responsive (and therefore somewhat passive) delivery of information is common practice for most web archives around the world. LANL has utilized the power of Memento to establish a federated search across all Memento-compliant web archives available at http://timetravel.mementoweb.org/. To lower the false positives rate and therefore avoid unnecessary requests against web archives without archival copies of a submitted URL, we have established optimization methods such as machine learning models to anticipate archival holdings per archive. However, these models are imperfect as they are trained on log files.

Bloom filters are a promising approach to help archives share information about the URLs of which they have archival copies. In our context, Bloom filters can be utilized as a hash table that allows the precise lookup of URLs and returns a binary response (yes/no) as to whether archival copies of the URL exist. As such, we can think of it as a sitemap for web archives but since the content is hashed, it is not a “surrender of information” but can only passively respond to inquiries for URLs.

Applying Bloom filters in this context is a novel approach, so we have no baseline and very little data about scalability and performance in terms of reduction of false positives. This project therefore entails the implementation of software to create Bloom filters based on CDX files, a pilot to showcase its potential, and an evaluation of its performance. More specifically, we anticipate 4 phases in this project:

Phase 1 – Bloom Filter development
LANL and HAW teams will be responsible for:

Knowledge exchange of the archival environment and utilized software at the HAW
The development of software to create Bloom Filters
Testing and debugging of software in LANL and HAW environments
Investigating approaches to create incremental Bloom filters

Phase 2 – Scalability Analysis
HAW and LANL teams will be responsible for:

Measure scalability of the Bloom Filter approach with varying CDX index sizes
Evaluate multiple dimensions, for example, creation time of Bloom filters, resulting Bloom filter size, URL lookup time for different filter sizes or hash functions
Investigate confidence parameters for Bloom filters to achieve a target precision and avoid false positive responses, while taking the above dimensions into consideration

Phase 3 – Pilot Implementation
LANL team will be responsible for:

Implementing a pilot URL lookup service, similar to the currently available Memento TimeTravel service, based on newly created Bloom filters from the HAW
A preliminary comparative performance analysis based on number of search requests saved, represented as reduced false positive rate

Phase 4 – Bloom Filter Sharing
HAW and LANL teams will be responsible for:

Evaluation of approaches for the HAW and other web archives to reliably share created Bloom filters and their increments
Implementation and testing of workable framework

LANL and HAW as project partners will collaboratively manage the project utilizing the Open Science Framework platform and share software source code via GitHub or other Git platforms. The project leads, in collaboration with all project partners, will monitor progress and performance and keep track of milestones and key deliverables. We will measure software completion by closed issues on the Git platform and successful test runs in both environments (LANL and HAW). In addition, we will communicate our updates via presentations to the IIPC community, at international conferences and workshops, via blog posts, and social media.

We do not anticipate any significant technological risks in completing this work. However, the success of the framework is ultimately dependent on the adoption by web archives. The HAW has committed to piloting this work in the realm of this project and we are optimistic that our pilot to showcase the power of the implementation (2nd deliverable) as well as our performance and scalability analysis (3rd and 4th deliverables) will help convince others to adopt this approach.

Project schedule of completion

January to June: Phase 1 – Bloom Filter development

Project partner initial meetings to understand HAW’s web archiving environment, assess potential software and hardware constraints
Development work on Bloom filter creation software, based on local (LANL) CDX files, to be executed at HAW
Make software configurable via configuration file to adjust to typical web archiving environments such as that of the HAW
Software testing and debugging (ongoing beyond phase 1 if needed)
Investigate approaches to implement incremental Bloom filters

March to June: Phase 2 – Scalability Analysis

(in partial overlap with phase 1) Evaluation of scalability of approach
Design and conduct tests to measure impact of parameters such as filter size and hash functions
Adjustment of Bloom filter software based on analysis results, including further testing and debugging

July to October: Phase 3 – Pilot Implementation

Utilize created Bloom filter from HAW for pilot implementation of search system
Pilot design allows for a federation of searches across multiple archives, each with their own Bloom filter created
Pilot allows for use of newly shared Bloom filters from archives (hence overlap with phase 4)

August to October: Phase 4 – Bloom Filter Sharing

(in partial overlap with phase 3) Design and test of a framework to share created Bloom filters with trusted parties such as LANL

November to December: Wrap up