Web archive profiling via sampling

Status: Past Project

Project documentation

Authors: Michael L. Nelson, Herbert Van de Sompel, David Rosenthal, Sawood Alam, Lyudmila L. Balakireva, Harihar Shankar and Nicolas J. Bornand 

 

 FINAL REPORT, 2016
 ODU ARCHIVE PROFILING PROJECT PROGRESS
 WEB ARCHIVE PROFILING THROUGH CDX SUMMARIZATION @ TPDL, 2015
 PROFILING WEB ARCHIVES (slides) @ WADL, 2015
 PROFILING WEB ARCHIVES (slides)
 ARCHIVE PROFILE SERIALIZATION (slides)
 ARCHIVE PROFILER (repository)

Project proposal

Objectives

Based on our collective experience mediating access to federations of web archives, we propose to research a variety of methods for determining the holdings of an archive by sampling its contents. In particular, we will look at how archives respond to queries for archived content and, over time, build up a profile of the top-level domains (TLDs), Uniform Resource Identifiers (URIs), content languages, and temporal spread of the archive’s holdings. We will provide a less expensive and less intrusive way of generating an archive profile that converges over time to the profile produced by directly measuring an archive’s contents, and we will define a format and mechanism for archival profile serialization and dissemination. The associated techniques and formats will be suitable for both IIPC and non-IIPC archives.

There are two motivations for this project. The first is to build upon the experience of the prior Memento Archive Metadata Aggregation (MAMA) project, led by Los Alamos National Laboratory (LANL) from March 2011 to 2012. Robert Sanderson (LANL) presented the status of this project at the 2012 IIPC General Assembly. The design of the project called for shipping CDX files (i.e., metadata files that describe the WARC files generated by crawlers such as Heritrix) to LANL, analyzing which archives hold mementos (i.e., archived web pages) for each URI, and providing this information to partners so they could coordinate coverage, tune their crawls, etc.

Transferring approximately five terabytes of CDX files, and then processing them, turned out to present its own set of challenges. Even after these engineering problems were addressed, there remained the problem of transferring and processing new files after the initial baseline. This approach to sharing archival coverage requires an extraordinary and continuing level of coordination, and the focus on individual URIs is probably the wrong level of granularity. In summary, through the MAMA project we discovered that collection profiling at the level of individual URIs in CDX files is difficult to scale, difficult to sustain, and requires a level of administrator resources that not all archives possess.

The second motivation for this project is our concern for the scale-out and design of the current aggregator architecture. This was the subject of our joint Memento panel at the 2013 IIPC General Assembly: Herbert covered the current state of Memento software and standards, Michael covered the empirical measurements on existing archives, and David provided a forward-looking assessment of whether or not Memento could survive its own “success”.

We studied the implications of Memento aggregator scale-out in our paper at the 2013 Theory and Practice of Digital Libraries (TPDL) conference. The premise is that as the number of Memento-compliant archives grows, having the aggregator perform a distributed search for each URI lookup is not going to scale. There are already more than a dozen public web archives in the aggregator, and many have explicit collection policies. For example, most of the national libraries and national archives collect web sites from within their national TLD and sometimes those of countries affiliated by geography and/or shared language (e.g., Germany and Austria, Portugal and Brazil). While the Internet Archive’s Wayback Machine is likely to produce good results for most queries, the Portuguese archive, for example, might not return good results for URIs in the bbc.co.uk domain. Similarly, if we know the Icelandic archive did not begin its collection until 2005, then requests in the .is TLD for datetimes prior to 2005 are unlikely to produce results.
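
As a concrete illustration of how an aggregator could use such profiles, the minimal sketch below routes a query only to archives whose profile covers the requested TLD and datetime. The archive identifiers, profile structure, and coverage values are illustrative assumptions, not measured data or the project's eventual format.

```python
from urllib.parse import urlparse
from datetime import datetime

# Assumed profile shape: per-archive TLD coverage and earliest capture year (hypothetical values).
PROFILES = {
    "arquivo.pt": {"tlds": {"pt", "com"}, "earliest_year": 1996},
    "vefsafn.is": {"tlds": {"is"}, "earliest_year": 2005},
}

def should_query(archive_id, uri, accept_datetime):
    """Return False only when the profile rules the archive out; default to querying."""
    profile = PROFILES.get(archive_id)
    if profile is None:
        return True                                   # no profile: always ask
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld not in profile["tlds"]:
        return False                                  # archive does not collect this TLD
    if accept_datetime.year < profile["earliest_year"]:
        return False                                  # request predates the archive's holdings
    return True

print(should_query("vefsafn.is", "http://example.is/", datetime(2003, 1, 1)))  # False
```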

Scope

This is a research project, so the primary focus is on evaluating techniques and methods. Software will be developed to support our inquiry, but it will be research-quality software rather than production-quality software. Informed by our research results, we will engage the IIPC community on a serialization format, with the hope of starting a de facto standardization process, the full measure of which will likely extend beyond the one-year scope of this project.

Deliverables

The profiles for the archives in the TPDL 2013 paper were generated from a variety of URI samples: full-text queries (in the archives that support this feature) in multiple languages, the IA query logs we possess, and the query logs of the Memento aggregator. For this project, we plan to take a more comprehensive approach. Below we outline the deliverables for this project. We plan to work extensively with the IIPC partners from the prior LANL project, as well as identify new partners (possibly from outside the IIPC) who will also have an interest in the outcome of these efforts. We will also investigate and document additional architectures, approaches, and assumptions regarding aggregation, and based on the lessons learned from this project we anticipate submitting a more expansive project proposal to the NSF or a similar agency.

DELIVERABLE 1: STATIC PROFILING THROUGH CDX FILES

We will begin with the investment already made in the prior LANL project, in which CDX files were transmitted from the archives to LANL, and explore the options for profiling the archives from the CDX files we already have. We will also use these directly measured values as ground truth for the sampling techniques described in Deliverable 2.
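
A minimal sketch of the kind of CDX summarization we have in mind appears below. It assumes the common 11-field CDX layout (" CDX N b a m s k r M S V g") and simply tallies captures per TLD and per year; these are only two of the profile dimensions we intend to explore.

```python
from collections import Counter
from urllib.parse import urlparse

def summarize_cdx(path):
    """Tally captures per TLD and per year from a single CDX file."""
    tlds, years = Counter(), Counter()
    with open(path, encoding="utf-8", errors="replace") as cdx:
        for line in cdx:
            if line.startswith(" CDX") or line.startswith("CDX "):
                continue                                # header line naming the fields
            fields = line.split()
            if len(fields) < 3:
                continue
            timestamp, original = fields[1], fields[2]  # b = 14-digit timestamp, a = original URI
            host = urlparse(original).hostname or ""
            if host:
                tlds[host.rsplit(".", 1)[-1]] += 1
            if len(timestamp) >= 4 and timestamp[:4].isdigit():
                years[timestamp[:4]] += 1
    return tlds, years

tlds, years = summarize_cdx("sample.cdx")
print(tlds.most_common(5), sorted(years.items()))
```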

In the later stages of this task, we will also provide interested IIPC members with small scripts, easy to install and run, that will analyze the CDX files in situ and transfer the analysis results to a central location. This will be an optional phase for interested IIPC members and, recognizing that CDX files likely reside in a hierarchy of storage, the analysis can be applied to as few or as many CDX files as the hosting archive wishes (depending on ease of access).
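
The "ship only the summary" step might look something like the sketch below; the submission endpoint and payload fields are hypothetical placeholders, since the actual transfer mechanism and format remain to be defined.

```python
import json
import urllib.request

def push_summary(archive_id, tlds, years, endpoint="https://profiles.example.org/submit"):
    """POST the (small) summary, rather than the CDX files themselves, to a central collector."""
    payload = json.dumps({
        "archive": archive_id,
        "tld_counts": dict(tlds),    # e.g. {"pt": 1200000, "com": 900000}
        "year_counts": dict(years),  # e.g. {"2008": 40000, "2009": 65000}
    }).encode("utf-8")
    request = urllib.request.Request(endpoint, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.status
```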

At this stage, we will also investigate other dimensions for profiling, such as classification (e.g., government, news), popularity, archive performance, and other features as determined by IIPC community interest and feasibility.

DELIVERABLE 2: USAGE PROFILING THROUGH SAMPLING LOG FILES

Of course, building up a profile based on CDX files is not always possible, and in some cases it might not be necessary. Furthermore, we will have to periodically revalidate our profiles to detect changes in archival coverage and crawling policy. For this purpose, we propose to use sampling techniques to determine the profile for an archive based on what people are actually looking for.

First, we will instrument the LANL Memento aggregator to monitor the requests that are sent to all repositories and the sizes of the TimeMaps that are returned. Together with the baseline information established in Deliverable 1, we will model how quickly a profile (similar to the abstract example above) can be built over time. For example, can we build a profile that is 95% complete in 6 months and/or 10,000 requests? How often should we revalidate the profiles to reflect new holdings? For example, in November 2013 the Icelandic Archive announced that their holdings now go back to 1996, so the 2005 boundary discovered earlier should automatically be moved.
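
One way to model this convergence, sketched below under the assumption that the Deliverable 1 baseline gives us the set of TLDs an archive actually holds, is to track what fraction of that baseline has been observed after each sampled request.

```python
from urllib.parse import urlparse

def coverage_over_time(requested_uris, baseline_tlds):
    """Yield (requests seen, fraction of baseline TLDs observed) after each request."""
    seen = set()
    for count, uri in enumerate(requested_uris, start=1):
        host = urlparse(uri).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        if tld in baseline_tlds:
            seen.add(tld)
        yield count, len(seen) / len(baseline_tlds)

baseline = {"com", "org", "uk", "pt", "is", "de"}          # from the Deliverable 1 summaries
requests_log = ["http://bbc.co.uk/", "http://example.com/", "http://publico.pt/"]
for n, fraction in coverage_over_time(requests_log, baseline):
    print(n, round(fraction, 2))
```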

To capture traffic that the Memento aggregator does not see, as well as to accommodate dark archives (i.e., archives that do not participate in the Memento aggregator because of intellectual property restrictions), we will develop a script that can be installed at a site with access to HTTP logs and that samples, anonymizes, and transfers the results to a secure location where we can process them. We will work with interested IIPC members to develop and evaluate these scripts.
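
Such a script could be as simple as the sketch below, which keeps a small random fraction of requests and replaces the client address with a salted hash before anything leaves the site; the common-log-format assumption and the salt handling are illustrative choices, not a specification.

```python
import hashlib
import random

SALT = b"per-site-secret"   # generated locally and never transmitted

def sample_and_anonymize(log_lines, rate=0.01):
    """Keep roughly `rate` of the requests; pseudonymize the leading client address field."""
    for line in log_lines:
        if random.random() >= rate:
            continue
        parts = line.split(" ", 1)
        if len(parts) != 2:
            continue
        client, rest = parts
        pseudonym = hashlib.sha256(SALT + client.encode("utf-8")).hexdigest()[:16]
        yield f"{pseudonym} {rest}"

with open("access.log") as source, open("sampled.log", "w") as sink:
    sink.writelines(sample_and_anonymize(source))
```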

DELIVERABLE 3: CONTENT PROFILING THROUGH SAMPLE URIS

The next sampling technique is to have an evolving, representative list of URIs and terms that we use to query different archives (most archives other than the Internet Archive support keyword search). Our initial URI sample for the TPDL paper was selected from URIs at dmoz.org, queries (in multiple languages) to archives that support full-text queries, and from access logs from the Internet Archive and the Memento aggregator. Despite a grand total of more than 70,000 URIs, we are not sure if this sample is large enough to allow IIPC partners to exchange useful information about their coverage.
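
For archives that expose Memento TimeMaps, the probing step might look like the sketch below. The Wayback-style TimeMap URL pattern is only an example (endpoints differ per archive), and counting datetime-bearing entries is a rough proxy for the number of mementos.

```python
import urllib.request

# Wayback-style TimeMap URL pattern; other archives expose different endpoints.
TIMEMAP = "https://web.archive.org/web/timemap/link/{uri}"

def memento_count(uri):
    """Approximate the memento count by counting datetime-bearing TimeMap entries."""
    try:
        with urllib.request.urlopen(TIMEMAP.format(uri=uri), timeout=30) as response:
            body = response.read().decode("utf-8", errors="replace")
    except Exception:
        return 0
    return sum(1 for line in body.splitlines() if 'datetime="' in line)

for sample_uri in ("http://www.bbc.co.uk/", "http://publico.pt/"):
    print(sample_uri, memento_count(sample_uri))
```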

In particular, we need to revisit the language and geographic coverage. For example, does the Portuguese archive really have more .com URIs than .pt URIs? Or is this simply because we did not uncover enough .pt URIs while creating our sample? The CDX files from Deliverable 1 will inform our choices, but we will synchronize with IIPC members to ensure adequate coverage for particular domains and languages.

In our JCDL 2011 paper, we found widely varying archival coverage depending on where one sampled. Sampling from dmoz.org favors archives, like the IA, that use dmoz.org as a seed list for their crawls. We should also sample from social media, such as links shared in Twitter and Facebook, shortened by bit.ly, etc. as well as sample from search engine results — and not just Google and Bing, but also regional Google sites, Baidu, Europeana, and other popular and culturally important sites outside of North America. Unfortunately, CDX files do not contain language information, so language profiles will have to be generated solely from sampling and not from CDX baselines.
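
Since language information has to come from sampling, a language profile could be accumulated along the lines of the sketch below; the third-party langdetect package is used here only as one possible detector (an assumption, not a project dependency), and any language-identification tool would do.

```python
from collections import Counter
from langdetect import detect   # pip install langdetect; any language identifier would do

def language_profile(sampled_texts):
    """Tally detected languages over the text extracted from sampled pages."""
    languages = Counter()
    for text in sampled_texts:
        try:
            languages[detect(text)] += 1
        except Exception:
            languages["unknown"] += 1
    return languages

print(language_profile(["Olá mundo, isto é um teste.", "Hello world, this is a test."]))
```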

For dark archives, we will again create scripts that allow interested members to perform the sampling via an “internal crawler” that synchronizes with the URI and term queries in our test collection, runs them internally, and pushes the results (via HTTP) to a place where we can retrieve them.

We will work with IIPC members to determine the sites from which to sample URIs. That is, instead of asking for specific URIs to include in the collection, we will ask for guidance on the regionally appropriate search engines, social media, directories, and other places from which to sample to ensure we have an evolving collection.

DELIVERABLE 4: PROFILE SERIALIZATION AND TRANSFER

The example profiles described at the beginning of this section are abstracted for illustration; the actual profile serializations will most likely be in JSON, possibly as an interlinked set of files using JSON for Linked Data (JSON-LD). We will investigate multiple serialization formats and synchronization strategies for suitability. Even though the formats are intended for machines to read and exchange, we intend to retain at least some human readability for ease of inspection. We will work with members of the IIPC community to define the semantics and syntax of the format. We will also investigate placing the individual archive profiles in a version-control hosting service, such as GitHub, where they can be shared and versioned.
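
To make the idea concrete, a profile serialization might look something like the sketch below; the field names and the JSON-LD @context are illustrative assumptions, not the format we will ultimately define with the community.

```python
import json

profile = {
    "@context": "https://example.org/archive-profile.jsonld",  # placeholder JSON-LD context
    "archive": "http://arquivo.pt/",
    "generated": "2014-06-01T00:00:00Z",
    "method": "cdx-summary",            # vs. "log-sampling" or "uri-sampling"
    "tlds": {"pt": 1200000, "com": 900000},
    "years": {"2008": 40000, "2009": 65000},
    "languages": {"pt": 0.71, "en": 0.22},
}
print(json.dumps(profile, indent=2))
```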

Project Initiators

Michael L. Nelson, Old Dominion University
Herbert Van de Sompel, Los Alamos National Laboratory
David Rosenthal, Stanford University