Improving the Dark and Stormy Archives Framework by Summarizing the Collections of the National Library of Australia

Project lead: Michael L. Nelson, Old Dominion University Department of Computer Science

Project partners: Los Alamos National Laboratory Research Library
National Library of Australia

Funding:
50,000 USD

Brief description of the project
Goals, outcomes, and deliverables
How the project furthers the IIPC strategic plan
Detailed description of the project
Project schedule of completion
Project updates
Final report

Resources:

GitHub repository (DSA)
Archive-It Utilities (AIU)
Off-Topic Memento Toolkit (OTMT)
MementoEmbed
Raintale
Hypercane

@StormyArchives

Brief description of the project

Our goal is to develop the Dark and Stormy Archives (DSA) Toolkit until it is generally applicable to all Memento-compliant web archives. We plan to pilot these modifications with the Memento-compliant archive of the National Library of Australia (NLA). The DSA Toolkit improves the summarization, abstracting, understanding, and sharing of collections of archived web pages. This extends work begun in the (now complete) IMLS grant “Combining Social Media Storytelling With Web Archives”. The five tools that make up the DSA Toolkit are: Archive-It Utilities (AIU), Off-Topic Memento Toolkit (OTMT), MementoEmbed, Raintale, and Hypercane. Our pilot with the NLA will serve as a launching point for future collaborations with other interested organizations.

The DSA framework’s premise is to employ “storytelling” techniques to summarize archival collections by choosing a small number of exemplar pages from the collection that best demonstrate what the collection is “about”. This approach is preferable to providing metadata about every page in the collection: a collection with hundreds of seeds and hundreds of mementos per seed quickly overwhelms conventional UIs. Rather than deploying esoteric or custom interfaces for summarizing the collection, we have demonstrated the potential of summarizing it with a small number of pages selected from the collection itself. The story type and the collection type influence which pages are selected and how they are arranged. We are optimistic that our collaboration will provide the additional insight necessary to develop future standards that enable sharing collection information.
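To make the scale argument concrete, the following is a minimal sketch of reducing a large collection to a handful of exemplars. It is a deliberately naive stand-in, not one of the selection algorithms Hypercane provides: the function names, the `example.org` and `archive.example` URLs, and the "middle memento per seed" rule are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def build_demo_collection(seeds=200, per_seed=150):
    """Build a made-up collection: seed URL -> list of (memento URL, capture time)."""
    start = datetime(2015, 1, 1)
    return {
        f"https://example.org/seed/{s}": [
            (f"https://archive.example/{s}/{i}", start + timedelta(days=i, hours=s))
            for i in range(per_seed)
        ]
        for s in range(seeds)
    }


def select_exemplars(collection, k):
    """Pick up to k mementos, at most one per seed (naive illustration only)."""
    candidates = []
    for _seed, mementos in sorted(collection.items()):
        if mementos:
            # Take the memento in the middle of this seed's timeline.
            ordered = sorted(mementos, key=lambda m: m[1])
            candidates.append(ordered[len(ordered) // 2])
    if k <= 0 or not candidates:
        return []
    # Stride evenly through the per-seed candidates, then cap at k.
    step = max(1, len(candidates) // k)
    picked = candidates[::step][:k]
    # Present the story in chronological order.
    return sorted(picked, key=lambda m: m[1])


demo = build_demo_collection()
total_mementos = sum(len(mementos) for mementos in demo.values())
story = select_exemplars(demo, 20)
print(f"{total_mementos} mementos reduced to {len(story)} exemplars")
```

With 200 seeds and 150 mementos per seed, the hypothetical collection holds 30,000 mementos, far more than a conventional listing can usefully show, while the resulting story holds 20.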

Goals, outcomes, and deliverables

The software suite is the core deliverable.

Archive-It Utilities (AIU) is a Python library that gathers seed URLs and metadata from web archive collections.
The Off-Topic Memento Toolkit (OTMT) identifies off-topic mementos in a collection.
MementoEmbed produces surrogates (e.g., cards, screenshots) of individual mementos.
Raintale leverages MementoEmbed to produce full stories of mementos suitable for publishing via static files or social media.
Hypercane provides tools for intelligently sampling mementos from a collection.

The lessons learned while updating AIU for NLA collections will provide insight for future information standards for web archive collections, eventually making AIU obsolete.

Raintale and Hypercane are command-line applications. We will develop graphical user interfaces so future archivists can use these two tools without needing a Unix command-line background.
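One way to add a graphical interface without disturbing the command line is to keep the logic in a shared function that both front ends call. The sketch below illustrates that idea only; `build_story_request`, `cli`, `gui_submit`, and the placeholder values `algorithm-a` and `template-a` are hypothetical and are not part of Hypercane or Raintale.

```python
import argparse


def build_story_request(collection_id, algorithm, template):
    """Shared core logic: validate the inputs and describe the work to do.

    Hypothetical placeholder; calls into the real tools would go here.
    """
    if not collection_id:
        raise ValueError("a collection identifier is required")
    return {"collection": collection_id, "algorithm": algorithm, "template": template}


def cli(argv):
    """Command-line front end: parse arguments, then call the shared core."""
    parser = argparse.ArgumentParser(description="hypothetical storytelling front end")
    parser.add_argument("collection_id")
    parser.add_argument("--algorithm", default="algorithm-a")
    parser.add_argument("--template", default="template-a")
    args = parser.parse_args(argv)
    return build_story_request(args.collection_id, args.algorithm, args.template)


def gui_submit(form_fields):
    """GUI front end: take the fields a form would supply, then call the same core."""
    return build_story_request(
        form_fields.get("collection_id", ""),
        form_fields.get("algorithm", "algorithm-a"),
        form_fields.get("template", "template-a"),
    )


from_cli = cli(["demo-collection", "--template", "template-b"])
from_gui = gui_submit({"collection_id": "demo-collection", "template": "template-b"})
print(from_cli == from_gui)
```

Because both front ends produce the same request, a change to the core logic reaches archivists using the GUI and those using the command line at the same time.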

The second deliverable will be an online location for storing the visualizations produced by this toolset. We will help NLA to integrate this location with their existing infrastructure.

How the project furthers the IIPC strategic plan

This project primarily supports the second of the three goals in the IIPC strategic plan: “To foster the development and use of common tools, techniques, and standards that enable the creation of international archives.” Specifically, under “short term actions” under “tool development”, it supports “To foster a rich and interoperable tool environment based on modular pieces of software and a consensus set of APIs for the whole web archiving chain”.

These tools are in development at ODU and LANL, but they are of interest to the broader community. They have been extensively evaluated and presented on GitHub, on our research blog, and in various papers and presentations at venues frequented by IIPC members and affiliates.

This project will broadly support the first (“enable the collection…”) and third (“encourage and support…”) of the IIPC goals as well. A variety of quality, de facto standard tools/formats/methods for creating web archive collections already exist (e.g., Heritrix, OpenWayback, WARC), and there are emerging tools for processing and introspecting on collections (e.g., Archives Unleashed and ArchiveSpark). This project focuses on continuing the development of these open-source tools that increase the utility of existing web archiving collections, allowing organizations to leverage their previous investment in creating their collections.

Detailed description of the project

We will share software project source code via GitHub, binaries via PyPI and Docker Hub (where appropriate), and any resulting datasets via Open Science Framework, Figshare, or GitHub. We will communicate our updates via presentations at conferences and workshops, blog posts, and social media. We will measure software completion by closed GitHub issues, and memento selection will be measured via ongoing Mechanical Turk experiments (cf. CIKM 2019 preprint).

As we improve each tool, we will produce incremental releases that include updates to documentation. For Raintale, MementoEmbed, and Hypercane, we will create our documentation using reStructuredText and publish it for the community via the Read the Docs platform.

Python is the base language for all software products in this suite, but we are not limiting the suite to tools from the Python ecosystem. Where possible, we want to incorporate other best-of-breed tools such as Node.js, the Archives Unleashed Toolkit, Stanford CoreNLP, NLTK, spaCy, and Squidwarc.

Project schedule of completion

Phase 1 Work – Understanding the NLA Infrastructure as a Pilot Web Archive

Shawn Jones, Martin Klein, Michael Nelson, Michele Weigle, Paul Koerbin, and Alex Osborne will be responsible for:

deciding which selection algorithms provided by Hypercane might work best for different types of collections
deciding which new file formats need to be supported by the DSA tools – currently only HTML pages and their embedded images are processed
determining which visualizations provided by Raintale will best meet the archivist’s goals
- social media works best for advertising and awareness but is only temporarily visible
- static resources, such as HTML output, provide more permanent content
designing rough sketches of the graphical user interface for Hypercane and Raintale to be used by archivists
determining possibilities for storing and displaying the resulting storytelling visualizations

By the end of Phase 1, we will have produced requirements and design documentation that will serve as a plan for the work to be completed in Phase 2.

Phase 2 Work – Development

Shawn Jones, Martin Klein, Michael Nelson, Michele Weigle, and a to-be-selected Old Dominion University graduate student will be responsible for:

developing the agreed-upon graphical user interfaces
- if web-based, such a GUI may need to address security concerns, with tools such as authentication, to prevent third parties from creating summaries that were not approved by the archivist
- a GUI would also provide support for all existing Hypercane/Raintale functionality, including the selection of templates and algorithms
- such a GUI should utilize a design pattern such as MVC that separates GUI presentation from the logic already present in the DSA Toolkit to preserve the existing separation of concerns and ensure that the command-line interface still functions
developing and testing the improved installation of the DSA Toolkit
- currently, the toolkit supports Docker, but not all users have access to Docker
- outside of Docker, some tools do not install well; for example, OTMT has installation issues on Microsoft Windows
- we must ensure that the installation is smooth for upgrades to new versions, which will be necessary for Phase 3 as the NLA reports issues and ODU/LANL addresses them with updated software
updating other aspects of the DSA Toolkit to support the NLA, including:
- accounting for different memento presentations
  - all of NLA’s mementos reside behind HTML frames – we will need to update Hypercane, MementoEmbed, and the OTMT to account for this during processing
  - these changes will help future DSA Toolkit users who want to summarize web archives that present their mementos using this framing method
- updating AIU to provide NLA collection metadata
- updating Hypercane to accept NLA collection identifiers as input for discovering mementos
- updating Raintale, if needed, to support this newly available metadata
- we will need to develop Raintale templates to format the output in the formats requested by the NLA
- improving Raintale’s video storytelling capabilities
- updating MementoEmbed to support generating visualizations for file formats other than HTML
updating Hypercane with the discussed selection algorithms for different types of collections
- Hypercane already implements AlNoamany’s Algorithm, but many others are possible
- Hypercane separates the algorithmic primitives so that users can create many different types of algorithms – we would need to develop “wrappers” for these steps
- based on feedback from the NLA, we may need to create new primitives
development testing with public NLA collections
development regression testing with currently supported web archives (such as Archive-It, the Internet Archive, UKWA, and Arquivo.pt) to ensure that adapting the tools to work with the NLA does not create problems for working with other web archives
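The idea of wrapping separate algorithmic primitives into a named algorithm, mentioned in the Hypercane items above, can be pictured with a small sketch. The primitive names, the `compose` helper, and the record fields below are assumptions chosen for illustration; they are not Hypercane's actual commands or data model, and the toy pipeline is not AlNoamany's Algorithm.

```python
from functools import reduce


def compose(*steps):
    """Wrap a sequence of primitives into a single callable algorithm."""
    def algorithm(mementos):
        # Feed the output of each primitive into the next one.
        return reduce(lambda data, step: step(data), steps, mementos)
    return algorithm


def drop_off_topic(mementos):
    """Filter primitive: keep only mementos flagged as on-topic."""
    return [m for m in mementos if m["on_topic"]]


def best_per_seed(mementos):
    """Selection primitive: keep the highest-scoring memento for each seed."""
    best = {}
    for m in mementos:
        current = best.get(m["seed"])
        if current is None or m["score"] > current["score"]:
            best[m["seed"]] = m
    return list(best.values())


def order_by_datetime(mementos):
    """Ordering primitive: sort by capture date (ISO 8601 strings sort correctly)."""
    return sorted(mementos, key=lambda m: m["datetime"])


# A "wrapper" is just a named composition of primitives.
toy_algorithm = compose(drop_off_topic, best_per_seed, order_by_datetime)

sample = [
    {"urim": "m1", "seed": "a", "datetime": "2019-03-01", "score": 0.4, "on_topic": True},
    {"urim": "m2", "seed": "a", "datetime": "2019-01-15", "score": 0.9, "on_topic": True},
    {"urim": "m3", "seed": "b", "datetime": "2018-07-04", "score": 0.7, "on_topic": True},
    {"urim": "m4", "seed": "b", "datetime": "2018-09-09", "score": 0.95, "on_topic": False},
    {"urim": "m5", "seed": "c", "datetime": "2019-02-02", "score": 0.1, "on_topic": False},
]

selected = [m["urim"] for m in toy_algorithm(sample)]
print(selected)
```

Under this arrangement, supporting a new collection type means either reordering existing primitives in a new wrapper or, if feedback calls for it, adding one new primitive without touching the others.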

By the end of Phase 2, we will have improved software products to share with the community.

Phase 3 Work – Integration at the NLA

Paul Koerbin and Alex Osborne will be responsible for:

providing the agreed-upon visualization target location, including, but not limited to, tasks such as setting up permissions and allocating disk space
providing a place to run the DSA Toolkit tools for NLA personnel
exercising the DSA toolkit within the NLA infrastructure and submitting issues against the DSA toolkit for ODU/LANL to evaluate
negotiating the priority of these issues with ODU/LANL
final acceptance testing based on feedback from ODU and the resolution of these issues

Shawn Jones, Martin Klein, Michael Nelson, Michele Weigle, and a to-be-selected Old Dominion University graduate student will be responsible for:

keeping a backlog of issues to be addressed
evaluating the issues submitted by the NLA
negotiating the priority of these issues with the NLA
updating the software within the DSA Toolkit to fix these issues
ensuring that fixes for NLA do not impact current functionality for other Memento-compliant web archives

Phase 4 – Wrap Up

We will synthesize the community input, finalize the software, and prepare relevant publications for venues such as the Web Archiving and Digital Libraries (WADL) workshop and the IIPC WAC.

PROJECT SCHEDULE OF COMPLETION:
Phase 1 – Understanding the NLA Infrastructure and Needs – January to March

NLA/ODU/LANL will conduct meetings on different aspects related to this project
decisions about the target installation environments to support NLA while still supporting the existing Docker capabilities
meeting to discuss the potential Hypercane algorithms that could apply to NLA collections
meeting to discuss Raintale visualization targets, including social media targets and output file formats
offline discussions about the Raintale and Hypercane GUI designs that might work best for NLA administrators
preliminary meeting to coordinate aspects of Phase 3 so Phase 2 can proceed with this in mind
meeting to record overall requirements and agree upon work for Phase 2

Phase 2 – Development – March to August

development and testing of the improved installation of the DSA Toolkit – this must happen first because it will be used by the rest of Phase 2 and Phase 3
development of graphical user interfaces for Hypercane and Raintale
resolution of issues encountered during development
testing to ensure that new NLA work does not impact the existing functionality of the DSA Toolkit
updates to the DSA Toolkit to address NLA’s mementos
development of templates for Raintale that address NLA’s needs
updates to AIU to provide NLA collection metadata
updates to MementoEmbed to support file formats other than HTML, such as PDF
improvements to Raintale’s video storytelling capabilities, if necessary
preliminary integration testing to detect easily diagnosed issues before Phase 3
meetings with NLA to discuss and demonstrate project progress and receive preliminary feedback
final integration testing before moving to Phase 3
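The updates for mementos that sit behind HTML frames, listed among the Phase 2 items above, amount to locating the framed document before processing it. The sketch below shows one way to do that with Python's standard library. The wrapper markup, the `archive.example` URLs, and the helper names are assumptions; the source does not describe the NLA's actual page structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceParser(HTMLParser):
    """Collect the src attribute of every <iframe> or <frame> element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def framed_memento_url(wrapper_html, wrapper_url):
    """Return the URL of the first framed document, or wrapper_url if none exists."""
    parser = FrameSourceParser()
    parser.feed(wrapper_html)
    if not parser.sources:
        return wrapper_url
    # Resolve the frame source against the wrapper page's own URL.
    return urljoin(wrapper_url, parser.sources[0])


# Hypothetical wrapper page: an archive banner plus the memento in an iframe.
wrapper = (
    '<html><body><div id="banner">Archive banner</div>'
    '<iframe src="https://archive.example/replay/20200101000000/http://example.com/">'
    "</iframe></body></html>"
)
wrapper_url = "https://archive.example/view/20200101000000/http://example.com/"

content_url = framed_memento_url(wrapper, wrapper_url)
print(content_url)
```

Returning the wrapper URL unchanged when no frame is present keeps the same code path usable for archives that serve mementos directly, which matters for the regression testing against other web archives.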

Phase 3 – Integration with the NLA – August to November

NLA personnel test the updated DSA Toolkit and submit issues
ODU/LANL developers prioritize and address issues by negotiating priorities with NLA
final acceptance testing by NLA

Phase 4 – Wrap Up – October to December